
=== Multi Queue ===

A device MUST support from 1 up to 1024 logical queues per device. The number of queues MUST be host configurable. It is acceptable to require the device link to be brought down to reconfigure the queue count.

<span id="independent-transmit-and-receive-queues"></span>
===== Independent Transmit and Receive Queues =====

The number of Tx and Rx queues MUST be configurable independently. No relationship between the two should be assumed.

Receive and transmit processing incur CPU cycle cost in different ways. The interrupt moderation section compares interrupt frequency trade-offs. Transmit cost is sensitive to cacheline and lock contention when a Tx queue is shared between CPUs, which does not happen on receive. Thus Tx and Rx can have different optimal numbers of queues.

<span id="flow-steering"></span>
==== Flow Steering ====

With multiple receive queues, the network interface needs to implement queue selection. It MUST support RSS load balancing and MAY advertise accelerated RFS or programmable flow steering. If it advertises either, then that implementation MUST follow the feature requirements defined here.

<span id="receive-side-scaling"></span>
===== Receive Side Scaling =====

A device MUST support load balancing with flow affinity using Receive Side Scaling (RSS). This algorithm combines (a) field extraction rules for packet steering with flow affinity, (b) a hash function for uniform load balancing that incorporates a secondary hash input for DoS resistance and (c) an indirection table to optionally implement non-uniform weighted load balancing.

<span id="field-extraction"></span>
====== Field Extraction ======

Queue selection must be flow affine, forwarding all packets from a transport flow to the same queue, so that packets within a flow are not reordered. Transport protocol performance can degrade when packets arrive out of order, which is likely to happen with simpler round robin packet spraying.

RSS defines two rules to derive queue selection input in a flow-affine manner from packet headers. Selected fields of the headers are extracted and concatenated into a byte array. If the packet is IPv4 or IPv6, not fragmented, and followed by a transport layer protocol with ports, such as TCP and UDP, then extract the concatenated 4-field byte array { source address, destination address, source port, destination port }. Else, if the packet is IPv4 or IPv6, extract 2-field byte array { source address, destination address }. IPv4 packets are considered fragmented if the more fragments bit is set or the fragment offset field is non-zero.
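
The two extraction rules above can be sketched as follows. This is a minimal illustration, not a normative implementation: it assumes the packet headers have already been parsed, and the names <code>struct pkt_info</code> and <code>rss_extract</code> are hypothetical. It covers only the basic rules, without the optional flowlabel or inner-header extensions.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical pre-parsed view of a packet; fields in network byte order. */
struct pkt_info {
    int is_ip;          /* an IPv4 or IPv6 header is present */
    int is_fragment;    /* more fragments bit set or fragment offset non-zero */
    int has_ports;      /* transport protocol with ports, such as TCP or UDP */
    size_t addr_len;    /* 4 for IPv4, 16 for IPv6 */
    uint8_t saddr[16];
    uint8_t daddr[16];
    uint8_t sport[2];
    uint8_t dport[2];
};

/* Build the RSS input byte array in out, which must hold at least 36 bytes.
 * Returns the array length in bytes, or 0 if neither rule applies. */
static size_t rss_extract(const struct pkt_info *p, uint8_t *out)
{
    size_t len = 0;

    if (!p->is_ip)
        return 0;

    /* Both rules start with { source address, destination address }. */
    memcpy(out + len, p->saddr, p->addr_len);
    len += p->addr_len;
    memcpy(out + len, p->daddr, p->addr_len);
    len += p->addr_len;

    /* 4-field rule: only for unfragmented packets that carry ports. */
    if (!p->is_fragment && p->has_ports) {
        memcpy(out + len, p->sport, 2);
        len += 2;
        memcpy(out + len, p->dport, 2);
        len += 2;
    }

    return len;
}
```

In this sketch an unfragmented IPv4 TCP packet yields 12 bytes, an IPv4 fragment yields 8 bytes, and an unfragmented IPv6 TCP packet yields the maximum of 36 bytes.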

If a packet contains multiple IPv4 or IPv6 headers, then RSS operates on the first IPv4 or IPv6 header and the immediately following transport header, if any. The IPv6 flowlabel field may also be included. If present, this MUST be placed immediately before the source address in the byte array.

The same fields from subsequent IPv4/IPv6 and transport headers MAY be appended to the byte array, if present. These extensions are optional and MUST be configurable if supported. A basic version of RSS without optional extensions MUST always be supported, to be able to perform explicit flow steering by reversing the algorithm.

<span id="toeplitz-hash-function"></span>
====== Toeplitz Hash Function ======

The device MUST support the Toeplitz hash function~[ref_id:toeplitz_hash] for Receive Side Scaling~[ref_id:ms_rss]. A hash function maps the byte array onto a 32-bit number with significant entropy to serve as effective input for uniform load balancing.

The Toeplitz function takes two inputs, the byte array derived from the packet P and a second byte array S of at least 40 bytes that is constant between packets. S, called the secret, is mixed into the entropy for DoS prevention. It makes queue prediction hard for a given packet unless the secret is known. Toeplitz trades off execution speed and security: it is not a cryptographically secure hash function. The secret MUST be readable and writable from the host. Explicit queue prediction has legitimate use cases, so the secret must be discoverable by trusted parties.

The secret S is converted to a Toeplitz matrix Sm, a matrix in which each left-to-right descending diagonal is constant. Due to this property, a Toeplitz matrix is fully defined by its first row and column. An N * M Toeplitz matrix is defined by an N + M - 1 length source vector.

S has to be long enough to match against each bit in the longest possible RSS input vector. That is two 16B IPv6 addresses plus two 2B ports, for a total of 36B == 288b.

A minimal RSS Toeplitz matrix is a binary Toeplitz matrix of 288 x 32 bits. 288 is one row for each bit in P. 288 + 32 - 1 == 319 bits to define the matrix establishes the minimum 40B required length of secret S. The minimum secret key length MUST be no less than 40B. Due to optional extended inputs, larger secrets MAY be supported. A range of 40-60B is common. Matching MUST always begin at bit zero, regardless of configured key length. The first row consists of the left-most 32 bits of the array. The remainder define the first bit of each subsequent row.

The Toeplitz hash function performs a scalar multiplication between the Toeplitz matrix Sm and the input array P. Each bit P<sub>i</sub> of P is multiplied with the 32b row Sm<sub>i</sub>. The output array O is converted to a scalar value by an XOR of all elements in O. The reference implementation below demonstrates the algorithm. Care must be taken with endianness and bit order (traverse each byte from MSB to LSB). See Appendix B for validation.

<pre>    uint32_t toeplitz(const unsigned char *P,
                      const unsigned char *S)
    {
        uint32_t rxhash = 0;
        int bit;

        /* bits are numbered MSB first within each byte of P and S */
        for (bit = 0; bit &lt; 288; bit++)
            if (test_bit(P, bit))
                rxhash ^= word_at_bit_offset(S, bit);

        return rxhash;
    }</pre>
The device MAY support other hash functions besides Toeplitz. In that case, function selection must be configurable.
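
The reference implementation relies on two helpers, <code>test_bit</code> and <code>word_at_bit_offset</code>, that are not defined in this section. One possible way to write them is sketched below, assuming MSB-first bit numbering as described above. The length-aware variant <code>toeplitz_n</code> is a hypothetical addition for inputs shorter than 288 bits. It is equivalent to zero-padding P, because zero bits do not contribute to the hash.

```c
#include <stddef.h>
#include <stdint.h>

/* Return bit number `bit` of array P, counting each byte from MSB to LSB. */
static int test_bit(const unsigned char *P, size_t bit)
{
    return (P[bit / 8] >> (7 - bit % 8)) & 1;
}

/* Return the 32-bit window of S that starts at bit offset `bit`, MSB first.
 * Always reads 5 bytes starting at S[bit / 8], so S must extend that far.
 * For the last bit of a 288-bit input this reads S[35..39]: a 40B secret. */
static uint32_t word_at_bit_offset(const unsigned char *S, size_t bit)
{
    uint64_t window = 0;
    size_t i;

    for (i = 0; i < 5; i++)
        window = (window << 8) | S[bit / 8 + i];

    /* window holds 40 bits: drop (bit % 8) leading bits, keep the next 32. */
    return (uint32_t)(window >> (8 - bit % 8));
}

/* Hypothetical length-aware variant: hash the first p_bits bits of P. */
static uint32_t toeplitz_n(const unsigned char *P, size_t p_bits,
                           const unsigned char *S)
{
    uint32_t rxhash = 0;
    size_t bit;

    for (bit = 0; bit < p_bits; bit++)
        if (test_bit(P, bit))
            rxhash ^= word_at_bit_offset(S, bit);

    return rxhash;
}
```

Because the hash is an XOR of key windows, it is linear: the hash of the XOR of two inputs equals the XOR of their hashes. This gives a quick sanity check, but it does not replace validation against known test vectors.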

<span id="receive-hash"></span>
====== Receive Hash ======

The computed 32b hash SHOULD be passed to the host alongside the packet. Doing so allows the host to perform additional flow steering without having to compute a hash in software, such as Linux Receive Flow Steering (RFS).

A device MAY compute a 64b field to reduce collisions. It MAY communicate this field instead, as long as the 32b Toeplitz hash either can be derived from it or is communicated alongside it.

<span id="indirection-table"></span>
====== Indirection Table ======

The device MUST select a queue by reducing the hash through modulo arithmetic. It applies division to the hash value and uses the remainder as an index into a fixed number of resources. The divisor is not simply the number of receive queues. RSS specifies an additional level of indirection, the indirection table. This allows for non-uniform load balancing. The device MUST look up a queue using the following modulo operation:

<pre>queue_id = rss_table[rxhash % rss_table_length];</pre>
The table MUST be host-readable and writable. The host may configure the table with fewer slots than the configured number of receive queues, if the host wants to apply RSS to only a subset of queues. The host may configure the table with more slots than the number of receive queues, for more uniform load balancing. The device may limit the maximum supported table size. The minimum supported indirection table size MUST be <s>the number of supported receive queues</s>. The minimum SHOULD be at least 4 times the number of supported receive queues. The device SHOULD allow querying the maximum supported table size by the host. The device SHOULD allow replacement of the indirection table without pausing network traffic or bringing the device down, to support dynamic rebalancing, e.g., based on CPU load.
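
As an illustration of non-uniform load balancing, the sketch below shows one way a host could fill the indirection table from per-queue weights. The function names and the contiguous slot layout are assumptions made for this example. This specification does not prescribe how the host chooses table contents.

```c
#include <stddef.h>
#include <stdint.h>

/* Fill the table so that queue q owns a share of slots proportional to
 * weight[q]. Slots are assigned in contiguous runs (an arbitrary choice). */
static void rss_table_fill_weighted(uint16_t *table, size_t table_len,
                                    const unsigned *weight, size_t n_queues)
{
    unsigned total = 0;
    size_t q, i;

    for (q = 0; q < n_queues; q++)
        total += weight[q];
    if (total == 0)
        return;

    for (i = 0; i < table_len; i++) {
        /* Map slot i onto the cumulative weight range [0, total). */
        unsigned point = (unsigned)(i * total / table_len);
        unsigned acc = 0;

        for (q = 0; q < n_queues; q++) {
            acc += weight[q];
            if (point < acc)
                break;
        }
        table[i] = (uint16_t)q;
    }
}

/* Queue selection as specified: index the table by hash modulo its length. */
static uint16_t rss_select_queue(const uint16_t *table, size_t table_len,
                                 uint32_t rxhash)
{
    return table[rxhash % table_len];
}
```

With weights 3:1 and an 8-slot table, this layout assigns the first six slots to queue 0 and the last two to queue 1. A larger table allows finer-grained weights.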

<span id="accelerated-rfs"></span>
===== Accelerated RFS =====

RSS does not maintain per-flow state. A device MAY also implement the stateful Accelerated RFS (ARFS) algorithm, which explicitly records a preferred queue for a given flow hash. If the device advertises this feature, it MUST be implemented as described in this section.

In Linux, Receive Flow Steering (RFS) is a software algorithm that steers receive processing of a packet to the CPU that last ran an application thread for the same flow. It identifies the flow that a packet belongs to by a flow hash. Optionally and preferably, that is the RSS hash received from the device.

RFS introduces a map from flow hash to CPU. When an application thread interacts with a flow, the host stores the CPU ID in <code>rfs_table[hash % rfs_table_length]</code>. When the host processes a packet from the receive queue, it looks up this table entry, queues the packet on a host queue for the given CPU and sends an inter-processor interrupt (IPI) to trigger receive processing on the CPU affine with the application thread.
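
The table described above can be sketched as follows. This is a simplified illustration and not the Linux implementation: the table size, the <code>RFS_NO_CPU</code> marker and the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define RFS_TABLE_LEN 256      /* hypothetical table size */
#define RFS_NO_CPU    0xFFFFu  /* hypothetical marker: no CPU recorded */

struct rfs_table {
    uint16_t cpu[RFS_TABLE_LEN];
};

static void rfs_init(struct rfs_table *t)
{
    size_t i;

    for (i = 0; i < RFS_TABLE_LEN; i++)
        t->cpu[i] = RFS_NO_CPU;
}

/* Called when an application thread on `cpu` interacts with the flow. */
static void rfs_record(struct rfs_table *t, uint32_t hash, uint16_t cpu)
{
    t->cpu[hash % RFS_TABLE_LEN] = cpu;
}

/* Called on receive: returns the recorded CPU, or RFS_NO_CPU if none. */
static uint16_t rfs_lookup(const struct rfs_table *t, uint32_t hash)
{
    return t->cpu[hash % RFS_TABLE_LEN];
}
```

Flows whose hashes are equal modulo the table length share an entry, so a lookup can return a CPU that was recorded for a different flow. A larger table reduces such collisions.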

Accelerated RFS moves the RFS table to the device. This directly wakes the RFS affine CPU, skipping over RSS and IPI. The feature can be implemented with an explicit lookup table as described, or as a list of match/action rules that match on a hash or its source fields. In all cases, the action is to queue the packet on a specific queue or RSS context (see below). The host is responsible for storing a queue ID that results in interrupt processing on the same CPU as recorded at the application layer.

If ARFS is supported, regardless of implementation, the device MUST present a match/action API with match on L4 hash and queue selection action. It may offer an API that inserts and/or removes multiple rules at once.

If ARFS is enabled and an ARFS match for a hash is found, then this takes precedence over RSS. Else the device MUST fall back onto RSS.

ARFS is not suitable for all workloads. If connection churn or thread migration is high, it can introduce significant table management communication across the PCI bus.

<span id="self-learning-arfs"></span>
====== Self-learning ARFS ======

ARFS may alternatively be implemented entirely on the device. In this case the device programs the match/action table for ingress matching based on sampling of egress traffic. This requires matching a transmit queue to a receive queue and thus assumes an M:1 mapping of transmit to receive queues. Care must be taken to ensure that self-learning ARFS does not cause packet reordering within a flow.

<span id="programmable-flow-steering"></span>
===== Programmable Flow Steering =====

A device MAY support more complex match rules for flow steering. ARFS matching by hash can be seen as one instance of a broader match language, which may match on:

* Ethernet source and/or destination address, protocol
* VLAN identifier
* MPLS label
* IPv4/IPv6 source and/or destination address
* IPv4 other header fields, including ToS, protocol
* IPv6 other header fields, including Traffic Class, next header
* UDP/TCP ports
* TCP flags
* Opaque data+mask bit arrays, with fixed offset and length.

If IPv4 flow steering is supported, then IPv6 MUST also be supported.

If flow steering rules are inserted and a rule matches a packet then this MUST take precedence over RSS and ARFS.
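
The resulting order of precedence between programmable flow steering, ARFS and RSS can be summarized in a short sketch. The type and function names are hypothetical, and the three lookups themselves are assumed to have been performed already.

```c
#include <stdint.h>

/* Hypothetical lookup result: matched != 0 means a rule applied. */
struct steer_match {
    int matched;
    uint16_t queue;
};

/* Pick the receive queue: flow steering rule first, then ARFS, then RSS. */
static uint16_t select_rx_queue(struct steer_match flow_rule,
                                struct steer_match arfs,
                                uint16_t rss_queue)
{
    if (flow_rule.matched)
        return flow_rule.queue;
    if (arfs.matched)
        return arfs.queue;
    return rss_queue;   /* fallback when nothing else matches */
}
```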

<span id="rss-contexts"></span>
====== RSS contexts ======

Flow steering MAY be combined with RSS to steer a flow not to a single queue, but to a group of receive queues: an RSS context.

A device MAY support multiple RSS contexts, plus extended match/action rules that select an RSS context instead of a queue. Each context is then configured as described in the RSS section. In RSS configuration, a context is selected through an additional numeric ID. The RSS context applied to traffic that does not match any of the explicit flow steering rules MUST be context 0.

It is not required that queues are exclusively assigned to a single RSS context. The same queue may appear in the indirection tables of multiple contexts.

RSS context 0 MUST be able to program the complete RSS indirection table if no other contexts are in use. Other contexts MUST be able to program tables of up to 64 entries each. RSS context 0 SHOULD be able to program all table entries not in use by other contexts.

<span id="transmit-packet-scheduling"></span>
==== Transmit Packet Scheduling ====

When multiple Tx queues have packets outstanding, the device must choose in which order to service the queues.

The device MUST implement equal weight deficit round robin (DRR) as the default dequeue algorithm [ref_id_fq_drr]. Deficit round robin is a per-byte algorithm. Time is divided in rounds. Each queue earns a constant number of byte credits during each round, its quantum. The device services queues in round robin order. If a queue has data outstanding when it is scanned, the device sends packets from the head of the queue for as long as each packet fits in the queue’s available credit, and reduces the credit accordingly. If one or more packets cannot be sent because the packet at the head of the queue is longer than the remaining credit, then the remaining credit carries over to the next round. If the queue is empty at the end of a round, the remaining credit is reset to zero.
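
One round of equal weight DRR can be sketched as below. This is a simplified software model for illustration only: the queue representation, the <code>MAX_PKTS</code> limit and the function name are assumptions, and a device implementation will differ.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_PKTS 16   /* hypothetical per-queue capacity for this model */

struct txq {
    uint32_t pkt_len[MAX_PKTS]; /* FIFO of packet lengths in bytes */
    size_t head, tail;          /* head == tail means the queue is empty */
    uint32_t deficit;           /* byte credit carried over between rounds */
};

/* Service every queue once. Records the queue index of each packet sent in
 * sent_from (up to max_sent entries) and returns the number of packets sent. */
static size_t drr_round(struct txq *q, size_t n_queues, uint32_t quantum,
                        uint8_t *sent_from, size_t max_sent)
{
    size_t i, n = 0;

    for (i = 0; i < n_queues; i++) {
        if (q[i].head == q[i].tail) {
            q[i].deficit = 0;       /* idle queues do not accumulate credit */
            continue;
        }

        q[i].deficit += quantum;    /* earn this round's credit */

        /* Send while the head packet fits in the available credit. */
        while (q[i].head != q[i].tail && n < max_sent &&
               q[i].pkt_len[q[i].head] <= q[i].deficit) {
            q[i].deficit -= q[i].pkt_len[q[i].head++];
            sent_from[n++] = (uint8_t)i;
        }

        if (q[i].head == q[i].tail)
            q[i].deficit = 0;       /* drained: drop leftover credit */
    }

    return n;
}
```

With a quantum of 1500 bytes, a queue holding two 1000 byte packets sends one packet in the first round and carries 500 bytes of credit into the next round, where it sends the second packet. Non-equal weights could be modeled by giving each queue its own quantum.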

The device SHOULD also support DRR with non-equal weights. Then it MUST support host configuration of the weights. This specification does not prescribe a specific interface to program the weights. In Linux, this feature does not have a standard API as of writing.

The device MAY offer additional algorithms. If strict priority is supported, it SHOULD implement this mode with starvation prevention.

<span id="offloads"></span>