=== Queues ===

The device and driver communicate through descriptor queue buffers in host RAM. The format of individual descriptors is device dependent and out of scope for this document.

For host-to-device communication, the descriptor queue is paired with an MMIO-writable doorbell register exposed through a PCI BAR: the host writes its producer index to the doorbell to notify the device of new data. For device-to-host communication, the device may write its producer index to an agreed-on location in host RAM. Alternatively, the explicit producer index may be replaced by a generation bit embedded in each packet descriptor, which the host uses to detect the head of the queue.

Each queue MUST be able to hold at least 4096 single-buffer packets at a time. The exact length SHOULD be configurable; where it is, it MUST be host configurable. The device SHOULD support queue reconfiguration while the link remains up. The asynchronous queue is associated with an IRQ for optional interrupt-driven processing.

<span id="post-and-completion-queues"></span>

==== Post and Completion Queues ====

Each logical queue SHOULD consist of a pair of post and completion queues. Post queues convey buffers in host RAM from host to device. Completion queues return ready events from device to host.

Post queues MAY be shared between logical queues, such that a single post queue can supply multiple receive queues. Completion queues MAY be shared between logical queues, such that a single completion queue can return buffers posted to the device from multiple post queues. If shared queues are supported, then sharing MUST be an optional feature. All unqualified declarations of the supported number of queues MUST be calculated with no sharing.
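As a concrete illustration of the generation-bit scheme described under Queues, the following is a minimal, non-normative sketch of host-side completion polling. The descriptor layout and all names (<code>cmpl_desc</code>, <code>cq_poll</code>, and so on) are hypothetical, since descriptor formats are device dependent; memory-barrier and error handling are elided.

<syntaxhighlight lang="c">
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical completion descriptor; real layouts are device
 * dependent and out of scope for this document. */
struct cmpl_desc {
    uint64_t buf_addr;  /* host address of the filled buffer */
    uint16_t len;       /* bytes written by the device */
    uint16_t flags;     /* bit 0: generation bit */
    uint32_t rsvd;
};

#define CMPL_GEN 0x1u

struct cmpl_queue {
    volatile struct cmpl_desc *ring; /* DMA-coherent ring in host RAM */
    uint32_t size;                   /* entries; at least 4096 */
    uint32_t head;                   /* next index the host reads */
    uint16_t gen;                    /* expected generation value */
};

/* Poll one completion. The device writes each descriptor with the
 * current generation value, which flips every time the ring wraps.
 * A descriptor whose generation bit does not match the expected
 * value marks the head of the queue, so no device producer index
 * needs to be read from RAM. */
static bool cq_poll(struct cmpl_queue *cq, struct cmpl_desc *out)
{
    volatile struct cmpl_desc *d = &cq->ring[cq->head];

    if ((d->flags & CMPL_GEN) != cq->gen)
        return false;               /* head reached: nothing new */

    /* A real driver issues a read barrier here so the payload
     * fields are not read before the generation check. */
    out->buf_addr = d->buf_addr;
    out->len      = d->len;
    out->flags    = d->flags;

    if (++cq->head == cq->size) {   /* wrap: expectation flips */
        cq->head = 0;
        cq->gen ^= CMPL_GEN;
    }
    return true;
}
</syntaxhighlight>

Because the expected generation value flips on every wrap, a descriptor left over from the previous pass can never match the current expectation, which is what lets the generation bit stand in for an explicit device producer index.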
<span id="scatter-gather-io"></span>

==== Scatter-Gather I/O ====

The device MUST support scatter-gather I/O: transmitted packets may consist of multiple discrete host memory buffers. The device MUST support a minimum of (MTU / PAGE_SIZE) scatter-gather memory buffers for MTU-sized packets, rounded up to the nearest natural number, plus a separate header buffer. For packets with segmentation offload (see below), the device MUST support this number times the maximum number of supported segments, with an absolute minimum of 17 (the number of 4KB pages a 64KB TSO packet can span when it is not page-aligned), again plus a separate header buffer.

For the receive case, the host may choose to post buffers smaller than the MTU to the receive queue. The device MUST support the same limits as for transmit queues: the absolute minimum of 2 buffers per packet and the relative minimum of (MTU / PAGE_SIZE) in the general case, and the absolute minimum of 17 and the relative minimum of N * (MTU / PAGE_SIZE) for large packets produced by Receive Segment Coalescing (RSC, below), where N is the maximum number of coalesced segments.

''Optimization: RAM Conservation''

A device MAY support scatter-gather I/O with multiple buffer sizes, that is, it MAY support the driver posting buffers of more than one size. One approach stripes buffers of the expected header and payload sizes in the same post queue. Another associates multiple post queues with a single receive completion queue, where each post queue supplies buffers of a single size. The device then selects, for each packet, the smallest buffer or buffers suitable for storing it. A practical example is supporting 9K jumbo frames in environments where the majority of traffic consists of standard 1500B frames and smaller pure-ACK packets. (A sketch of this selection policy appears at the end of this section.)

The device MAY also support sharing post queues among receive completion queues. This mitigates scale-out cost: in receive processing, buffers have to be posted to the device in anticipation of packet arrival, so with many queues the total posted memory adds up to a significant amount of RAM allocated to the device.

Devices MAY also support an “emergency reserve” queue: a single extra queue of buffers available for use by any receive queue whose dedicated buffers are depleted. This allows the host to post fewer dedicated buffers while avoiding the risk of transient traffic bursts leading to drops.

<span id="receive-header-split"></span>

===== Receive Header-Split =====

A device SHOULD support the special case of receive scatter-gather I/O that splits protocol headers from application-layer payload. It SHOULD be possible to allocate header and data buffers from separate memory pools. All protocol header buffers for an entire queue may be allocated as one contiguous DMA region, to minimize IOTLB pressure. In this model the host operating system copies the headers out on packet reception, so the region need only hold exactly as many headers as there are descriptors in the queue.

Header-split allows direct data placement (DDP) of application payload into user or device memory (e.g., GPUs), while protocol headers are processed by the host operating system. The operating system is responsible for ensuring that payload is not loaded into the CPU during protocol processing. Data is placed in posted buffers in the order that it arrives. Transport-layer in-order delivery in the context of DDP is out of scope for this spec.

Header-split SHOULD be implemented with protocol parsing to identify the start of payload. The protocol option space is huge in principle; this spec limits it to unencapsulated TCP/IP, which covers the majority of relevant datacenter workloads (crypto is deferred to a future version of the spec). Protocol parsing can fail for many reasons, such as encountering an unknown protocol type. The device MUST then allow falling back to splitting packets at a fixed offset. This offset SHOULD be host configurable.

Header-split MAY instead be implemented with support for a fixed offset only: Fixed Offset Split (FOS). This variant does not require protocol parsing and is thus simpler to implement. Workloads often have a common default protocol layout, such as Ethernet/IPv6/TCP/TSopt; splitting at 14 + 40 + 20 + 12 = 86 bytes correctly covers this modal layout, and with it the majority of packets arriving on a host. True header-split is strongly preferred over FOS, and is required at the advanced conformance level. If FOS is implemented, the offset MUST be host configurable. (A sketch of the parse-with-fallback logic appears at the end of this section.)

''PCIe Cache Aligned Stores''

Stores from device to host memory SHOULD be complete cache lines when possible. The device SHOULD store the last cacheline of a packet with padding to avoid a read-modify-write (RMW) cycle, and SHOULD do the same for headers when header-split is enabled. A partial write results in an RMW cycle across the PCIe bus, increasing latency and bus contention. With current Ethernet, PCIe, and memory speeds, this has been observed to cause significant bus contention and packet drops in practice. That behavior can escape synthetic network benchmarks, but is apparent in real-world deployments, where memory and PCIe see contention from other applications and devices besides networking.

<span id="interrupts"></span>
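To make the padded-store rule concrete, here is the store-length arithmetic it implies. This is a non-normative sketch assuming a 64-byte cacheline; the helper name is illustrative.

<syntaxhighlight lang="c">
#include <stdint.h>

#define CACHELINE 64u   /* assumed host cacheline size */

/* Round a device-to-host store length up to whole cachelines, so the
 * final partial line is written in full (with padding) instead of
 * triggering a read-modify-write cycle on the PCIe bus. The padding
 * lands inside the posted buffer, which must be sized to allow it. */
static inline uint32_t padded_store_len(uint32_t len)
{
    return (len + CACHELINE - 1) & ~(CACHELINE - 1);
}

/* Examples: a 1514B frame is stored as 24 full lines (1536B); an 86B
 * split header is stored as 2 full lines (128B). */
</syntaxhighlight>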
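Returning to ''Optimization: RAM Conservation'' above, the following is a minimal sketch of smallest-fitting-buffer selection across per-size post queues. The structures and names (<code>post_queue</code>, <code>pick_pool</code>) are illustrative, not part of the spec.

<syntaxhighlight lang="c">
#include <stddef.h>
#include <stdint.h>

/* Illustrative model: each post queue supplies buffers of one size,
 * e.g. a 256B pool for pure ACKs, a 1536B pool for standard frames,
 * and a 9216B pool for 9K jumbo frames. */
struct post_queue {
    uint32_t buf_size;  /* size of every buffer posted on this queue */
    uint32_t avail;     /* buffers currently posted and unused */
};

/* Pick the post queue whose buffers are the smallest that still fit
 * the received packet. Returns NULL if no single buffer fits; the
 * fallback would be multi-buffer placement or the emergency reserve. */
static struct post_queue *pick_pool(struct post_queue *pools, size_t n,
                                    uint32_t pkt_len)
{
    struct post_queue *best = NULL;

    for (size_t i = 0; i < n; i++) {
        if (pools[i].avail == 0 || pools[i].buf_size < pkt_len)
            continue;                       /* empty, or does not fit */
        if (best == NULL || pools[i].buf_size < best->buf_size)
            best = &pools[i];               /* smallest fitting pool */
    }
    return best;
}
</syntaxhighlight>

One appeal of per-size post queues over striping sizes within one queue is that the host can replenish each pool independently as the traffic mix shifts.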
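And for Receive Header-Split, a non-normative sketch of the parse-with-fallback split-offset computation for the unencapsulated Ethernet/IPv6/TCP case, written as host C for readability; a device would make the equivalent decision in hardware. The constant names are illustrative, and the fixed fallback offset is shown as the 86-byte modal layout but MUST be host configurable per the text above.

<syntaxhighlight lang="c">
#include <stdint.h>

enum {
    ETH_HLEN    = 14,     /* Ethernet header */
    ETH_P_IPV6  = 0x86DD, /* IPv6 ethertype */
    IPPROTO_TCP = 6,
    FOS_OFFSET  = 86,     /* Ethernet(14)+IPv6(40)+TCP(20)+TSopt(12) */
};

/* Return the header/payload split offset for one packet: parse
 * unencapsulated Ethernet/IPv6/TCP, and fall back to the fixed
 * offset (FOS) whenever parsing fails, e.g. on an unknown ethertype
 * or an IPv6 extension header this sketch does not walk. */
static uint32_t split_offset(const uint8_t *pkt, uint32_t len)
{
    if (len < ETH_HLEN + 40 + 20)
        return FOS_OFFSET;

    /* Ethertype at bytes 12-13, big endian. */
    if (((uint32_t)pkt[12] << 8 | pkt[13]) != ETH_P_IPV6)
        return FOS_OFFSET;              /* unknown protocol: fall back */

    const uint8_t *ip6 = pkt + ETH_HLEN;
    if (ip6[6] != IPPROTO_TCP)          /* next header field */
        return FOS_OFFSET;

    const uint8_t *tcp = ip6 + 40;
    uint32_t doff = (uint32_t)(tcp[12] >> 4) * 4; /* TCP data offset */
    if (doff < 20 || ETH_HLEN + 40 + doff > len)
        return FOS_OFFSET;              /* malformed: fall back */

    return ETH_HLEN + 40 + doff;        /* start of TCP payload */
}
</syntaxhighlight>

Every failure path returns the configurable fixed offset rather than rejecting the packet, which is exactly the fallback behavior the spec requires of parsing devices.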