<span id="ocp-server-nic-sw-specification-core-features"></span>
<span id="ocp-server-nic-sw-specification-core-features"></span>
= OCP Server NIC SW Specification: Core Features =
= OCP Server NIC SW Specification: Core Features =
-----
<span id="draft-2023-05-16---first-public-review"></span>
== DRAFT 2023-05-16 - First Public Review ==

<span style="background:orange; font-weight:bold">
Automatically converted from Google Docs to Markdown. Some markup may be incorrect. For an authoritative version, read the [https://www.opencompute.org/documents/ocp-server-nic-core-features-specification-ocp-spec-format-1-pdf published PDF].
</span>


 
This is an effort of the OCP [[Networking/NIC Software]] project.

-----
=== Contact ===


This specification was created through the [https://www.opencompute.org/wiki/Networking/NIC_Software NIC software] effort within the OCP Networking project by OCP member companies Google, Intel, Meta and NVIDIA.


Comments, questions, suggestions for revisions and requests to join the standard committee can be directed to the OCP Networking mailing list. See [https://www.opencompute.org/projects/networking opencompute.org/projects/networking] for details.

This document benefited tremendously from detailed feedback from the wider OCP networking community. The authors want to thank everyone who took the time to review the specification. Your contributions are invaluable.


<span id="i-o-api"></span>
<span id="i-o-api"></span>
Line 145: Line 145:
The device MUST support scatter-gather I/O.
The device MUST support scatter-gather I/O.


Transmitted packets may consist of multiple discrete host memory buffers. The device MUST support a minimum of (MTU / PAGE_SIZE) scatter-gather memory buffers for MTU-sized packets, rounded up to the nearest natural number, plus a separate header buffer. For packets with segmentation offload (see below), the device must support this number times the maximum number of supported segments, with an absolute minimum of 17: the minimum number of 4KB pages to span a 64KB TSO packet. Again, plus a separate header buffer.


For the receive case, the host may choose to post buffers smaller than MTU to the receive queue. The device must support the same limits as for transmit queues: the absolute minimum of 2 buffers per packet and the relative minimum of (MTU / PAGE_SIZE) in the general case, and the absolute minimum of 17 and the relative minimum of N * (MTU / PAGE_SIZE) for large packets produced by Receive Segment Coalescing (RSC, below).
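As a worked illustration of these minimums, the sketch below computes the required buffer counts for a hypothetical configuration; the constants and function names are not part of the spec.

<pre>#include <stdio.h>

/* Illustrative only: compute the minimum scatter-gather buffer counts
 * described above, assuming 4KB pages. */
#define PAGE_SIZE 4096

static unsigned int min_buffers_mtu(unsigned int mtu)
{
    /* (MTU / PAGE_SIZE) rounded up to the nearest natural number,
     * plus a separate header buffer */
    return (mtu + PAGE_SIZE - 1) / PAGE_SIZE + 1;
}

static unsigned int min_buffers_so(unsigned int mtu, unsigned int max_segs)
{
    unsigned int n = ((mtu + PAGE_SIZE - 1) / PAGE_SIZE) * max_segs;

    if (n < 17)   /* a 64KB TSO payload can span 17 4KB pages when unaligned */
        n = 17;
    return n + 1; /* plus a separate header buffer */
}

int main(void)
{
    printf("MTU 1500 packet:       %u buffers\n", min_buffers_mtu(1500));
    printf("MTU 1500, 44 segments: %u buffers\n", min_buffers_so(1500, 44));
    return 0;
}</pre>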
===== Receive Header-Split =====


A device SHOULD support the special case of receive scatter-gather I/O that splits headers from application layer payload. It SHOULD be possible to allocate header and data buffers from separate memory pools.

All protocol header buffers for an entire queue may be allocated as one contiguous DMA region, to minimize IOTLB pressure. In this model, the host operating system will copy the headers out on packet reception, so the region need only allocate exactly as many headers as there are descriptors in the queue.
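A minimal sketch of that layout follows, assuming a fixed per-packet header budget; the structure and names are invented for illustration only.

<pre>#include <stdlib.h>

/* Hypothetical illustration: one contiguous header region per receive queue,
 * with exactly one fixed-size header slot per descriptor. */
#define HDR_BUF_SIZE 256        /* assumed per-packet header budget */

struct rx_queue_hdrs {
    unsigned char *region;      /* one DMA-mapped allocation for the whole queue */
    unsigned int   n_desc;
};

static int hdr_region_alloc(struct rx_queue_hdrs *q, unsigned int n_desc)
{
    q->n_desc = n_desc;
    /* In a real driver this would be a DMA-coherent allocation. */
    q->region = calloc(n_desc, HDR_BUF_SIZE);
    return q->region ? 0 : -1;
}

/* Header slot that descriptor 'idx' points the device at. */
static unsigned char *hdr_slot(struct rx_queue_hdrs *q, unsigned int idx)
{
    return q->region + (size_t)(idx % q->n_desc) * HDR_BUF_SIZE;
}</pre>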


Header-split allows direct data placement (DDP) of application payload into user or device memory (e.g., GPUs), while processing protocol headers in the host operating system. The operating system is responsible for ensuring that payload is not loaded into the CPU during protocol processing. Data is placed in posted buffers in the order that it arrives. Transport layer in-order delivery in the context of DDP is out of scope for this spec.


Header-split SHOULD be implemented by protocol parsing to identify the start of payload. The protocol option space is huge in principle. This spec limits it to unencapsulated TCP/IP, which covers the majority of relevant datacenter workloads (crypto is deferred to a future version of the spec). Protocol parsing can fail for many reasons, such as encountering an unknown protocol type. The device MUST then allow falling back to splitting packets at a fixed offset. This offset SHOULD be host configurable.


Header-split MAY be implemented with only support for a fixed offset: Fixed Offset Split (FOS). This variant does not require protocol parsing and is thus simpler to implement. Workloads often have a common default protocol layout, such as Ethernet/IPv6/TCP/TSopt. Splitting at 14 + 40 + 20 + 12 = 86 bytes will correctly cover this modal packet layout, and with that the majority of packets arriving on a host. True header split is strongly preferred over FOS, and required at the advanced conformance level. If FOS is implemented, the offset MUST be host configurable.
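For illustration, a sketch of the fallback policy described above, limited to unencapsulated IPv6/TCP and using the 86-byte default; all names are hypothetical and a real parser would also handle IPv4 and extension headers.

<pre>#include <stddef.h>
#include <stdint.h>

/* Illustrative only: compute where to split a packet into header and payload
 * buffers. Parse unencapsulated IPv6/TCP; anything else falls back to a
 * host-configured fixed offset (FOS). */
#define ETH_HLEN    14
#define ETH_P_IPV6  0x86DD
#define FOS_DEFAULT (14 + 40 + 20 + 12)   /* Eth + IPv6 + TCP + TSopt = 86B */

static size_t split_offset(const uint8_t *pkt, size_t len, size_t fos_offset)
{
    /* Ethernet + fixed IPv6 header + minimal TCP header present? */
    if (len >= ETH_HLEN + 40 + 20 &&
        ((pkt[12] << 8) | pkt[13]) == ETH_P_IPV6 &&
        pkt[ETH_HLEN + 6] == 6 /* next header: TCP, no extension headers */) {
        size_t doff = (pkt[ETH_HLEN + 40 + 12] >> 4) * 4;  /* TCP data offset */
        if (doff >= 20 && ETH_HLEN + 40 + doff <= len)
            return ETH_HLEN + 40 + doff;    /* true header-split */
    }
    return fos_offset ? fos_offset : FOS_DEFAULT;  /* fixed-offset fallback */
}</pre>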
===== Count =====


The device SHOULD also support configuring a maximum event count until an interrupt is sent. This triggers an interrupt when a configurable number of events since the last interrupt is reached. Each event corresponds to a single received or transmitted packet. For TSO/RSC packets, each segment should be counted separately. When supporting a maximum event count, the device MUST support values in the range of [2, 128]. It then MUST send an interrupt when either of the two interrupt moderation conditions is met, whichever comes first. Reaching the maximum number of events immediately raises an interrupt regardless of remaining delay, so the delay constitutes an upper bound. Triggering an interrupt for either limit MUST lead to both counters being reset.
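A software sketch of the combined count and delay conditions follows, with invented names and units; a real device would also arm a timer so the delay limit can fire without a new event.

<pre>#include <stdbool.h>
#include <stdint.h>

/* Illustrative interrupt moderation state: fire on whichever limit is
 * reached first, then reset both. Names and units are hypothetical. */
struct irq_moderation {
    uint32_t max_events;     /* configured, in [2, 128]                  */
    uint64_t max_delay_ns;   /* configured maximum delay                 */
    uint32_t events;         /* events since last interrupt              */
    uint64_t first_event_ns; /* timestamp of first pending event, 0 if none */
};

/* Called per completed packet (or per segment for TSO/RSC). Returns true
 * when an interrupt should be raised; both counters are then reset. */
static bool irq_event(struct irq_moderation *m, uint64_t now_ns)
{
    if (m->events++ == 0)
        m->first_event_ns = now_ns;

    if (m->events >= m->max_events ||
        now_ns - m->first_event_ns >= m->max_delay_ns) {
        m->events = 0;
        m->first_event_ns = 0;
        return true;    /* raise interrupt */
    }
    return false;
}</pre>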


<span id="tx-and-rx"></span>
<span id="tx-and-rx"></span>
Line 280: Line 278:
====== Receive Hash ======
====== Receive Hash ======


The computed 32b hash SHOULD be passed to the host alongside the packet. Doing so allows the host to perform additional flow steering without having to compute a hash in software, such as Linux Receive Flow Steering (RFS).

A device MAY compute a 64b hash to reduce collisions. It MAY communicate this instead, as long as the 32b Toeplitz hash can either be derived from it or be communicated alongside.


<span id="indirection-table"></span>
<span id="indirection-table"></span>
====== Indirection Table ======
====== Indirection Table ======


The device MUST select a queue by reducing the hash through modulo arithmetic. It applies division to the hash value and uses the remainder as an index into a fixed number of resources. The divisor is not simply the number of receive queues. RSS specifies an additional level of indirection, the indirection table. This allows for non-uniform load balancing. The device MUST support the RSS indirection table. The device MUST look up a queue using the following modulo operation:


<pre>queue_id = rss_table[rxhash % rss_table_length];</pre>
The table MUST be host-readable and writable. The host may configure the table with fewer slots than the configured number of receive queues, if the host wants to apply RSS to only a subset of queues. The host may configure the table with more slots than the number of receive queues, for more uniform load balancing. The device may limit the maximum supported table size. The minimum supported indirection table size MUST be at least the number of supported receive queues. The minimum SHOULD be at least 4 times the number of supported receive queues. The device SHOULD allow querying the maximum supported table size by the host. The device SHOULD allow replacement of the indirection table without pausing network traffic or bringing the device down, to support dynamic rebalancing, e.g., based on CPU load.
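As an illustration of non-uniform balancing through the indirection table, the sketch below fills a table with weighted queue entries and performs the lookup above; table size, types and names are assumptions of this example.

<pre>#include <stdint.h>

/* Illustrative only: fill an RSS indirection table with per-queue weights
 * to get non-uniform load balancing. A table several times larger than the
 * queue count gives finer-grained ratios. Weights are assumed non-zero. */
#define RSS_TABLE_LEN 128

static void rss_fill_weighted(uint16_t rss_table[RSS_TABLE_LEN],
                              const uint32_t *weights, unsigned int n_queues)
{
    uint32_t total = 0, filled = 0;
    for (unsigned int q = 0; q < n_queues; q++)
        total += weights[q];

    /* Give each queue a run of slots proportional to its weight. */
    for (unsigned int q = 0; q < n_queues; q++) {
        uint32_t slots = (uint64_t)weights[q] * RSS_TABLE_LEN / total;
        while (slots-- && filled < RSS_TABLE_LEN)
            rss_table[filled++] = q;
    }
    /* Rounding may leave a few slots; assign them round robin. */
    for (unsigned int q = 0; filled < RSS_TABLE_LEN; q = (q + 1) % n_queues)
        rss_table[filled++] = q;
}

/* Lookup, as in the spec: queue_id = rss_table[rxhash % rss_table_length]; */
static uint16_t rss_lookup(const uint16_t rss_table[RSS_TABLE_LEN], uint32_t rxhash)
{
    return rss_table[rxhash % RSS_TABLE_LEN];
}</pre>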


<span id="accelerated-rfs"></span>
<span id="accelerated-rfs"></span>
Line 351: Line 347:
The device MUST implement equal weight deficit round robin (DRR) as default dequeue algorithm [ref_id_fq_drr]. Deficit round robin is a per-byte algorithm. Time is divided in rounds. Each queue earns a constant number of byte credits during each round, its quantum. The device services queues in a round robin order. If a queue has data outstanding when it is scanned, all packets that add up to less than the queue’s quantum are sent and the credit is reduced accordingly. If one or more packets cannot be sent because the packet at the head of the queue is longer than the remaining quantum, then the remaining quantum carries over to the next round. If the queue is empty at the end of a round, the remaining quantum is reset to zero.
The device MUST implement equal weight deficit round robin (DRR) as default dequeue algorithm [ref_id_fq_drr]. Deficit round robin is a per-byte algorithm. Time is divided in rounds. Each queue earns a constant number of byte credits during each round, its quantum. The device services queues in a round robin order. If a queue has data outstanding when it is scanned, all packets that add up to less than the queue’s quantum are sent and the credit is reduced accordingly. If one or more packets cannot be sent because the packet at the head of the queue is longer than the remaining quantum, then the remaining quantum carries over to the next round. If the queue is empty at the end of a round, the remaining quantum is reset to zero.
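A minimal software model of one DRR round under these rules is sketched below; the queue representation is invented purely to show the credit bookkeeping.

<pre>#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-queue state for an equal-weight DRR scheduler. The
 * "queue" here is just a list of pending packet lengths, enough to show
 * the quantum and deficit bookkeeping described above. */
#define MAX_PKTS 64

struct txq {
    uint32_t quantum;            /* byte credit earned per round          */
    uint32_t deficit;            /* unused credit carried between rounds  */
    uint32_t pkt_len[MAX_PKTS];  /* lengths of queued packets             */
    size_t   head, tail;         /* head == tail means empty              */
};

static void drr_round(struct txq *queues, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        struct txq *q = &queues[i];

        if (q->head == q->tail) {
            q->deficit = 0;          /* empty queue: credit resets */
            continue;
        }
        q->deficit += q->quantum;    /* credit earned this round */

        /* Send while the head packet fits in the remaining credit. */
        while (q->head != q->tail && q->pkt_len[q->head] <= q->deficit) {
            q->deficit -= q->pkt_len[q->head];
            q->head++;               /* "transmit" the head packet */
        }
        if (q->head == q->tail)
            q->deficit = 0;          /* queue drained: reset credit */
    }
}</pre>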


The device SHOULD also support DRR with non-equal weights. Then it MUST support host configuration of the weights. This specification does not prescribe a specific interface to program the weights. In Linux, this feature does not have a standard API as of this writing.


The device MAY offer additional algorithms. If strict priority is supported, it SHOULD implement this mode with starvation prevention.

Devices with a programmable hardware parser allow the administrator to push firmware updates to support custom protocols. A programmable parser is still strictly less desirable than protocol independent offloads, as programmable parsers introduce correlated roll-outs between software and firmware. At hyperscale, correlated roll-outs and potential roll-backs add significant complexity and risk.

This target of protocol independence is in conflict with some features defined in this spec (RSS, header-split, etcetera). That is why the prescriptive opening sentence of this section starts with “where possible”. Where features can be implemented without parsing, that design MUST be taken.


<span id="checksum-offload"></span>
<span id="checksum-offload"></span>
Line 419: Line 413:
A device MUST be able to verify ones’ complement checksums. The device SHOULD implement the feature in a protocol independent manner.
A device MUST be able to verify ones’ complement checksums. The device SHOULD implement the feature in a protocol independent manner.


Protocol independent linear ones’ complement (PILOC) receive checksum offload computes the ones’ complement sum over the entire packet exactly as passed by the driver to the host, for every packet, excluding only the 14B Ethernet header. The sum MUST exclude the Ethernet header. It MUST include all headers after this header, including VLAN tags if present. It MUST exclude all fields not passed to the host, such as possible crypto protocol MAC footers.


It MUST be possible for the host to independently verify checksum correctness by computing the same sum in software. This is impossible if the checksum includes bytes removed by the device, such as an Ethernet FCS.
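For illustration, a host-side routine that computes the same linear ones' complement sum (with standard carry folding), skipping only the 14B Ethernet header:

<pre>#include <stddef.h>
#include <stdint.h>

/* Illustrative host-side PILOC verification: ones' complement sum over the
 * packet as received, excluding only the 14B Ethernet header. */
static uint16_t csum_piloc(const uint8_t *pkt, size_t len)
{
    uint32_t sum = 0;
    size_t i = 14;                        /* skip the Ethernet header */

    for (; i + 1 < len; i += 2)
        sum += (uint32_t)pkt[i] << 8 | pkt[i + 1];
    if (i < len)
        sum += (uint32_t)pkt[i] << 8;     /* odd trailing byte */

    while (sum >> 16)                     /* fold carries (ones' complement) */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}</pre>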


Legacy devices MAY instead return only a boolean value with the packet that signals whether a checksum was successfully verified. This approach is strongly discouraged. If this approach is chosen, then the device MUST checksum only the outermost TCP or non-zero UDP checksum (if it verifies a checksum at all) and MUST return true only if this checksum can be verified. The device SHOULD then compute the sum over the pseudo-header, L4 header and payload, including the checksum field, and verify that this sums up to zero. Note that both negative and positive zero MUST be interpreted as valid sums, for all protocols except UDP. Only for UDP does the all-zeroes checksum 0x0000 indicate that the checksum should not be verified. An implementation returning a PILOC sum does not require extra logic to address these protocol variations.


The device MUST pass all packets to the host, including those that appear to fail checksum verification. The host must be able to account, verify and report such packets.
===== Copy Headers and Split Payload =====


In an abstract model of segmentation offload, the device splits SO packet payload into segment sized chunks and copies the SO packet protocol headers to each segment. We refer to this basic mechanism as copy-headers-and-split-payload (CH/SP). The host communicates an unsigned integer segment size to the device along with the packet. This field must be large enough to cover the L3 MTU range: 16b is customary, but not strictly required to meet this goal. If segment size is not a divisor of total payload length, then the last packet in the segment chain will be shorter. The device MUST NOT attempt to compute or derive segment size, because establishing that is a complex process of path MTU and transport MSS discovery, more suitable to be implemented in software in the host protocol stack.


CH/SP is a simplified model. For specific protocols, segmentation offload can have subtle exceptions in how protocol header fields must be updated after copy. This spec explicitly defines all cases that diverge from pure CH/SP. The ground truth is the software segmentation implementation in Linux v6.3. If the two disagree, that source code takes precedence.
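A simplified software model of pure CH/SP is sketched below; it deliberately omits the per-protocol field updates that the following sections define, and the buffer sizes are assumptions of the example.

<pre>#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative CH/SP model: split an SO packet's payload into segment-size
 * chunks and prepend a copy of the SO packet's headers to each chunk.
 * Protocol-specific field updates (lengths, checksums, TCP seq, IP ID, ...)
 * are deliberately omitted. Assumes hdr_len + seg_size <= sizeof(data). */
struct seg {
    uint8_t data[2048];
    size_t  len;
};

static size_t chsp_segment(const uint8_t *so_pkt, size_t hdr_len, size_t pkt_len,
                           size_t seg_size, struct seg *out, size_t max_out)
{
    const uint8_t *payload = so_pkt + hdr_len;
    size_t payload_len = pkt_len - hdr_len;
    size_t n = 0;

    for (size_t off = 0; off < payload_len && n < max_out; off += seg_size, n++) {
        size_t chunk = payload_len - off;
        if (chunk > seg_size)
            chunk = seg_size;       /* last segment may be shorter */

        memcpy(out[n].data, so_pkt, hdr_len);                 /* copy headers  */
        memcpy(out[n].data + hdr_len, payload + off, chunk);  /* split payload */
        out[n].len = hdr_len + chunk;
    }
    return n;   /* number of segments produced */
}</pre>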
A device MUST support TCP Segmentation Offload (TSO), for both IPv4 and IPv6. It MUST be possible to enable or disable the feature. The device MUST support TSO with TCP options.


The device SHOULD support IPv4 options and IPv6 extension headers in between the IPv4 or IPv6 and TCP header. The device SHOULD support IPSec ESP and PSP transport-layer encryption headers between the IPv4 or IPv6 and TCP header. As with other fields, the device should treat these bytes as opaque and copy them unconditionally unless otherwise specified.


TCP is particularly suitable for segmentation offload because at the user interface TCP is defined as a bytestream. By this definition, the user may have no expectations of how data is segmented into packets, in contrast with datagrams or message based protocols.

A device SHOULD support UDP Segmentation Offload (USO), for both IPv4 and IPv6. It MUST be possible to enable or disable the feature.
USO allows sending multiple UDP datagrams in a single operation. The host passes to the device a UDP packet plus a segment size field. The device splits the datagram payload on segment size boundaries and copies the UDP header to each segment.

USO is NOT the same as UDP fragmentation offload (UFO), which sends a datagram larger than the MTU by relying on IP fragmentation. UFO is out of scope of this spec. Unlike UFO, USO does not maintain ordering. Datagrams may arrive out of order, same as if they were sent one at a time.
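As a host-side usage illustration, Linux exposes USO to applications through the UDP_SEGMENT socket option (UDP GSO); the sketch below sends one 12000B buffer as ten 1200B datagrams. Error handling is omitted and the fallback define is only needed on older libc headers.

<pre>/* Illustrative Linux host-side use of USO via UDP GSO: send one large
 * buffer and let the stack (and, when offloaded, the NIC) segment it
 * into 1200-byte datagrams. */
#include <netinet/in.h>
#include <netinet/udp.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef UDP_SEGMENT
#define UDP_SEGMENT 103         /* from linux/udp.h, if the libc header lacks it */
#endif

int main(void)
{
    char payload[12000];        /* sent as 10 x 1200B datagrams */
    int gso_size = 1200;
    struct sockaddr_in dst = {
        .sin_family = AF_INET,
        .sin_port   = htons(7777),
        .sin_addr   = { htonl(INADDR_LOOPBACK) },
    };

    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    setsockopt(fd, IPPROTO_UDP /* == SOL_UDP */, UDP_SEGMENT,
               &gso_size, sizeof(gso_size));

    memset(payload, 0, sizeof(payload));
    sendto(fd, payload, sizeof(payload), 0,
           (struct sockaddr *)&dst, sizeof(dst));
    close(fd);
    return 0;
}</pre>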


The device SHOULD support IPv4 options and IPv6 extension headers in between the IPv4 or IPv6 and UDP header. The device SHOULD support IPSec ESP and PSP transport-layer encryption headers between the IPv4 or IPv6 header and the UDP header.
==== Jumbogram Segmentation Offload ====


The device SHOULD support IPv4 and IPv6 jumbogram SO packets that exceed the 64 KB maximum IP packet size.

IPv6 headers have a 16-bit payload length field, so the largest possible standard IPv6 packet is 64 KB + IPv6 header (payload length includes IPv6 extension headers, if any). IPv4 headers have a 16-bit total length field, so the largest possible IPv4 packet is slightly smaller: 64 KB including header.

Jumbogram segmentation offload ignores the IPv6 payload length and IPv4 total length fields if zero. The host must then communicate the real length of the entire packet to the device out-of-band of the packet, likely as a descriptor field. The device can use the established TSO, USO and PISO rules to derive the total payload length from the total packet length.

RFC 2675 defines an IPv6 jumbo payload option, with which IPv6 packets can support up to 4GB of payload. This configuration sets the payload length field to zero and appends a hop-by-hop next header with the jumbo payload option. Unlike for IPv6 jumbograms that are sent as jumbograms on the wire, it is NOT necessary for IPv6 jumbo segmentation offload to include this jumbo payload hop-by-hop next header, as the segments themselves will not be jumbograms.
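A hypothetical transmit descriptor layout showing how the real length could travel out-of-band; the field names and widths are invented for this sketch and not prescribed by the spec.

<pre>#include <stdint.h>

/* Invented for illustration: a TX descriptor that carries the information a
 * jumbogram SO packet needs out-of-band, since the IP length field is zero. */
struct tx_so_desc {
    uint64_t buf_addr;      /* DMA address of (first) buffer              */
    uint32_t buf_len;       /* length of this buffer                      */
    uint32_t total_len;     /* real length of the whole SO packet,
                               used when the IP length field is 0         */
    uint16_t mss;           /* segment size chosen by the host            */
    uint16_t hdr_len;       /* bytes of protocol headers to copy          */
    uint16_t flags;         /* e.g. TSO/USO enable                        */
    uint16_t reserved;
};</pre>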


<span id="receive-segment-coalescing"></span>
<span id="receive-segment-coalescing"></span>
Line 558: Line 550:
===== Segment size =====
===== Segment size =====


The device MUST pass to the host, along with the large (SO) packet, a segment size field that encodes the payload length of the original packets. This field implies that packets are only coalesced if they have the same size on the wire. Coalescing stops if a packet arrives of different size. If it is larger than the previous packets, it cannot be appended. If it is smaller, it can be. If segment size is not a divisor of the SO packet payload, then the remainder encodes the payload length of this last packet.


''Reversibility''

* IPv4 total length is updated to match the SO packet.
** Or set to zero and the below jumbo rules apply.
* IPv6 payload length is updated to match the SO packet.
** Or set to zero and the below jumbo rules apply.
* IPv4 IP ID is the ID of the first segment.
* IPv4 checksum is valid.
<span id="jumbogram-receive-segmentation-offload"></span>
===== Jumbogram Receive Segmentation Offload =====
Devices SHOULD support coalescing of packet streams that exceed the maximum IPv4 or IPv6 packet size. Jumbogram RSC is the inverse of Jumbogram Segmentation Offload. It solves the length field limitation in the same way: the length field MUST be set to zero and the length communicated out-of-band, likely as a descriptor field.
Jumbogram RSC MUST only be applied if the total length exceeds what the IPv4 total length or IPv6 payload length field can encode.


<span id="timestamping"></span>
<span id="timestamping"></span>
Line 722: Line 705:
==== Ingress ====
==== Ingress ====


An ingress queue can build up on the device due to incast. If a standing queue can build up in the device, the device SHOULD mitigate head of line blocking of high priority traffic, by prioritizing traffic based on IP DSCP bits. The device MUST offer at least two traffic bands and MUST support host configurable mapping of DSCP bits to band. The device SHOULD offer weighted round robin (WRR) dequeue with weights configurable by the host. It may implement strict priority. If so, this MUST include starvation prevention with a minimum of 10% of bandwidth for every queue.
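A sketch of a host-configurable DSCP-to-band mapping with the minimum two bands; the chosen codepoints and names are examples, not requirements.

<pre>#include <stdint.h>

/* Illustrative host-configurable mapping from the 6 DSCP bits to a traffic
 * band. Two bands is the minimum required above; names are invented. */
#define DSCP_VALUES 64

enum band { BAND_BULK = 0, BAND_LATENCY_SENSITIVE = 1 };

static uint8_t dscp_to_band[DSCP_VALUES];   /* written by the host, read per packet */

static void dscp_map_init(void)
{
    /* Default everything to the bulk band, then elevate a few codepoints. */
    for (int i = 0; i < DSCP_VALUES; i++)
        dscp_to_band[i] = BAND_BULK;
    dscp_to_band[46] = BAND_LATENCY_SENSITIVE;   /* EF   */
    dscp_to_band[26] = BAND_LATENCY_SENSITIVE;   /* AF31 */
}

static enum band classify(uint8_t dscp)
{
    return dscp_to_band[dscp & 0x3f];
}</pre>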


<span id="egress"></span>
<span id="egress"></span>
Line 738: Line 721:
This feature relies on comparing packet departure time against a device clock. It thus depends on a device hardware clock and host clock synchronization as described in the section on timestamping. It requires a transmit descriptor field to encode the departure time.
This feature relies on comparing packet departure time against a device clock. It thus depends on a device hardware clock and host clock synchronization as described in the section on timestamping. It requires a transmit descriptor field to encode the departure time.


If the device supports EDT, then it MUST implement this according to the following rules. It MUST send without delay packets which have no departure time set or for which the departure time is in the past. It MUST NOT send a packet with a departure time before that departure time under any conditions. Departure time resolution MUST be 2us or smaller. The device MUST be able to accept and queue packets with a departure time up to 50 msec in the future. This “time horizon” is based on congestion control algorithms’ forward looking window. The device likely also has a global maximum storage capacity. The requirement that departure times up to 50 msec must be programmable DOES NOT imply that the device has to support enough storage space to queue up to 50 msec of data: actual packet spacing may be sparse. It SHOULD NOT have a maximum per interval capacity. The vendor MUST report all such bounds. The device MAY support a special slot for queueing packets with a time beyond the time horizon, or it may choose to drop those. The device MUST expose a counter for all packets dropped by the timing wheel due to either resource exhaustion or departure time beyond the horizon. The device SHOULD signal in a transmit completion when a packet was dropped rather than sent.
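A simplified timing wheel classification step consistent with these numbers (2us resolution, 50ms horizon) is sketched below; everything beyond those two constants is invented for illustration.

<pre>#include <stdint.h>

/* Illustrative timing wheel decision for Earliest Departure Time (EDT)
 * pacing. Resolution and horizon follow the numbers above; everything else
 * is hypothetical. A real device also bounds total storage and counts drops. */
#define SLOT_NS     2000ULL                     /* 2 us resolution   */
#define HORIZON_NS  (50ULL * 1000 * 1000)       /* 50 ms time horizon */
#define N_SLOTS     (HORIZON_NS / SLOT_NS)      /* 25000 slots        */

enum edt_verdict { EDT_SEND_NOW, EDT_QUEUE, EDT_DROP_OR_PARK };

struct edt_decision {
    enum edt_verdict verdict;
    uint32_t slot;              /* valid when verdict == EDT_QUEUE */
};

static struct edt_decision edt_classify(uint64_t now_ns, uint64_t departure_ns)
{
    struct edt_decision d = { EDT_SEND_NOW, 0 };

    if (departure_ns == 0 || departure_ns <= now_ns)
        return d;                               /* no delay: send now */
    if (departure_ns - now_ns > HORIZON_NS) {
        d.verdict = EDT_DROP_OR_PARK;           /* beyond the horizon */
        return d;                               /* count it if dropped */
    }
    d.verdict = EDT_QUEUE;
    d.slot = (uint32_t)((departure_ns / SLOT_NS) % N_SLOTS);
    return d;
}</pre>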


<span id="protocol-support"></span>
<span id="protocol-support"></span>
Line 940: Line 923:
===== Single Flow =====
===== Single Flow =====


Single flow MUST reach 40 Gbps with 1500B MTU and TSO. A single TCP/IP flow can reach 100 Gbps line rate when using TSO, 4KB MSS and copy avoidance [ref_id_tcp_rx_0copy], but this is a less common setup. Single flow line rate is not a hard requirement, especially as device speeds exceed 100 Gbps.


<span id="peak-stress-and-endurance-results"></span>
<span id="peak-stress-and-endurance-results"></span>
Line 967: Line 950:
The vendor MUST report maximum packet rate BOTH with a chosen optimal configuration and with a single pair of receive and transmit queues.
The vendor MUST report maximum packet rate BOTH with a chosen optimal configuration and with a single pair of receive and transmit queues.


The performance metrics should remain reasonably constant with queue count: packet rate at any queue count of 8 or higher SHOULD be no worse than 80% of the best case packet rate. If this cannot be met, the vendor MUST also report the worst case queue configuration and its packet rate. This is to avoid surprises as the user deploys the device and tunes configuration.
 
<span id="connection-count-and-rate"></span>
==== Connection Count and Rate ====
 
Most NIC features operate below the transport layer. Where features do interact with the transport layer, the NIC has to demonstrate that it can reach observed datacenter server workloads.
 
The NIC MUST scale to 10M open TCP/IP connections and 100K connection establishments + terminations (each) per second. It MUST be able to achieve this with no more than 100 CPU cores. This limit is not overly aggressive: it was chosen with significant room above what is observed in production.


<span id="latency"></span>
<span id="latency"></span>
Line 1,569: Line 1,545:
<td>
[2, 50] us
</td>
</tr>
<tr>
<td>
</td>
<td>
Jumbogram RSC
</td>
<td>
optional
</td>
<td>
</td>
</tr>
<td>
max pps / 8
</td>
</tr>
<tr>
<td>
</td>
<td>
CPS: 10M concurrent open TCP/IP connections
</td>
<td>
basic
</td>
<td>
with &lt;= 100 CPUs
</td>
</tr>
<tr>
<td>
</td>
<td>
CPS: 100K TCP/IP opens + closes
</td>
<td>
basic
</td>
<td>
with &lt;= 100 CPUs
</td>
</tr>
<pre>for BYTE in {01..64}; do
    ip link add link eth0 name eth0.$BYTE address 22:22:22:22:22:$BYTE type macvlan
done</pre>
The device must support promiscuous (all addresses) and allmulti (all multicast addresses) modes:


* Disable CPU sleep states (C-states), frequency scaling (P-states) and turbo modes.
* Disable hyperthreading
* Disable IOMMU
* Pin process threads
* Memory distance: pin threads and IRQ handlers to the same NUMA node or cache partition
** Select the NUMA node to which the NIC is connected

A pure Linux solution for packet processing can be built using eXpress Data Path (XDP). Packets must be generated on the host as close to the device as possible. A device that supports AF_XDP, in native driver mode, with copy avoidance and busy polling, has been shown to reach 30 Mpps on a 40 Gbps NIC using the rx_drop benchmark that ships with the Linux kernel. Over 100 Mpps has been demonstrated on 100 Gbps NICs, but these results have not been publicly published.
<span id="connection-rate"></span>
==== Connection Rate ====
Neper tcp_crr (“connect-request-response”) can demonstrate connection establishment and termination rate. The expressed target is 100K TCP/IP connections per second, with no more than 100 CPU cores. tcp_crr is invoked similarly to tcp_rr, but creates a separate connection for each request. Demonstrate with the boundary number of CPUs or fewer:
<pre>    tcp_crr -T $NUM_CPU -F $NUM_FLOWS [-c -H $SERVER]</pre>
'''Connection Count'''
There is no current test to demonstrate reaching the 10M concurrent connection count.


<span id="latency-1"></span>
<span id="latency-1"></span>
Line 2,284: Line 2,211:
<td>
Initial public draft
</td>
</tr>
<tr>
<td>
0.9.1
</td>
<td>
2023-07-13
</td>
<td>
Pre-publication: incorporated public feedback
</td>
</tr>