== Performance ==

The device must demonstrably meet its advertised performance targets in the intended operating environment. The vendor MUST document a reproducible test setup that demonstrates that all performance requirements are met. Appendix B lists suggestions for specific performance testing on Linux. Those are not prescriptive.

Targets must be met with a relatively standard off-the-shelf server that is representative of the intended target environment. For a 100 Gbps configuration, a single CPU socket with at least 16 CPU cores and a network configuration with 16 receive and transmit queues and RSS load-balancing is suggested.

Performance targets cover bitrate (bps), packet rate (pps) and NIC pipeline latency (nsec).

<span id="bitrate"></span>

==== Bitrate ====

Bitrate is the metric by which a device is often advertised, e.g., a 100 Gbps NIC.

<span id="variants"></span>

===== Variants =====

All of the following variants SHOULD reach the advertised line rate (a sketch of the resulting test matrix follows this list):

* TSO on and off
** if PISO is supported, across a UDP tunnel
* RSC on and off
* IOMMU on and off
* 1500B, 4168B and 9198B L3MTU
* Unidirectional and bidirectional traffic
* Scalability
** 10, 100, 1K, 10K flows
** 1, 10, NUM_CPU threads
** 1, 10, NUM_CPU queues
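These variants combine into a sizable matrix, and all combinations are expected to be exercised. As a rough illustration only, the sketch below enumerates the matrix in Python; <code>run_bitrate_test()</code> is a hypothetical placeholder for a vendor's actual harness, not part of this specification.

<syntaxhighlight lang="python">
import itertools

NUM_CPU = 16  # assumption: the suggested 16-core single-socket server

TSO     = ["on", "off"]        # add PISO-over-UDP-tunnel runs if supported
RSC     = ["on", "off"]
IOMMU   = ["on", "off"]
L3_MTU  = [1500, 4168, 9198]   # bytes
TRAFFIC = ["unidirectional", "bidirectional"]
FLOWS   = [10, 100, 1_000, 10_000]
THREADS = [1, 10, NUM_CPU]
QUEUES  = [1, 10, NUM_CPU]

def run_bitrate_test(tso, rsc, iommu, mtu, direction, flows, threads, queues):
    """Hypothetical hook: configure device and traffic generator, measure
    bps, and compare against the advertised line rate."""
    print(f"TSO={tso} RSC={rsc} IOMMU={iommu} MTU={mtu} {direction} "
          f"flows={flows} threads={threads} queues={queues}")

for combo in itertools.product(TSO, RSC, IOMMU, L3_MTU, TRAFFIC,
                               FLOWS, THREADS, QUEUES):
    run_bitrate_test(*combo)
</syntaxhighlight>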
<span id="single-flow"></span>

===== Single Flow =====

A single flow MUST reach 40 Gbps with 1500B MTU and TSO. A single TCP/IP flow can reach 100 Gbps line rate when using TSO, a 4KB MSS and copy avoidance [tcp_rx_0copy], but this is a less common setup. Single flow line rate is not a hard requirement, especially as device speeds exceed 100 Gbps.

<span id="peak-stress-and-endurance-results"></span>

===== Peak, Stress and Endurance Results =====

Short test runs can show best case numbers. Deployment requires sustained performance. Endurance tests can expose memory leaks and rare unrecoverable edge cases, e.g., those that result in device or queue timeout. Endurance tests essentially run the same testsuite over longer periods of time (see the sketch at the end of this subsection). Reported numbers for 1 hour runs MUST stay constant and match short term numbers.

Stress tests exercise specific adverse conditions. They need not be as long as endurance tests. Performance during adverse conditions may be lower than in the best case, but not catastrophically so. Device and driver are expected to handle overload gracefully. They MUST be resistant to Denial of Service (DoS) and incast. If the maximum packet rate for minimal packets is less than line rate, it SHOULD be constant regardless of packet arrival rate.
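One way to realize the endurance requirement is to sample the same measurement repeatedly over an hour and compare each sample to the short-run baseline. The sketch below assumes a hypothetical <code>measure_pps()</code> hook and a 5% tolerance; both are illustrative choices, not specification values.

<syntaxhighlight lang="python">
import time

def measure_pps():
    """Hypothetical hook: one short testsuite run, returning packets/sec."""
    return 100e6  # made-up value for illustration

baseline = measure_pps()            # short-term number to match
deadline = time.monotonic() + 3600  # 1 hour endurance run

while time.monotonic() < deadline:
    pps = measure_pps()
    if abs(pps - baseline) > 0.05 * baseline:  # 5% tolerance is an assumption
        print(f"degradation: {pps / 1e6:.1f} Mpps vs baseline "
              f"{baseline / 1e6:.1f} Mpps")
    time.sleep(60)  # sample once per minute
</syntaxhighlight>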
<span id="bus-contention"></span>

====== Bus Contention ======

Network traffic competes with other tasks for PCIe and memory bandwidth. Some micro-architectural considerations, such as NUMA or cache sizes and partitioning, cannot be controlled. But devices can be compared to the extent that they stress the PCIe or memory bus for the same traffic: how many PCIe messages are required to transfer the same number of packets of a given size is an indicator of real world throughput under bus contention.

This efficiency is evaluated by repeating the testsuite while running a memory antagonist. An effective memory antagonist on Unix environments is a pinned dd binary copying in-memory virtual files. A device SHOULD minimize the number of PCIe messages needed (see the section on PCIe Cache Aligned Stores) to reduce sensitivity to concurrent workloads.
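A minimal sketch of such an antagonist, assuming a Linux host with taskset and dd available: one pinned dd per chosen core, copying an in-memory virtual file (/dev/zero) to /dev/null. The core list and block size are illustrative assumptions.

<syntaxhighlight lang="python">
import subprocess

ANTAGONIST_CORES = [0, 1, 2, 3]  # assumption: which cores to load

# One pinned dd per core, streaming /dev/zero to /dev/null in 1M blocks.
procs = [
    subprocess.Popen(
        ["taskset", "-c", str(core),
         "dd", "if=/dev/zero", "of=/dev/null", "bs=1M"],
        stderr=subprocess.DEVNULL,
    )
    for core in ANTAGONIST_CORES
]

# ... rerun the bitrate/packet-rate testsuite here ...

for p in procs:  # stop the antagonist when done
    p.terminate()
</syntaxhighlight>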
<span id="packet-rate"></span>

==== Packet Rate ====

The vendor MUST report a maximum packet rate, and MUST demonstrate that the device reaches this rate.

''Scalability: Queue Count''

The vendor MUST report the maximum packet rate BOTH with a chosen optimal configuration and with a single pair of receive and transmit queues. The performance metrics should remain reasonably constant with queue count: packet rate at any queue count of 8 or higher SHOULD be no worse than 80% of the best case packet rate. If this cannot be met, the vendor MUST also report the worst case queue configuration and its packet rate. This is to avoid surprises as the user deploys the device and tunes its configuration.
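The 80% bound translates directly into a pass/fail check over per-queue-count measurements. In the sketch below the measured rates are made-up example numbers, not real results.

<syntaxhighlight lang="python">
# Packet rate (pps) by queue count; made-up example measurements.
pps_by_queues = {1: 22e6, 8: 95e6, 16: 118e6, 32: 112e6}

best = max(pps_by_queues.values())
for queues, pps in sorted(pps_by_queues.items()):
    if queues >= 8 and pps < 0.8 * best:
        print(f"FAIL: {queues} queues reach {pps / 1e6:.0f} Mpps, "
              f"below 80% of best case {best / 1e6:.0f} Mpps")
</syntaxhighlight>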
<span id="connection-count-and-rate"></span>

==== Connection Count and Rate ====

Most NIC features operate below the transport layer. Where features do interact with the transport layer, the NIC has to demonstrate that it can sustain observed datacenter server workloads. The NIC MUST scale to 10M open TCP/IP connections and 100K connection establishments plus terminations (each) per second. It MUST be able to achieve this with no more than 100 CPU cores. This limit is not overly aggressive: it was chosen with significant room above what is observed in production.
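For a sense of scale, the 100-core ceiling implies the following back-of-envelope per-core budget; the numbers are the spec's own targets divided evenly across cores.

<syntaxhighlight lang="python">
open_conns   = 10_000_000          # 10M open TCP/IP connections
events_per_s = 100_000 + 100_000   # 100K establishments + 100K terminations/sec
max_cores    = 100

print(open_conns // max_cores)     # 100000 open connections per core
print(events_per_s // max_cores)   # 2000 connection events per core per second
</syntaxhighlight>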
<span id="latency"></span>

==== Latency ====

Rx and Tx pipeline latency for standard Ethernet packets SHOULD NOT exceed 2 usec each, MUST NOT exceed 4 usec at the 90th percentile and MUST NOT exceed 20 usec at the 99th percentile, as measured under optimal conditions with no competing workload generating bus contention. If one of these bounds cannot be met, then this MUST be reported.

This requirement must be met for a standard device configuration. That is, checksum offload and RSS must be enabled. Interrupt delay is not included, so interrupt moderation may be disabled or interrupts disabled in favor of polling. This measurement is for packets that do not exercise TSO/USO/PISO or RSC.

Tx pipeline latency is the time from when the host signals to the device that work is pending to packet transmission as defined in the timestamping section. Rx pipeline latency is from reception as defined in the timestamping section until the descriptor is first readable by the host. In practice, measurement compares hardware PHY timestamps to best case conditions for host software timestamp measurement, which always overestimates to a degree. The chosen bounds were experimentally arrived at in the same manner and take this measurement error into account. See Appendix B for more details on testing.
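Evaluating the bounds then reduces to a percentile check over the collected samples. In the sketch below the samples are fabricated for illustration; real samples would come from the PHY-versus-software timestamp comparison described above, and checking the 2 usec SHOULD bound against the median is an interpretation, since the spec does not pin that bound to a percentile.

<syntaxhighlight lang="python">
import statistics

# Fabricated example latencies in usec; real data comes from timestamping.
samples_usec = ([1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0, 2.1, 2.3, 2.5] * 99
                + [18.5] * 10)

pctl = statistics.quantiles(samples_usec, n=100)  # pctl[k-1] is the k-th percentile
p90, p99 = pctl[89], pctl[98]

assert p90 <= 4.0,  f"p90 {p90:.2f} usec exceeds the 4 usec MUST bound"
assert p99 <= 20.0, f"p99 {p99:.2f} usec exceeds the 20 usec MUST bound"
if statistics.median(samples_usec) > 2.0:
    print("median exceeds the 2 usec SHOULD bound; report this")
</syntaxhighlight>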
<span id="appendix-a-checklist"></span>