== Offloads ==

Fixed function hardware is more efficient at specific operations than a general purpose CPU. Offloading to such hardware is an effective optimization for both memory intensive operations, such as checksum computation, and computationally expensive operations, such as encryption.

<span id="design-principles"></span>
=== Design principles ===

<span id="stateless"></span>
==== Stateless ====

Where possible, offloads SHOULD be implemented in a stateless manner. That is, all information consumed or produced for a packet is communicated along with the packet.

A stateful implementation may store per-flow state on the device, requiring additional communication to keep host and device state in sync. At the scale of modern servers and hyperscale deployments, this adds complexity and additional performance limitations. It should be avoided where possible. See the performance section for a summary of the canonical workloads and their scale.

<span id="protocol-independence"></span>
==== Protocol Independence ====

Where possible, offloads SHOULD be implemented in a protocol independent manner. Protocol dependent offloads are fragile, in that they break when protocols are revised or replaced.

Tunneling and transport layer encapsulation are common in hyperscale systems. Protocols used may be standard, such as Generic Routing Encapsulation (GRE) as defined in IETF RFCs 2784 and 2890. But even standard protocols can be extended, for example GRE key and sequence number extensions of RFC 2890. Hyperscale providers operate in a closed world, and as such are not limited to standardized protocols. They may extend standard headers with proprietary fields or entirely replace standard protocols with custom ones. They may stack protocols in arbitrary ways to encapsulate multiple layers of information, e.g., for routing, traffic shaping and sharing application metadata out-of-band (OOB).

<span id="programmable-parsers"></span>
===== Programmable Parsers =====

Devices with a programmable hardware parser allow the administrator to push firmware updates to support custom protocols. A programmable parser is still strictly less desirable than protocol independent offloads, as programmable parsers introduce correlated roll-outs between software and firmware. At hyperscale, correlated roll-outs and potential roll-backs add significant complexity and risk.

This target of protocol independence is in conflict with some features defined in this spec (RSS, header-split, etcetera). That is why the prescriptive opening sentence of this section starts with “where possible”. Where features can be implemented without parsing, that design MUST be taken.

<span id="checksum-offload"></span>
=== Checksum Offload ===

The device MUST support TCP and UDP checksum offload, for both IPv4 and IPv6, on both receive and transmit. The device SHOULD implement these features in a protocol-independent manner, by checksumming a linear range of bytes.

<span id="transmit-checksum"></span>
==== Transmit Checksum ====

The device MUST be able to insert a checksum in the TCP header such that the checksum conforms to IETF RFC 793 section 3.1.

The device SHOULD implement this feature in the form of a protocol independent linear ones’ complement (PILOC) checksum offload. In PILOC, the host communicates the start of the byte range to sum within the packet: checksum_start. It also specifies a positive insert offset from this byte where the sum must be stored: checksum_offset. The last byte is specified implicitly: summing continues to the end of the packet payload, excluding any (e.g., link layer) trailers. The insert offset MUST support both 6B for a UDP header checksum field offset and 16B for a TCP header.

With a PILOC checksum implementation, the device does not need to be aware of protocol specific complexity, such as a pseudo header checksum. The host will insert the pseudo header sum at checksum_start + checksum_offset, to include this in the linear sum. The device therefore MUST NOT clear this field before computing the linear sum.
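The transmit contract above can be illustrated with a minimal Python sketch. The function names <code>ones_complement_sum</code> and <code>piloc_tx_checksum</code>, the example addresses and ports, and the zeroed stand-in for the IPv4 header are all assumptions made for illustration; they are not part of this spec or of any driver API. The sketch omits the UDP zero checksum conversion described later in this section.

```python
def ones_complement_sum(data: bytes, initial: int = 0) -> int:
    """Return the 16-bit ones' complement sum of data, folded to 16 bits."""
    if len(data) % 2:
        data += b"\x00"  # pad an odd trailing byte with zero
    total = initial
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back in (end-around carry)
        total = (total & 0xFFFF) + (total >> 16)
    return total


def piloc_tx_checksum(packet: bytearray, checksum_start: int,
                      checksum_offset: int) -> None:
    """Hypothetical device model: sum checksum_start..end, store the complement."""
    # The checksum field is NOT cleared first: it holds the host's
    # pseudo header sum, which must be part of the linear sum.
    total = ones_complement_sum(bytes(packet[checksum_start:]))
    pos = checksum_start + checksum_offset
    packet[pos:pos + 2] = (~total & 0xFFFF).to_bytes(2, "big")


# Host side: UDP over IPv4, with the pseudo header sum seeded in the field.
src, dst, payload = bytes([192, 0, 2, 1]), bytes([192, 0, 2, 2]), b"hello"
udp_len = 8 + len(payload)
pseudo = src + dst + bytes([0, 17]) + udp_len.to_bytes(2, "big")
pseudo_sum = ones_complement_sum(pseudo)
udp = ((1234).to_bytes(2, "big") + (53).to_bytes(2, "big")
       + udp_len.to_bytes(2, "big") + pseudo_sum.to_bytes(2, "big") + payload)
packet = bytearray(20) + bytearray(udp)  # 20 zero bytes stand in for IPv4

# Device side: checksum_offset is 6 for UDP (it would be 16 for TCP).
piloc_tx_checksum(packet, checksum_start=20, checksum_offset=6)
```

A receiver that adds the pseudo header sum to the sum over the UDP header and payload obtains 0xFFFF, which is zero in ones' complement arithmetic. The device never needed to know that the protocol was UDP.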

Legacy devices MAY implement transmit checksum offload in a protocol dependent manner, fully in hardware. In this case the device parses the packet to find the start of the TCP or UDP header and the preceding IPv4 or IPv6 fields that must be included in the pseudo header. This approach is strongly discouraged, as (1) the innermost transport header may be preceded by protocol headers that the hardware cannot parse, (2) with tunnel encapsulation, a packet may contain multiple transport headers, (3) it may exclude other protocols with checksum fields, such as Generic Routing Encapsulation (GRE).

<span id="multiple-checksums"></span>
===== Multiple Checksums =====

A packet can contain multiple transport headers, some or all of which require valid checksums. A device that implements PILOC sums MUST NOT insert multiple checksums. It only inserts a single checksum at the location requested by the host. The host can prepare all other checksums in software efficiently. Once the device inserts the innermost checksum, by definition the innermost packet sums up to zero (including pseudo header). Any preceding checksums therefore can be computed by summing over headers only. Local Checksum Offload (LCO) [ref_id:csum] computes checksums of all but the innermost transport header efficiently in software without loading any payload bytes.
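The header-only computation can be sketched as follows. This is an illustrative Python model, not the Linux implementation: the function <code>lco_outer_checksum</code> and its parameters are hypothetical names, and the sketch assumes that the inner transport header starts at an even byte distance from the outer one, so that 16-bit words line up.

```python
def ones_complement_sum(data: bytes, initial: int = 0) -> int:
    """Return the 16-bit ones' complement sum of data, folded to 16 bits."""
    if len(data) % 2:
        data += b"\x00"  # pad an odd trailing byte with zero
    total = initial
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back in (end-around carry)
        total = (total & 0xFFFF) + (total >> 16)
    return total


def lco_outer_checksum(packet: bytes, outer_start: int, outer_offset: int,
                       inner_start: int, inner_offset: int,
                       outer_pseudo_sum: int) -> int:
    """Outer checksum computed from headers only; the payload is never read."""
    pos = inner_start + inner_offset
    inner_seed = int.from_bytes(packet[pos:pos + 2], "big")
    # Once the device fills in the inner checksum, the bytes from
    # inner_start to the end of the packet sum to the complement of the
    # seeded pseudo header sum. That value stands in for the whole range.
    inner_total = ~inner_seed & 0xFFFF
    # Sum the outer transport header and everything up to the inner
    # transport header, with the outer checksum field treated as zero.
    headers = bytearray(packet[outer_start:inner_start])
    headers[outer_offset:outer_offset + 2] = b"\x00\x00"
    total = ones_complement_sum(bytes(headers), outer_pseudo_sum + inner_total)
    return (~total & 0xFFFF) or 0xFFFF  # zero conversion, as required for UDP
```

The host stores the returned value in the outer checksum field and requests device checksum offload only for the inner header. The cost of the outer checksum is then proportional to header length, not packet length.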

<span id="udp-zero-checksum-conversion"></span>
===== UDP Zero Checksum Conversion =====

The UDP protocol as specified in IETF RFC 768 introduces a special case that MUST be handled correctly when computing checksums. A checksum that sums up to zero MUST be stored in the checksum field as negative zero in ones’ complement arithmetic: 0xFFFF. A device MAY apply the same logic to all checksums in a protocol independent manner.

Transport checksums are computed with ones’ complement arithmetic. In this arithmetic, a positive integer is converted to its negative complement by flipping all bits, and vice versa. Adding any number and its complement will produce all ones, or 0xFFFF. Every number thus has a complement. This includes zero: both 0x0000 and 0xFFFF represent zero.

RFC 768 adds explicit support for transmitting a datagram without checksum. This is signaled by setting the checksum field to 0x0000. To distinguish this lack of checksum from a computed checksum that sums up to zero, a sum that adds up to 0 MUST be written as 0xFFFF.
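The conversion amounts to one extra comparison after the complement step. A minimal sketch, in which the function name <code>udp_checksum_field</code> is an assumption made for illustration:

```python
def udp_checksum_field(folded_sum: int) -> int:
    """Map a folded 16-bit ones' complement sum to the value stored on the wire."""
    csum = ~folded_sum & 0xFFFF  # flipping all bits negates the sum
    # 0x0000 is reserved to mean "no checksum", so positive zero is
    # replaced by negative zero. Both encode the same number.
    return 0xFFFF if csum == 0x0000 else csum
```

For example, a folded sum of 0xFFFF complements to 0x0000 and is stored as 0xFFFF, while a folded sum of 0x1234 is stored unchanged as its complement 0xEDCB.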

<span id="receive-checksum"></span>
==== Receive Checksum ====

A device MUST be able to verify ones’ complement checksums. The device SHOULD implement the feature in a protocol independent manner.

Protocol independent linear ones’ complement (PILOC) receive checksum offload computes the ones’ complement sum over the entire packet exactly as passed by the driver to the host, for every packet, excluding only the 14B Ethernet header. The sum MUST exclude the Ethernet header. It MUST include all headers after this header, including VLAN tags if present. It MUST exclude all fields not passed to the host, such as possible crypto protocol MAC footers.

It MUST be possible for the host to independently verify checksum correctness by computing the same sum in software. This is impossible if the checksum includes bytes removed by the device, such as an Ethernet FCS.
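As a sketch of how a host might consume the PILOC receive sum, the Python model below verifies one transport checksum without touching payload bytes. The names <code>verify_l4_checksum</code>, <code>device_sum</code>, <code>l4_offset</code> and <code>pseudo_sum</code> are hypothetical, and the sketch assumes that the transport header starts at an even byte offset from the end of the Ethernet header.

```python
ETH_HLEN = 14  # Ethernet header bytes, excluded from the device sum


def ones_complement_sum(data: bytes, initial: int = 0) -> int:
    """Return the 16-bit ones' complement sum of data, folded to 16 bits."""
    if len(data) % 2:
        data += b"\x00"  # pad an odd trailing byte with zero
    total = initial
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back in (end-around carry)
        total = (total & 0xFFFF) + (total >> 16)
    return total


def verify_l4_checksum(frame: bytes, device_sum: int, l4_offset: int,
                       pseudo_sum: int) -> bool:
    """Verify one transport checksum using only the device's linear sum."""
    # Remove the contribution of the headers between the Ethernet header
    # and the transport header. Subtracting is adding the complement.
    skipped = ones_complement_sum(frame[ETH_HLEN:l4_offset])
    total = ones_complement_sum(
        b"", device_sum + (~skipped & 0xFFFF) + pseudo_sum)
    return total in (0x0000, 0xFFFF)  # both encodings of zero are valid
```

Only header bytes are read by the host. The same <code>device_sum</code> can be reused for each encapsulated transport header by passing a different <code>l4_offset</code> and pseudo header sum.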

Legacy devices MAY instead return only a boolean value with the packet that signals whether a checksum was successfully verified. This approach is strongly discouraged. If this approach is chosen, then the device MUST checksum only the outermost TCP or non-zero UDP checksum (if it verifies a checksum at all) and MUST return true only if this checksum can be verified. The device SHOULD then compute the sum over the pseudo-header, L4 header and payload, including the checksum field, and verify that this sums up to zero. Note that both negative and positive zero MUST be interpreted as valid sums, for all protocols except UDP. Only for UDP does the all-zeroes checksum 0x0000 indicate that the checksum should not be verified. An implementation returning a PILOC sum does not require extra logic to address these protocol variations.

The device MUST pass all packets to the host, including those that appear to fail checksum verification. The host must be able to account, verify and report such packets.

<span id="checksum-conversion"></span>
===== Checksum Conversion =====

If a legacy device only returns a boolean value, then host software can derive from this plus the checksum field in the packet a running sum over the packet from that header onward. It can use this to verify any subsequent checksums without touching data.

<span id="segmentation-offload"></span>
=== Segmentation Offload ===

Segmentation offload (SO) allows a host to pass the same number of bytes to the device in fewer packets. Most host transmission cost is a per-packet cost that is incurred as each packet traverses the software protocol stack layers. In this path, payload is not commonly accessed and thus packet size is less relevant. SO amortizes the per-packet overhead.

If a device supports SO, the host may pass it substantially larger packets than can be sent on the network. The device breaks up these SO packets into smaller packets and transmits those.

Segmentation offload depends on having checksum offload enabled, because packet checksums have to be computed after segmentation.

<span id="copy-headers-and-split-payload"></span>
===== Copy Headers and Split Payload =====

In an abstract model of segmentation offload, the device splits SO packet payload into segment sized chunks and copies the SO packet protocol headers to each segment. We refer to this basic mechanism as copy-headers-and-split-payload (CH/SP). The host communicates an unsigned integer segment size to the device along with the packet. This field must be large enough to cover the L3 MTU range: 16b is customary, but not strictly required to meet this goal. If segment size is not a divisor of total payload length, then the last packet in the segment chain will be shorter. The device MUST NOT attempt to compute or derive segment size, because establishing that is a complex process of path MTU and transport MSS discovery, more suitable to be implemented in software in the host protocol stack.

CH/SP is a simplified model. For specific protocols, segmentation offload can have subtle exceptions in how protocol header fields must be updated after copy. This spec explicitly defines all cases that diverge from pure CH/SP. The ground truth is the software segmentation implementation in Linux v6.3. If the two disagree, that source code takes precedence.
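The pure CH/SP mechanism, before any protocol specific adjustments, is small enough to state as code. The following Python sketch uses the hypothetical function name <code>ch_sp</code> and treats headers and payload as opaque byte strings:

```python
def ch_sp(headers: bytes, payload: bytes, segment_size: int) -> list[bytes]:
    """Copy headers to every segment and split payload on segment_size."""
    if segment_size <= 0:
        raise ValueError("segment size must be a positive integer")
    # The last slice is shorter when segment_size does not divide the
    # payload length. The headers are copied verbatim, never parsed.
    return [headers + payload[i:i + segment_size]
            for i in range(0, len(payload), segment_size)]
```

For instance, splitting an 8 byte payload with a segment size of 3 yields two full segments and one 2 byte remainder, each carrying an identical copy of the headers.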

<span id="tcp-segmentation-offload"></span>
==== TCP Segmentation Offload ====

A device MUST support TCP Segmentation Offload (TSO), for both IPv4 and IPv6. It MUST be possible to enable or disable the feature. The device MUST support TSO with TCP options.

The device SHOULD support IPv4 options and IPv6 extension headers in between the IPv4 or IPv6 and TCP header. The device SHOULD support IPSec ESP and PSP transport-layer encryption headers between the IPv4 or IPv6 and TCP header. As with other fields, the device should treat these bytes as opaque and copy them unconditionally unless otherwise specified.

TCP is particularly suitable for segmentation offload because at the user interface TCP is defined as a bytestream. By this definition, the user may have no expectations of how data is segmented into packets, in contrast with datagrams or message based protocols.

TSO enables the host to send the largest possible IP packet to the device, ignoring any constraints on path maximum transmission unit (MTU) or negotiated TCP maximum segment size (MSS). The host TCP stack selects the current MSS for the TCP connection as segment size. This number may vary between connections and across a connection lifespan.

<span id="tcp-header-field-adjustments"></span>
===== TCP Header Field Adjustments =====

TSO requires protocol header changes to the TCP header after CH/SP:

* Sequence number: Sequence number of previous segment + segment size.
* Flags
** FIN, PSH are only reflected in the last segment, zero in all others
** CWR is only reflected in the first segment, zero in all others

''IP Header Field Adjustments''

IP protocols require these changes:

* IPv4 total length is updated to match the shorter payload
* IPv6 payload length is updated to match the shorter payload
* IPv4 packets must increment IP ID unless DF bit is set
* IPv4 packet checksum is recomputed
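The TCP and IPv4 adjustments listed above can be modeled per segment as follows. This is a simplified sketch: the <code>TcpMeta</code> record and the <code>tso_segments</code> function are hypothetical names, only the fields that change are tracked, and length and checksum updates are left out.

```python
from dataclasses import dataclass

FIN, PSH, ACK, CWR = 0x01, 0x08, 0x10, 0x80  # TCP flag bits


@dataclass(frozen=True)
class TcpMeta:
    seq: int          # TCP sequence number
    flags: int        # TCP flags byte
    ip_id: int        # IPv4 identification field
    df: bool          # IPv4 Don't Fragment bit
    payload_len: int  # TCP payload bytes


def tso_segments(so: TcpMeta, segment_size: int) -> list[TcpMeta]:
    """Return the adjusted header fields of each segment of an SO packet."""
    count = -(-so.payload_len // segment_size)  # ceiling division
    segments = []
    for i in range(count):
        offset = i * segment_size
        flags = so.flags
        if i != count - 1:
            flags &= ~(FIN | PSH)  # FIN and PSH only on the last segment
        if i != 0:
            flags &= ~CWR          # CWR only on the first segment
        segments.append(TcpMeta(
            seq=(so.seq + offset) & 0xFFFFFFFF,
            flags=flags,
            ip_id=so.ip_id if so.df else (so.ip_id + i) & 0xFFFF,
            df=so.df,
            payload_len=min(segment_size, so.payload_len - offset),
        ))
    return segments
```

All other header bytes are copied unchanged, as in pure CH/SP.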

''Extension Header Field Adjustments''

Headers between the IPv4 or IPv6 header and TCP header MUST be copied as pure CH/SP.

Authenticated encryption has to happen after SO. IPSec ESP or PSP encryption headers must be copied in a pure CH/SP manner to each segment, for further processing by downstream inline encryption logic.

<span id="udp-segmentation-offload"></span>
==== UDP Segmentation Offload ====

A device SHOULD support UDP Segmentation Offload (USO), for both IPv4 and IPv6. It MUST be possible to enable or disable the feature.

USO allows sending multiple UDP datagrams in a single operation. The host passes to the device a UDP packet plus segment size field. The device splits the datagram payload on segment size boundaries and copies the UDP header to each segment.

USO is NOT the same as UDP fragmentation offload (UFO). That sends a datagram larger than MTU size, by relying on IP fragmentation. UFO is out of scope of this spec. Unlike UFO, USO does not maintain ordering. Datagrams may arrive out of order, same as if they were sent one at a time.

The device SHOULD support IPv4 options and IPv6 extension headers in between the IPv4 or IPv6 and UDP header. The device SHOULD support IPSec ESP and PSP transport-layer encryption headers between IPv4 or IPv6 header and UDP header.

UDP forms the basis for multiple high transfer rate protocols, including HTTP/3 and QUIC, and video streaming protocols like RTP. These workloads benefit from SO and form a sizable fraction of Internet workload.

''Header Field Adjustments''

Beyond CH/SP, USO requires an update of the UDP length field for the last segment if the USO payload is not an exact multiple of the segment size. It also requires the same IP and extension header field adjustments as TCP. A device SHOULD support this. Optionally, a device MAY only support USO for packets where payload is an exact multiple of segment size. The host then has to ensure to only pass such packets to the device. This mechanism forms the basis for Protocol Independent Segmentation Offload, next.

<span id="protocol-independent-segmentation-offload"></span>
==== Protocol Independent Segmentation Offload ====

A device SHOULD support Protocol Independent Segmentation Offload (PISO), for both IPv4 and IPv6. It MUST be possible to enable or disable the feature.

PISO codifies the core CH/SP mechanism. It extends segmentation offload to transport protocols other than TCP and UDP, and to tunneling scenarios, where a stack of headers precedes the inner transport layer. Many protocols can be supported purely with CH/SP.

In PISO, the host

# communicates a segment size to the device along with the large packet, as in TSO/USO.
# communicates also an inner payload offset piso_off to the device along with that packet.
# prepares any headers before piso_off as they need to appear after segmentation.

If any of the headers include a length field, PISO requires all segments to be the same size, because the host prepares the headers exactly as they appear on the wire. PISO does not adjust them.

If a payload size leaves a remainder after dividing by segment size, the host has to send two packets to the device: one PISO packet of payload length minus remainder, and a separate no-SO packet of remainder size. This is a software concern only.
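The host-side split is simple arithmetic. A minimal sketch, with <code>piso_split</code> as a hypothetical helper name:

```python
def piso_split(payload_len: int, segment_size: int) -> tuple[int, int]:
    """Return (PISO packet payload length, remainder packet payload length)."""
    remainder = payload_len % segment_size
    # The first value is an exact multiple of segment_size, so every PISO
    # segment has the same size. A remainder of 0 means no second packet.
    return payload_len - remainder, remainder
```

With a 10000 byte payload and a segment size of 1400, the host would send a PISO packet with 9800 bytes of payload (7 equal segments) and a separate 200 byte packet without segmentation offload.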

<span id="interaction-with-checksum-offload"></span>
===== Interaction with Checksum Offload =====

piso_off is similar to, but separate from, checksum_start. It must be possible to configure both independently.

<span id="interaction-with-tso-and-uso"></span>
===== Interaction with TSO and USO =====

PISO can be combined with TSO and USO. Then piso_off points not to the start of the payload, but the start of the inner transport header, TCP or UDP. Then the protocol specific rules for the inner transport protocol must be respected. Any headers before piso_off must still be entirely ignored by the device and treated solely as CH/SP. The device cannot infer whether the offset points to a UDP or TCP header. Whether to apply pure PISO, PISO + TSO or PISO + USO will have to be communicated explicitly, e.g., with a field in a context descriptor.

PISO + TSO/USO can optionally be supported on some legacy devices that were not built with PISO in mind. If a device supports TSO with variable length IPv4 options or IPv6 extension headers, with an explicit descriptor field that passes the length of these extra headers, then this can be used to pass arbitrary headers for CH/SP processing (instead of only options or extension headers), including tunnels. In this case the device expects the outer IPv4 or IPv6 header to be an SO header with a large length field, so not prepared for pure CH/SP. A driver can patch up this distinction from the PISO interface.

<span id="jumbogram-segmentation-offload"></span>
==== Jumbogram Segmentation Offload ====

The device SHOULD support IPv4 and IPv6 jumbogram SO packets that exceed the 64 KB maximum IP packet size.

IPv6 headers have a 16-bit payload length field, so the largest possible standard IPv6 packet is 64 KB + IPv6 header (payload length includes IPv6 extension headers, if any). IPv4 headers have a 16-bit total length field, so the largest possible IPv4 packet is slightly smaller: 64 KB including header.

Jumbogram segmentation offload ignores the IPv6 payload length and IPv4 total length fields if zero. The host must then communicate the real length of the entire packet to the device out-of-band of the packet, likely as a descriptor field.

RFC 2675 defines an IPv6 jumbo payload option, with which IPv6 packets can support up to 4 GB of payload. This configuration sets the payload length field to zero and appends a hop-by-hop next header with jumbo payload option. Unlike for IPv6 jumbograms that are sent as jumbograms on the wire, it is NOT necessary for IPv6 jumbo segmentation offload to include this jumbo payload hop-by-hop next header, as the segments themselves will not be jumbograms.

<span id="receive-segment-coalescing"></span>
=== Receive Segment Coalescing ===

The device MAY support Receive Segment Coalescing (RSC). If the device supports this feature, it MUST follow the below rules on packet combining.

Receive Segment Coalescing reduces packet rate from device to host by building a single large packet from multiple consecutive packet payloads in the same stream. The concept applies well to TCP, which defines payload as a contiguous byte stream.

The feature is also referred to as Large Receive Offload (LRO) and Hardware Generic Receive Offload (HW-GRO). The three mechanisms can differ in the exact rules on when and how to coalesce. RSC and LRO are originally defined only for TCP/IP. This section defines a broader set of rules. It takes the software Generic Receive Offload (GRO) in Linux v6.3 as ground truth. If the two disagree, that source code takes precedence.

Receive Segment Coalescing is the common term for this behavior. To avoid confusion we do not introduce yet another different acronym. But the RSC rules defined here differ from those previously defined by Microsoft [ref_id:msft_rsc]. At a minimum, in the following ways:

* This spec generalizes to other protocols besides IP and TCP
* This spec requires all TCP options to be supported

<span id="segment-size"></span>
===== Segment size =====

The device MUST pass to the host along with the large (SO) packet, a segment size field that encodes the payload length of the original packets. This field implies that packets are only coalesced if they have the same size on the wire. Coalescing stops if a packet arrives of different size. If it is larger than the previous packets, it cannot be appended. If it is smaller, it can be. If segment size is not a divisor of the SO packet payload, then the remainder encodes the payload length of this last packet.

''Reversibility''

The segment size field is mandatory. It must be possible to reconstruct the original packet stream. This reversibility capability is a hard requirement, to be able to use RSC plus TSO/USO/PISO for forwarding without creating externally observable changes to the packet stream compared to when both offloads are disabled.

The ground rule is that receive offload must be the exact inverse of segmentation offload. That is, if TSO/USO/PISO splits a large packet into a chain of small ones, RSC will rebuild the exact same packet. The inverse also holds. An RSC packet forwarded to a device for transmission with TSO/USO/PISO will result in the same packets on the wire as arrived before RSC coalescing.

Reconstructing the original packet stream imposes constraints on header coalescing beyond segment size. Each operation has to be reversible at segmentation offload. When fields are identical, coalescing is a trivially reversible operation. All other cases are explicitly listed below, by protocol. In exceptional cases, only where explicitly stated, do we allow information loss by coalescing packets with fields that differ.

<span id="stateful"></span>
===== Stateful =====

Receive Segment Coalescing is not stateless. This specification does not prescribe concrete implementation. But in an abstract design, RSC maintains a table of RSC contexts. This specification does not state a minimum required number of contexts. Each RSC context can hold one SO packet. Each flow maps onto at most one context. When a packet arrives, it is compared to all contexts. See RSS for flow matching. If a context matches a flow, the next phase enters.

A packet matches a context if it matches the flow, is consecutive to the SO packet and all header fields match. Fields match if they are the same, with some protocol-specific exceptions to this rule, all listed below.

<span id="context-closure"></span>
==== Context Closure ====

An SO context closes if a packet matches the flow, but not the other conditions. Then the data is flushed to the host and the context released. In the common case, the SO packet and incoming packet are then passed to the host as two packets. In a few specific exception cases, the incoming packet is appended to the SO packet and the single larger SO packet is passed to the host. This special case MUST happen if all fields match, except for payload size, and payload size of the incoming packet is less than the previous segments. It then forms the valid remainder of the SO packet. The same SHOULD happen also if all fields match except for PSH or FIN and either or both of these is set on the incoming packet.

<span id="general-match-exceptions"></span>
===== General Match Exceptions =====

* Length: if shorter than previous, may be appended, then closes context.
* Checksums: must all have been verified before RSC. Are ignored for packet matching.

<span id="tcp-header-field-exceptions"></span>
===== TCP Header Field Exceptions =====

* Sequence number: must be previous plus segment size.
* Flags: FIN and PSH bit only allow appending the packet, then close context.
* Flags: all other flag differences close context without append.

To state explicitly: Ack sequence number and TCP options must match.

''IP Header Field Exceptions''

* Fragmentation: fragmented packets are not coalesced. Detection of a first fragment closes the context for a flow, if open.
* The IP ID must either increment for each segment or be the same for all segments.
** The first is common. The second may be the result of segmentation offload.

To state explicitly: TTL, hop limit and flowlabel fields must match.

Contexts can also be closed if a maximum number of segments is reached. This maximum may be host configurable.
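The matching and closure rules can be summarized in a simplified Python model of a single RSC context. All names here (<code>Packet</code>, <code>SoPacket</code>, <code>coalesce</code>, <code>original_lengths</code>) are hypothetical. The sketch assumes one flow, packets that all carry payload, and a single opaque <code>other</code> tuple standing in for every header field that must match exactly; IP ID rules and fragmentation are not modeled.

```python
from dataclasses import dataclass

FIN, PSH, ACK = 0x01, 0x08, 0x10
CLOSING = FIN | PSH  # flags that allow one final append, then close


@dataclass
class Packet:
    seq: int
    length: int        # TCP payload bytes
    flags: int = ACK
    other: tuple = ()  # stand-in for all fields that must match exactly


@dataclass
class SoPacket:
    seq: int
    length: int
    segment_size: int
    flags: int
    other: tuple
    segments: int = 1


def coalesce(packets, max_segments=64):
    """Coalesce the packets of ONE flow, in arrival order, into SO packets."""
    done, ctx = [], None
    for p in packets:
        if ctx is not None:
            matches = (p.seq == ((ctx.seq + ctx.length) & 0xFFFFFFFF)
                       and p.other == ctx.other
                       and (p.flags & ~CLOSING) == (ctx.flags & ~CLOSING)
                       and p.length <= ctx.segment_size)
            if matches:
                ctx.length += p.length
                ctx.flags |= p.flags & CLOSING
                ctx.segments += 1
                if (p.length < ctx.segment_size or p.flags & CLOSING
                        or ctx.segments >= max_segments):
                    done.append(ctx)  # appended, then context closes
                    ctx = None
                continue
            done.append(ctx)  # mismatch: flush without appending
            ctx = None
        ctx = SoPacket(p.seq, p.length, p.length, p.flags, p.other)
        if p.flags & CLOSING:  # nothing may follow FIN or PSH
            done.append(ctx)
            ctx = None
    if ctx is not None:
        done.append(ctx)  # stands in for the expiry timer flushing the context
    return done


def original_lengths(so: SoPacket) -> list[int]:
    """Reconstruct the per-packet payload lengths from the segment size."""
    full, rem = divmod(so.length, so.segment_size)
    return [so.segment_size] * full + ([rem] if rem else [])
```

Because only equal-sized packets plus at most one shorter trailing packet are merged, <code>original_lengths</code> recovers the arrival sizes exactly, which is the reversibility property required earlier in this section.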

<span id="asynchronous-close"></span>
====== Asynchronous Close ======

Flows can also be closed asynchronously, due to one of two events. If the device applies RSC to a flow, it must set an expiry timer when the first packet opens an RSC context. The device must send the packet to the host no later than the timeout. The flow timeout value MUST be host readable and SHOULD be host configurable.

The host may also notify the device that it wants RSC to be disabled. Any outstanding context must then be closed asynchronously, in the same manner as if their timers expired.

<span id="so-packet-construction"></span>
==== SO packet construction ====

The device must adjust all protocol header length fields to match the length of the combined payload.

<span id="tcp-header-field-adjustments-1"></span>
===== TCP Header Field Adjustments =====

* Sequence number: Sequence number of the first segment.
* Checksum: undefined.
* Flags: FIN and PSH are set if present in the last segment.
* Flags: CWR is set if present in the first segment.

''IP Header Field Adjustments''

SO packet construction requires protocol specific changes to the preceding IPv4 or IPv6 header:

* IPv4 total length is updated to match the SO packet.
** Or set to zero and the below jumbo rules apply.
* IPv6 payload length is updated to match the SO packet.
** Or set to zero and the below jumbo rules apply.
* IPv4 IP ID is the ID of the first segment.
* IPv4 checksum is valid.

<span id="jumbogram-receive-segmentation-offload"></span>
===== Jumbogram Receive Segmentation Offload =====

Devices SHOULD support coalescing of packet streams that exceed the maximum IPv4 or IPv6 packet size. Jumbogram RSC is the inverse of Jumbogram Segmentation Offload. It solves the length field limitation in the same way: the length field MUST be set to zero and the length communicated out-of-band, likely as a descriptor field.

Jumbogram RSC MUST only be applied if total length exceeds the IPv4 total length or IPv6 payload length field.

<span id="timestamping"></span>
=== Timestamping ===

The device MUST support hardware timestamping at line rate, on both ingress and egress. Timestamps MUST be taken as defined in IEEE 802.3-2022[ref_id:802.3_2022] clause 90.

<span id="measurement-plane"></span>
==== Measurement Plane ====

The vendor SHOULD measure and report any constant delay between the measurement and reference plane (i.e., network), as defined there. There MUST NOT be any variable length delay between measurement and reference plane exceeding 10ns.

On ingress, this implies that timestamps must be taken before any queueing. On egress, the inverse holds. In particular, measurement of a timestamp must not be subject to (PCIe) backpressure delay on communication of the transmit descriptor to the host.

<span id="clock"></span>
==== Clock ====

Timestamps are measurements of a device clock. Most device clock components conform to standard requirements for stratum 3 clocks, such as GR-1244-CORE[ref_id:gr_1244_core] or ITU-T G.812[ref_id:g812] type IV. The device clock SHOULD conform to one of these and report this.

Before using the common terms in this domain, we first define them:

* Resolution is the quantity below which two samples are seen as equal. It is defined as a time interval (e.g., nsec). The range of values that can be expressed is defined in terms of wrap-around time. From this, a minimum bit-width can be derived. Resolution itself is not an integer storage size, however.
* Precision is the distribution of measurements. It indicates repeatability of measurements, and is affected by read uncertainty. Precision is also expressed as a time interval.
* Accuracy is the offset from the true value. A perfectly precise measurement may have a constant offset. In this context, an example is the offset of the measurement plane from the reference plane.

Clock resolution and precision MUST be 10 ns or better. The clock MUST NOT drift more than 10 ppm. This may require a temperature controlled device (TCXO, OCXO or otherwise), but implementation is not prescribed. The clock must have a wraparound no worse than the 64-bit PTPv1 format, which is 2^32 seconds or roughly 136 years.
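The two numeric bounds above can be checked with a few lines of arithmetic. The helper names below are hypothetical, and a year is approximated as 365.25 days:

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def wraparound_years(seconds_bits: int) -> float:
    """Years until a seconds counter of the given bit width wraps around."""
    return 2 ** seconds_bits / SECONDS_PER_YEAR


def max_drift_ns(interval_ns: int, ppm: int = 10) -> int:
    """Worst-case accumulated drift over an interval, at a given ppm bound."""
    return interval_ns * ppm // 1_000_000
```

A 32-bit seconds field wraps after about 136 years. A 10 ppm bound allows up to 10 microseconds of drift per second of free-running operation, which is why frequent synchronization is needed to stay within sub-microsecond uncertainty.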

The counter MUST be monotonically non-decreasing. That is, causality must be maintained: any packet B measured after another packet A at the same measurement plane cannot have a timestamp lower than the timestamp of A. A packet passing through two measurement planes X and Y (such as PHY Tx and Rx when looping through a switch) must have a timestamp at Y greater than or equal to the timestamp at X. Timestamps may be equal in particular if packets are transmitted at intervals shorter than the clock resolution.

<span id="clock-synchronization"></span>
==== Clock Synchronization ====

The device MUST support clock synchronization of host clock to device clock with at most 500 nsec uncertainty. Transmitting an absolute clock reading across a medium such as PCIe itself introduces variable delay that can exceed this bound. The device SHOULD bound this uncertainty, e.g., by implementing a hardware mechanism such as PCI Precision Time Measurement (PTM) [ref_id:pci_ptm]. The vendor MUST report this bound.

The device must expose a clock API to read and control the NIC clock. The device MUST expose at least operations to get absolute value, set absolute value and adjust frequency. These must match the behavior of the <code>gettimex64</code>, <code>adjtime</code> and <code>adjfine</code> or <code>adjfreq</code> operations as defined in Linux <code>ptp_clock_info</code>. The get value operation MUST be implemented as a sandwich algorithm where the device clock reading is reported in between two host clock reads, as described in the PCI PTM link protocol[ref_id:pci_express_5.0, sec 6.22.2]. The frequency adjustment operation MUST allow frequency adjustments at 1 part per billion resolution or better.
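The sandwich algorithm can be sketched as follows. This is an illustrative model, not the Linux or PCI PTM implementation: the function names and the <code>read_device_ns</code> callable are assumptions, and all values are in nanoseconds.

```python
import time


def sandwich_offset(host_before_ns: int, device_ns: int,
                    host_after_ns: int) -> tuple[int, int]:
    """Return (device minus host offset, uncertainty) from one sandwich read."""
    if host_after_ns < host_before_ns:
        raise ValueError("host clock reads are out of order")
    window = host_after_ns - host_before_ns
    # The device read happened somewhere inside the window. Assume the
    # midpoint; the error is then at most half the window.
    midpoint = host_before_ns + window // 2
    return device_ns - midpoint, (window + 1) // 2


def read_sandwiched(read_device_ns) -> tuple[int, int, int]:
    """Bracket a (hypothetical) device clock read with two host clock reads."""
    before = time.monotonic_ns()
    device = read_device_ns()
    after = time.monotonic_ns()
    return before, device, after
```

A host could take several such samples and keep the one with the smallest uncertainty, rejecting any whose uncertainty exceeds the 500 nsec bound stated above.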

<span id="pps-in-and-out"></span>
===== PPS in and out =====

The device MUST support both a Pulse Per Second (PPS) input and output signal.

<span id="host-communication"></span>
==== Host Communication ====

Timestamps may be passed to the host in a truncated format consisting of only the N least significant bits. This N-bit counter MUST have a wraparound of 1 second or greater. This allows the host to extend timestamps received during this interval to the full resolution by reading the full device clock at this timescale.
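One way a host might extend a truncated timestamp is shown below. The function name <code>extend_timestamp</code> is hypothetical, and the sketch assumes that the full clock reading used as reference was taken within half a wraparound period of the packet timestamp.

```python
def extend_timestamp(truncated: int, reference: int, bits: int) -> int:
    """Extend an N-bit truncated timestamp using a recent full clock reading."""
    mask = (1 << bits) - 1
    # Distance from the reference to the timestamp, modulo the counter width.
    delta = (truncated - reference) & mask
    if delta >= 1 << (bits - 1):
        delta -= 1 << bits  # the timestamp was taken before the reference read
    return reference + delta
```

With nanosecond units, a 30-bit counter wraps after about 1.07 seconds, so 30 bits would be the smallest width that meets the 1 second requirement under that assumption.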

<span id="receive"></span>
===== Receive =====

The device MAY support selective receive timestamping, where the host can install a packet filter to select a subset of packets to be timestamped. The device MUST support the option to timestamp all packets.

For RSC packets the timestamp reported MUST be the timestamp of the first segment. This extends IEEE 802.3 Ethernet timestamp measurement to Receive Segment Coalescing packets.

<span id="transmit"></span>
===== Transmit =====

Transmit timestamps SHOULD be passed by the device to the host in a transmit completion descriptor field. If the measurement takes place after the completion notification, the device may instead queue a separate second completion, or directly expose an MMIO timestamp register file to the host, if that design can sustain line rate measurement.

The device MAY require the host to explicitly request a timestamp for each packet, e.g., through a descriptor field.

For TSO packets, measurement happens after segmentation. As with all other timestamps, the timestamp MUST be taken for the first symbol in the message. This corresponds to the first segment.

<span id="applications"></span>
==== Applications ====

NIC hardware timestamping is essential to IEEE 1588 clock synchronization. Applications at hyperscale also include congestion control and distributed applications.

Delay based TCP congestion control takes network RTT as its input signal. Measurement must be more precise than network delay, which in data centers can be tens of microseconds. Hyperscale deployment of advanced congestion control requires a significantly higher measurement rate than PTP clock synchronization does, since RTT estimates are per-connection and measurements are taken on every packet.

NIC hardware timestamps also enable latency measurement of the NIC datapath itself. Incast is a significant concern in hyperscale environments. Concurrent connection establishment can cause queue build up in a NIC if the host CPU, memory or peripheral bus are out of resources. Latency instrumentation can give an earlier and more informative signal than drops alone.

Finally, distributed systems increasingly rely on high precision clock synchronization to offer strongly consistent scalable storage[ref_id:sundial]. Microsoft FaRMv2[ref_id:farm] and CockroachDB are two examples. Serializability in such databases depends on strict event ordering based on timestamps. Transactions can be committed only after a time uncertainty bound has elapsed. Key to scaling transaction rate is bounding this uncertainty.

<span id="traffic-shaping"></span>
=== Traffic Shaping ===

A device MUST implement ingress traffic shaping to mitigate incast impact on high priority traffic. It MAY implement egress traffic shaping to offload this task from the host CPU.

<span id="ingress"></span>
==== Ingress ====

An ingress queue can build up on the device due to incast. If a standing queue can build up in the device, the device SHOULD mitigate head-of-line blocking of high priority traffic by prioritizing traffic based on IP DSCP bits. The device MUST offer at least two traffic bands and MUST support host configurable mapping of DSCP bits to band. The device SHOULD offer weighted round robin (WRR) dequeue with weights configurable by the host. It may implement strict priority. If so, this MUST include starvation prevention with a minimum of 10% of bandwidth for every queue.
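The band mapping and WRR dequeue can be sketched as follows. This is a packet-count model under stated assumptions, not a device design: the class <code>WrrIngress</code> and its methods are hypothetical, weights are positive integers counted in packets rather than bytes, and queues are unbounded.

```python
from collections import deque


class WrrIngress:
    """Toy model of DSCP-to-band mapping with weighted round robin dequeue."""

    def __init__(self, weights, default_band=0):
        self.weights = list(weights)          # packets per round, per band
        self.queues = [deque() for _ in self.weights]
        self.dscp_to_band = [default_band] * 64  # DSCP is a 6-bit field
        self._band = 0                        # band currently being served
        self._credit = self.weights[0]        # packets left in its turn

    def map_dscp(self, dscp, band):
        """Host-configurable mapping of a DSCP value to a band."""
        if not 0 <= dscp < 64:
            raise ValueError("DSCP must be in 0..63")
        if not 0 <= band < len(self.queues):
            raise ValueError("no such band")
        self.dscp_to_band[dscp] = band

    def enqueue(self, dscp, packet):
        self.queues[self.dscp_to_band[dscp]].append(packet)

    def dequeue(self):
        """Return the next packet in WRR order, or None if all bands are empty."""
        # One extra iteration lets the current band be revisited with fresh
        # credit when every other band is empty.
        for _ in range(len(self.queues) + 1):
            queue = self.queues[self._band]
            if queue and self._credit > 0:
                self._credit -= 1
                return queue.popleft()
            # Band is empty or has used its turn: move on and refill credit.
            self._band = (self._band + 1) % len(self.queues)
            self._credit = self.weights[self._band]
        return None
```

With weights <code>[9, 1]</code> and both bands backlogged, the second band receives one packet in every ten in this model. That is a per-packet share and only approximates a 10% bandwidth share when packet sizes are similar.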

<span id="egress"></span>
==== Egress ====

Transmit packet scheduling is discussed in the multi-queue section. The device may additionally offer hardware traffic shaping to offload traffic prioritization from the host CPU. This specification does not ask the device to implement explicit hardware meters, policers or priority queues.

<span id="earliest-departure-time"></span>
===== Earliest Departure Time =====

The device MAY support hardware traffic shaping by holding each packet until its Earliest Departure Time (EDT)[ref_id:google_carousel]. The feature allows the host to specify a time until which the device must defer packet transmission. The same mechanism is also sometimes referred to as Frame Launch time or <code>SO_TXTIME</code>.

EDT can be implemented efficiently in an O(1) data structure. Device implementation is out of scope for this spec, but one potential design is in the form of a two-layer hierarchical timing wheel[ref_id:varghese_tw].
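To show why insertion is O(1), here is a minimal sketch of a single-level timing wheel. It is deliberately simpler than the two-layer hierarchical design mentioned above and is not a proposed device implementation. The class name, the method names, and the choice to size the wheel from a granularity and a horizon are assumptions made for illustration.

```python
class TimingWheel:
    """Single-level timing wheel: O(1) insert, one slot drained per tick."""

    def __init__(self, granularity_ns, horizon_ns, start_ns=0):
        self.gran = granularity_ns
        # Two spare slots absorb rounding at both ends of the horizon.
        self.nslots = horizon_ns // granularity_ns + 2
        self.slots = [[] for _ in range(self.nslots)]
        self.tick = start_ns // granularity_ns  # last tick already drained

    def insert(self, packet, departure_ns):
        """Queue a packet; departure time is rounded up to the next tick."""
        tick = -(-departure_ns // self.gran)  # ceiling division: never early
        offset = tick - self.tick
        if not 0 < offset < self.nslots:
            raise ValueError("departure time already due or beyond wheel span")
        self.slots[tick % self.nslots].append(packet)

    def advance(self, now_ns):
        """Drain and return all packets whose tick has been reached."""
        due = []
        target = now_ns // self.gran
        while self.tick < target:
            self.tick += 1
            slot = self.slots[self.tick % self.nslots]
            due.extend(slot)
            slot.clear()
        return due
```

Because departure times are rounded up to a tick boundary, a packet is never released before its departure time, at the cost of up to one granularity of added delay. A caller is expected to send already-due packets directly instead of inserting them.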

This feature relies on comparing packet departure time against a device clock. It thus depends on a device hardware clock and host clock synchronization as described in the section on timestamping. It requires a transmit descriptor field to encode the departure time.

If the device supports EDT, then it MUST implement this according to the following rules. It MUST send without delay packets which have no departure time set or for which the departure time is in the past. It MUST NOT send a packet with a departure time before that departure time under any conditions. Departure time resolution MUST be 2 us or smaller.

The device MUST be able to accept and queue packets with a departure time up to 50 msec in the future. This “time horizon” is based on congestion control algorithms’ forward looking window. The device likely also has a global maximum storage capacity. The requirement that departure times up to 50 msec must be programmable does not imply that the device has to support enough storage space to queue up to 50 msec of data: actual packet spacing may be sparse. It SHOULD NOT have a maximum per interval capacity. The vendor MUST report all such bounds.

The device MAY support a special slot for queueing packets with a time beyond the time horizon, or it may choose to drop those. The device MUST expose a counter for all packets dropped by the timing wheel due to either resource exhaustion or departure time beyond the horizon. The device SHOULD signal in a transmit completion when a packet was dropped rather than sent.
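The admission rules can be summarized in a small decision sketch. It models one permitted behavior, dropping packets beyond the horizon, and does not model the optional special slot. <code>EdtAdmission</code>, <code>Verdict</code> and the <code>capacity</code> parameter are hypothetical names. The capacity value stands in for a vendor-reported global storage bound.

```python
from enum import Enum

HORIZON_NS = 50_000_000  # 50 msec time horizon from the text


class Verdict(Enum):
    SEND_NOW = "send_now"  # no departure time, or departure time reached
    HOLD = "hold"          # queue until the departure time
    DROP = "drop"          # beyond horizon or out of storage


class EdtAdmission:
    def __init__(self, horizon_ns=HORIZON_NS, capacity=1024):
        self.horizon_ns = horizon_ns
        self.capacity = capacity  # hypothetical global storage bound, in packets
        self.held = 0             # packets currently queued
        self.dropped = 0          # drop counter that must be exposed

    def admit(self, departure_ns, now_ns):
        """Classify a packet given its departure time and the device clock."""
        if departure_ns is None or departure_ns <= now_ns:
            return Verdict.SEND_NOW
        too_far = departure_ns - now_ns > self.horizon_ns
        if too_far or self.held >= self.capacity:
            self.dropped += 1
            return Verdict.DROP
        self.held += 1
        return Verdict.HOLD

    def release(self):
        """Account for a held packet that has been transmitted."""
        self.held -= 1
```

In this model a <code>DROP</code> verdict would also be the point at which a device reports the drop in the transmit completion.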

<span id="protocol-support"></span>