| Checksum Offloads in the Linux Networking Stack |
| |
| |
| Introduction |
| ============ |
| |
| This document describes a set of techniques in the Linux networking stack |
| to take advantage of checksum offload capabilities of various NICs. |
| |
| The following technologies are described: |
| * TX Checksum Offload |
| * LCO: Local Checksum Offload |
| * RCO: Remote Checksum Offload |
| |
| Things that should be documented here but aren't yet: |
| * RX Checksum Offload |
| * CHECKSUM_UNNECESSARY conversion |
| |
| |
| TX Checksum Offload |
| =================== |
| |
| The interface for offloading a transmit checksum to a device is explained |
| in detail in comments near the top of include/linux/skbuff.h. |
| In brief, it allows to request the device fill in a single ones-complement |
| checksum defined by the sk_buff fields skb->csum_start and |
| skb->csum_offset. The device should compute the 16-bit ones-complement |
| checksum (i.e. the 'IP-style' checksum) from csum_start to the end of the |
| packet, and fill in the result at (csum_start + csum_offset). |
| Because csum_offset cannot be negative, this ensures that the previous |
| value of the checksum field is included in the checksum computation, thus |
| it can be used to supply any needed corrections to the checksum (such as |
| the sum of the pseudo-header for UDP or TCP). |
| This interface only allows a single checksum to be offloaded. Where |
| encapsulation is used, the packet may have multiple checksum fields in |
| different header layers, and the rest will have to be handled by another |
| mechanism such as LCO or RCO. |
| No offloading of the IP header checksum is performed; it is always done in |
| software. This is OK because when we build the IP header, we obviously |
| have it in cache, so summing it isn't expensive. It's also rather short. |
| The requirements for GSO are more complicated, because when segmenting an |
| encapsulated packet both the inner and outer checksums may need to be |
| edited or recomputed for each resulting segment. See the skbuff.h comment |
| (section 'E') for more details. |
| |
| A driver declares its offload capabilities in netdev->hw_features; see |
| Documentation/networking/netdev-features for more. Note that a device |
| which only advertises NETIF_F_IP[V6]_CSUM must still obey the csum_start |
| and csum_offset given in the SKB; if it tries to deduce these itself in |
| hardware (as some NICs do) the driver should check that the values in the |
| SKB match those which the hardware will deduce, and if not, fall back to |
| checksumming in software instead (with skb_checksum_help or one of the |
| skb_csum_off_chk* functions as mentioned in include/linux/skbuff.h). This |
| is a pain, but that's what you get when hardware tries to be clever. |
| |
| The stack should, for the most part, assume that checksum offload is |
| supported by the underlying device. The only place that should check is |
| validate_xmit_skb(), and the functions it calls directly or indirectly. |
| That function compares the offload features requested by the SKB (which |
| may include other offloads besides TX Checksum Offload) and, if they are |
| not supported or enabled on the device (determined by netdev->features), |
| performs the corresponding offload in software. In the case of TX |
| Checksum Offload, that means calling skb_checksum_help(skb). |
| |
| |
| LCO: Local Checksum Offload |
| =========================== |
| |
| LCO is a technique for efficiently computing the outer checksum of an |
| encapsulated datagram when the inner checksum is due to be offloaded. |
| The ones-complement sum of a correctly checksummed TCP or UDP packet is |
| equal to the sum of the pseudo header, because everything else gets |
| 'cancelled out' by the checksum field. This is because the sum was |
| complemented before being written to the checksum field. |
| More generally, this holds in any case where the 'IP-style' ones complement |
| checksum is used, and thus any checksum that TX Checksum Offload supports. |
| That is, if we have set up TX Checksum Offload with a start/offset pair, we |
| know that _after the device has filled in that checksum_, the ones |
| complement sum from csum_start to the end of the packet will be equal to |
| _whatever value we put in the checksum field beforehand_. This allows us |
| to compute the outer checksum without looking at the payload: we simply |
| stop summing when we get to csum_start, then add the 16-bit word at |
| (csum_start + csum_offset). |
| Then, when the true inner checksum is filled in (either by hardware or by |
| skb_checksum_help()), the outer checksum will become correct by virtue of |
| the arithmetic. |
| |
| LCO is performed by the stack when constructing an outer UDP header for an |
| encapsulation such as VXLAN or GENEVE, in udp_set_csum(). Similarly for |
| the IPv6 equivalents, in udp6_set_csum(). |
| It is also performed when constructing an IPv4 GRE header, in |
| net/ipv4/ip_gre.c:build_header(). It is *not* currently performed when |
| constructing an IPv6 GRE header; the GRE checksum is computed over the |
| whole packet in net/ipv6/ip6_gre.c:ip6gre_xmit2(), but it should be |
| possible to use LCO here as IPv6 GRE still uses an IP-style checksum. |
| All of the LCO implementations use a helper function lco_csum(), in |
| include/linux/skbuff.h. |
| |
| LCO can safely be used for nested encapsulations; in this case, the outer |
| encapsulation layer will sum over both its own header and the 'middle' |
| header. This does mean that the 'middle' header will get summed multiple |
| times, but there doesn't seem to be a way to avoid that without incurring |
| bigger costs (e.g. in SKB bloat). |
| |
| |
| RCO: Remote Checksum Offload |
| ============================ |
| |
| RCO is a technique for eliding the inner checksum of an encapsulated |
| datagram, allowing the outer checksum to be offloaded. It does, however, |
| involve a change to the encapsulation protocols, which the receiver must |
| also support. For this reason, it is disabled by default. |
| RCO is detailed in the following Internet-Drafts: |
| https://tools.ietf.org/html/draft-herbert-remotecsumoffload-00 |
| https://tools.ietf.org/html/draft-herbert-vxlan-rco-00 |
| In Linux, RCO is implemented individually in each encapsulation protocol, |
| and most tunnel types have flags controlling its use. For instance, VXLAN |
| has the flag VXLAN_F_REMCSUM_TX (per struct vxlan_rdst) to indicate that |
| RCO should be used when transmitting to a given remote destination. |