| |
| 1. Control Interfaces |
| |
| The interfaces for receiving network packet timestamps are: |
| |
| * SO_TIMESTAMP |
| Generates a timestamp for each incoming packet in (not necessarily |
| monotonic) system time. Reports the timestamp via recvmsg() in a |
| control message as struct timeval (usec resolution). |
| |
| * SO_TIMESTAMPNS |
| Same timestamping mechanism as SO_TIMESTAMP, but reports the |
| timestamp as struct timespec (nsec resolution). |
| |
| * IP_MULTICAST_LOOP + SO_TIMESTAMP[NS] |
| Only for multicast: approximate transmit timestamp obtained by |
| reading the looped packet receive timestamp. |
| |
| * SO_TIMESTAMPING |
| Generates timestamps on reception, transmission or both. Supports |
| multiple timestamp sources, including hardware. Supports generating |
| timestamps for stream sockets. |
| |
| |
| 1.1 SO_TIMESTAMP: |
| |
| This socket option enables timestamping of datagrams on the reception |
| path. Because the destination socket, if any, is not known early in |
| the network stack, the feature has to be enabled for all packets. The |
| same is true for all early receive timestamp options. |
| |
| For interface details, see `man 7 socket`. |
| |
| |
| 1.2 SO_TIMESTAMPNS: |
| |
| This option is identical to SO_TIMESTAMP except for the returned data type. |
| Its struct timespec allows for higher resolution (ns) timestamps than the |
| timeval of SO_TIMESTAMP (us). |
| |
| |
| 1.3 SO_TIMESTAMPING: |
| |
| Supports multiple types of timestamp requests. As a result, this |
| socket option takes a bitmap of flags, not a boolean. In |
| |
| err = setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &val, sizeof(val)); |
| |
| val is an integer with any of the following bits set. Setting any |
| other bit returns EINVAL and does not change the current state. |
| |
| |
| 1.3.1 Timestamp Generation |
| |
| Some bits are requests to the stack to try to generate timestamps. Any |
| combination of them is valid. Changes to these bits apply to newly |
| created packets, not to packets already in the stack. As a result, it |
| is possible to selectively request timestamps for a subset of packets |
| (e.g., for sampling) by embedding a send() call within two setsockopt |
| calls, one to enable timestamp generation and one to disable it. |
| Timestamps may also be generated for reasons other than being |
| requested by a particular socket, such as when receive timestamping is |
| enabled system wide, as explained earlier. |
| |
| SOF_TIMESTAMPING_RX_HARDWARE: |
| Request rx timestamps generated by the network adapter. |
| |
| SOF_TIMESTAMPING_RX_SOFTWARE: |
| Request rx timestamps when data enters the kernel. These timestamps |
| are generated just after a device driver hands a packet to the |
| kernel receive stack. |
| |
| SOF_TIMESTAMPING_TX_HARDWARE: |
| Request tx timestamps generated by the network adapter. |
| |
| SOF_TIMESTAMPING_TX_SOFTWARE: |
| Request tx timestamps when data leaves the kernel. These timestamps |
| are generated in the device driver as close as possible, but always |
| prior to, passing the packet to the network interface. Hence, they |
| require driver support and may not be available for all devices. |
| |
| SOF_TIMESTAMPING_TX_SCHED: |
| Request tx timestamps prior to entering the packet scheduler. Kernel |
| transmit latency is, if long, often dominated by queuing delay. The |
| difference between this timestamp and one taken at |
| SOF_TIMESTAMPING_TX_SOFTWARE will expose this latency independent |
| of protocol processing. The latency incurred in protocol |
| processing, if any, can be computed by subtracting a userspace |
| timestamp taken immediately before send() from this timestamp. On |
| machines with virtual devices where a transmitted packet travels |
| through multiple devices and, hence, multiple packet schedulers, |
| a timestamp is generated at each layer. This allows for fine |
| grained measurement of queuing delay. |
| |
| SOF_TIMESTAMPING_TX_ACK: |
| Request tx timestamps when all data in the send buffer has been |
| acknowledged. This only makes sense for reliable protocols. It is |
| currently only implemented for TCP. For that protocol, it may |
| over-report measurement, because the timestamp is generated when all |
| data up to and including the buffer at send() was acknowledged: the |
| cumulative acknowledgment. The mechanism ignores SACK and FACK. |
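| |
| The latency computations described for SOF_TIMESTAMPING_TX_SCHED are |
| plain timespec subtractions. A minimal sketch, assuming both |
| timestamps have already been read from the socket; the helper names |
| timespec_delta_ns() and queuing_delay_ns() are hypothetical: |

```c
#include <stdint.h>
#include <time.h>

/* Return (later - earlier) in nanoseconds; negative if the order is reversed. */
int64_t timespec_delta_ns(const struct timespec *later,
                          const struct timespec *earlier)
{
    int64_t sec = (int64_t)later->tv_sec - (int64_t)earlier->tv_sec;
    int64_t nsec = (int64_t)later->tv_nsec - (int64_t)earlier->tv_nsec;

    return sec * INT64_C(1000000000) + nsec;
}

/*
 * Queuing delay: time from entering the packet scheduler
 * (SOF_TIMESTAMPING_TX_SCHED) to the driver handing the packet to the
 * device (SOF_TIMESTAMPING_TX_SOFTWARE).
 */
int64_t queuing_delay_ns(const struct timespec *ts_sched,
                         const struct timespec *ts_software)
{
    return timespec_delta_ns(ts_software, ts_sched);
}
```

| The same subtraction, applied to a userspace timestamp taken just |
| before send() and the TX_SCHED timestamp, gives the protocol |
| processing latency, provided both use the same clock. |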
| |
| |
| 1.3.2 Timestamp Reporting |
| |
| The other three bits control which timestamps will be reported in a |
| generated control message. Changes to the bits take immediate |
| effect at the timestamp reporting locations in the stack. Timestamps |
| are only reported for packets that also have the relevant timestamp |
| generation request set. |
| |
| SOF_TIMESTAMPING_SOFTWARE: |
| Report any software timestamps when available. |
| |
| SOF_TIMESTAMPING_SYS_HARDWARE: |
| This option is deprecated and ignored. |
| |
| SOF_TIMESTAMPING_RAW_HARDWARE: |
| Report hardware timestamps as generated by |
| SOF_TIMESTAMPING_TX_HARDWARE when available. |
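| |
| Because generation bits apply to newly created packets while |
| reporting bits take effect immediately, a sampling sender can toggle |
| only the generation bit and leave the reporting bit set. A minimal |
| sketch under that assumption; set_timestamping() and send_sampled() |
| are hypothetical helper names: |

```c
#include <linux/net_tstamp.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Set the SO_TIMESTAMPING flag bitmap; returns 0 or -1 with errno set. */
int set_timestamping(int fd, int flags)
{
    return setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &flags, sizeof(flags));
}

/*
 * Send one buffer with a software tx timestamp requested, then stop
 * requesting timestamps for later packets. The reporting bit stays set
 * so the timestamp of the sampled packet can still be reported.
 */
ssize_t send_sampled(int fd, const void *buf, size_t len)
{
    ssize_t sent;

    if (set_timestamping(fd, SOF_TIMESTAMPING_TX_SOFTWARE |
                             SOF_TIMESTAMPING_SOFTWARE) < 0)
        return -1;

    sent = send(fd, buf, len, 0);

    /* Clear only the generation bit; keep SOF_TIMESTAMPING_SOFTWARE. */
    if (set_timestamping(fd, SOF_TIMESTAMPING_SOFTWARE) < 0)
        return -1;

    return sent;
}
```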
| |
| |
| 1.3.3 Timestamp Options |
| |
| The interface supports the following options: |
| |
| SOF_TIMESTAMPING_OPT_ID: |
| |
| Generate a unique identifier along with each packet. A process can |
| have multiple concurrent timestamping requests outstanding. Packets |
| can be reordered in the transmit path, for instance in the packet |
| scheduler. In that case timestamps will be queued onto the error |
| queue out of order from the original send() calls. It is then not |
| always possible to uniquely match timestamps to the original send() |
| calls based on timestamp order or payload inspection alone. |
| |
| This option associates each packet at send() with a unique |
| identifier and returns that along with the timestamp. The identifier |
| is derived from a per-socket u32 counter (that wraps). For datagram |
| sockets, the counter increments with each sent packet. For stream |
| sockets, it increments with every byte. |
| |
| The counter starts at zero. It is initialized the first time that |
| the socket option is enabled. It is reset each time the option is |
| enabled after having been disabled. Resetting the counter does not |
| change the identifiers of existing packets in the system. |
| |
| This option is implemented only for transmit timestamps. There, the |
| timestamp is always looped along with a struct sock_extended_err. |
| The option modifies field ee_data to pass an id that is unique |
| among all possibly concurrently outstanding timestamp requests for |
| that socket. |
| |
| |
| SOF_TIMESTAMPING_OPT_CMSG: |
| |
| Support recv() cmsg for all timestamped packets. Control messages |
| are already supported unconditionally on all packets with receive |
| timestamps and on IPv6 packets with transmit timestamp. This option |
| extends them to IPv4 packets with transmit timestamp. One use case |
| is to correlate packets with their egress device, by enabling socket |
| option IP_PKTINFO simultaneously. |
| |
| |
| SOF_TIMESTAMPING_OPT_TSONLY: |
| |
| Applies to transmit timestamps only. Makes the kernel return the |
| timestamp as a cmsg alongside an empty packet, as opposed to |
| alongside the original packet. This reduces the amount of memory |
| charged to the socket's receive budget (SO_RCVBUF) and delivers |
| the timestamp even if sysctl net.core.tstamp_allow_data is 0. |
| This option disables SOF_TIMESTAMPING_OPT_CMSG. |
| |
| |
| New applications are encouraged to pass SOF_TIMESTAMPING_OPT_ID to |
| disambiguate timestamps and SOF_TIMESTAMPING_OPT_TSONLY to operate |
| regardless of the setting of sysctl net.core.tstamp_allow_data. |
| |
| An exception is when a process needs additional cmsg data, for |
| instance SOL_IP/IP_PKTINFO to detect the egress network interface. |
| Then pass option SOF_TIMESTAMPING_OPT_CMSG. This option depends on |
| having access to the contents of the original packet, so cannot be |
| combined with SOF_TIMESTAMPING_OPT_TSONLY. |
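| |
| As an illustration of the recommendation above, the following sketch |
| builds a flag set for software transmit timestamps with |
| SOF_TIMESTAMPING_OPT_ID and SOF_TIMESTAMPING_OPT_TSONLY. The choice |
| of TX_SOFTWARE as the generation point and the function names are |
| assumptions made for the example: |

```c
#include <linux/net_tstamp.h>
#include <sys/socket.h>

/* Flag set for software tx timestamps with the recommended options. */
int recommended_tx_flags(void)
{
    return SOF_TIMESTAMPING_TX_SOFTWARE |  /* generate in the driver */
           SOF_TIMESTAMPING_SOFTWARE |     /* report software timestamps */
           SOF_TIMESTAMPING_OPT_ID |       /* attach an id to each timestamp */
           SOF_TIMESTAMPING_OPT_TSONLY;    /* do not loop the payload back */
}

/* Apply the flag set to a socket; returns 0 or -1 with errno set. */
int enable_recommended_timestamping(int fd)
{
    int val = recommended_tx_flags();

    return setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &val, sizeof(val));
}
```

| A process that needs IP_PKTINFO with its timestamps would replace |
| SOF_TIMESTAMPING_OPT_TSONLY with SOF_TIMESTAMPING_OPT_CMSG. |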
| |
| |
| 1.4 Bytestream Timestamps |
| |
| The SO_TIMESTAMPING interface supports timestamping of bytes in a |
| bytestream. Each request is interpreted as a request for when the |
| entire contents of the buffer has passed a timestamping point. That |
| is, for streams option SOF_TIMESTAMPING_TX_SOFTWARE will record |
| when all bytes have reached the device driver, regardless of how |
| many packets the data has been converted into. |
| |
| In general, bytestreams have no natural delimiters and therefore |
| correlating a timestamp with data is non-trivial. A range of bytes |
| may be split across segments, and any segments may be merged (possibly |
| coalescing sections of previously segmented buffers associated with |
| independent send() calls). Segments can be reordered and the same |
| byte range can coexist in multiple segments for protocols that |
| implement retransmissions. |
| |
| It is essential that all timestamps implement the same semantics, |
| regardless of these possible transformations, as otherwise they are |
| incomparable. Handling "rare" corner cases differently from the |
| simple case (a 1:1 mapping from buffer to skb) is insufficient |
| because performance debugging often needs to focus on such outliers. |
| |
| In practice, timestamps can be correlated with segments of a |
| bytestream consistently, if both semantics of the timestamp and the |
| timing of measurement are chosen correctly. This challenge is no |
| different from deciding on a strategy for IP fragmentation. There, the |
| definition is that only the first fragment is timestamped. For |
| bytestreams, we chose that a timestamp is generated only when all |
| bytes have passed a point. SOF_TIMESTAMPING_TX_ACK as defined is easy to |
| implement and reason about. An implementation that has to take into |
| account SACK would be more complex due to possible transmission holes |
| and out of order arrival. |
| |
| On the host, TCP can also break the simple 1:1 mapping from buffer to |
| skbuff as a result of Nagle, cork, autocork, segmentation and GSO. The |
| implementation ensures correctness in all cases by tracking the |
| individual last byte passed to send(), even if it is no longer the |
| last byte after an skbuff extend or merge operation. It stores the |
| relevant sequence number in skb_shinfo(skb)->tskey. Because an skbuff |
| has only one such field, only one timestamp can be generated. |
| |
| In rare cases, a timestamp request can be missed if two requests are |
| collapsed onto the same skb. A process can detect this situation by |
| enabling SOF_TIMESTAMPING_OPT_ID and comparing the byte offset at |
| send time with the value returned for each timestamp. It can prevent |
| the situation by always flushing the TCP stack in between requests, |
| for instance by enabling TCP_NODELAY and disabling TCP_CORK and |
| autocork. |
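| |
| Both precautions can be sketched in a few lines. The first helper |
| sets the two per-socket options named above; autocorking is governed |
| by a sysctl and is not touched. The second helper computes the |
| identifier a process might expect for a send() call, ASSUMING that |
| the value returned with SOF_TIMESTAMPING_OPT_ID is the zero-based |
| offset of the last byte of the buffer, counted from the moment the |
| option was enabled. Both function names are hypothetical: |

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdint.h>
#include <sys/socket.h>

/*
 * Ask TCP not to coalesce consecutive send() calls: enable TCP_NODELAY
 * and make sure TCP_CORK is off. Returns 0 or -1 with errno set.
 */
int tcp_flush_between_requests(int fd)
{
    int on = 1, off = 0;

    if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on)) < 0)
        return -1;
    return setsockopt(fd, IPPROTO_TCP, TCP_CORK, &off, sizeof(off));
}

/*
 * Expected id for a send() of len bytes (len > 0) issued after
 * bytes_sent_before bytes were already written. Assumption: the id is
 * the offset of the last byte of the buffer. The counter is a u32, so
 * the result wraps modulo 2^32.
 */
uint32_t expected_stream_id(uint64_t bytes_sent_before, uint32_t len)
{
    return (uint32_t)(bytes_sent_before + len - 1);
}
```

| A timestamp whose id does not match any expected value, or an |
| expected value that never shows up, points at two requests having |
| been collapsed onto one skb. |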
| |
| These precautions ensure that the timestamp is generated only when all |
| bytes have passed a timestamp point, assuming that the network stack |
| itself does not reorder the segments. The stack indeed tries to avoid |
| reordering. The one exception is under administrator control: it is |
| possible to construct a packet scheduler configuration that delays |
| segments from the same stream differently. Such a setup would be |
| unusual. |
| |
| |
| 2. Data Interfaces |
| |
| Timestamps are read using the ancillary data feature of recvmsg(). |
| See `man 3 cmsg` for details of this interface. The socket manual |
| page (`man 7 socket`) describes how timestamps generated with |
| SO_TIMESTAMP and SO_TIMESTAMPNS records can be retrieved. |
| |
| |
| 2.1 SCM_TIMESTAMPING records |
| |
| These timestamps are returned in a control message with cmsg_level |
| SOL_SOCKET, cmsg_type SCM_TIMESTAMPING, and payload of type |
| |
| struct scm_timestamping { |
| struct timespec ts[3]; |
| }; |
| |
| The structure can return up to three timestamps. This is a legacy |
| feature. Only one field is non-zero at any time. Most timestamps |
| are passed in ts[0]. Hardware timestamps are passed in ts[2]. |
| |
| ts[1] used to hold hardware timestamps converted to system time. |
| Instead, expose the hardware clock device on the NIC directly as |
| a HW PTP clock source, to allow time conversion in userspace and |
| optionally synchronize system time with a userspace PTP stack such |
| as linuxptp. For the PTP clock API, see Documentation/ptp/ptp.txt. |
| |
| 2.1.1 Transmit timestamps with MSG_ERRQUEUE |
| |
| For transmit timestamps the outgoing packet is looped back to the |
| socket's error queue with the send timestamp(s) attached. A process |
| receives the timestamps by calling recvmsg() with flag MSG_ERRQUEUE |
| set and with a msg_control buffer sufficiently large to receive the |
| relevant metadata structures. The recvmsg call returns the original |
| outgoing data packet with two ancillary messages attached. |
| |
| A message of cmsg_level SOL_IP(V6) and cmsg_type IP(V6)_RECVERR |
| embeds a struct sock_extended_err. This defines the error type. For |
| timestamps, the ee_errno field is ENOMSG. The other ancillary message |
| will have cmsg_level SOL_SOCKET and cmsg_type SCM_TIMESTAMPING. This |
| embeds the struct scm_timestamping. |
| |
| |
| 2.1.1.2 Timestamp types |
| |
| The semantics of the three struct timespec are defined by field |
| ee_info in the extended error structure. It contains a value of |
| type SCM_TSTAMP_* to define the actual timestamp passed in |
| scm_timestamping. |
| |
| The SCM_TSTAMP_* types are 1:1 matches to the SOF_TIMESTAMPING_* |
| control fields discussed previously, with one exception. For legacy |
| reasons, SCM_TSTAMP_SND is equal to zero and can be set for both |
| SOF_TIMESTAMPING_TX_HARDWARE and SOF_TIMESTAMPING_TX_SOFTWARE. It |
| is the first if ts[2] is non-zero, the second otherwise, in which |
| case the timestamp is stored in ts[0]. |
| |
| |
| 2.1.1.3 Fragmentation |
| |
| Fragmentation of outgoing datagrams is rare, but is possible, e.g., by |
| explicitly disabling PMTU discovery. If an outgoing packet is fragmented, |
| then only the first fragment is timestamped and returned to the sending |
| socket. |
| |
| |
| 2.1.1.4 Packet Payload |
| |
| The calling application is often not interested in receiving the whole |
| packet payload that it passed to the stack originally: the socket |
| error queue mechanism is just a method to piggyback the timestamp on. |
| In this case, the application can choose to read datagrams with a |
| smaller buffer, possibly even of length 0. The payload is truncated |
| accordingly. Until the process calls recvmsg() on the error queue, |
| however, the full packet is queued, taking up budget from SO_RCVBUF. |
| |
| |
| 2.1.1.5 Blocking Read |
| |
| Reading from the error queue is always a non-blocking operation. To |
| block waiting on a timestamp, use poll or select. poll() will return |
| POLLERR in pollfd.revents if any data is ready on the error queue. |
| There is no need to pass this flag in pollfd.events. This flag is |
| ignored on request. See also `man 2 poll`. |
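| |
| A blocking wait followed by an error queue read can be sketched as |
| follows. The read passes no payload buffer, so any looped packet |
| data is truncated to zero bytes and only the control messages are |
| kept. The function name and return convention are hypothetical: |

```c
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Wait up to timeout_ms for an error queue entry, then read it into
 * *msg with ctrl as the control buffer. Returns 1 if an entry was read
 * (msg->msg_controllen holds the control data length), 0 on timeout,
 * -1 on error.
 */
int wait_errqueue(int fd, int timeout_ms, struct msghdr *msg,
                  void *ctrl, size_t ctrl_len)
{
    /* events stays 0: POLLERR is reported without being requested. */
    struct pollfd pfd = { .fd = fd, .events = 0 };
    int ret;

    ret = poll(&pfd, 1, timeout_ms);
    if (ret <= 0)
        return ret;
    if (!(pfd.revents & POLLERR))
        return -1;   /* woken for another reason, e.g. POLLHUP */

    memset(msg, 0, sizeof(*msg));
    msg->msg_control = ctrl;
    msg->msg_controllen = ctrl_len;

    /* No iovec is set, so the payload is truncated to length 0. */
    if (recvmsg(fd, msg, MSG_ERRQUEUE) < 0)
        return -1;
    return 1;
}
```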
| |
| |
| 2.1.2 Receive timestamps |
| |
| On reception, there is no reason to read from the socket error queue. |
| The SCM_TIMESTAMPING ancillary data is sent along with the packet data |
| on a normal recvmsg(). Since this is not a socket error, it is not |
| accompanied by a message SOL_IP(V6)/IP(V6)_RECVERR. In this case, |
| the meaning of the three fields in struct scm_timestamping is |
| implicitly defined. ts[0] holds a software timestamp if set, ts[1] |
| is again deprecated and ts[2] holds a hardware timestamp if set. |
| |
| |
| 3. Hardware Timestamping configuration: SIOCSHWTSTAMP and SIOCGHWTSTAMP |
| |
| Hardware time stamping must also be initialized for each device driver |
| that is expected to do hardware time stamping. The parameter is defined in |
| /include/linux/net_tstamp.h as: |
| |
| struct hwtstamp_config { |
| int flags; /* no flags defined right now, must be zero */ |
| int tx_type; /* HWTSTAMP_TX_* */ |
| int rx_filter; /* HWTSTAMP_FILTER_* */ |
| }; |
| |
| Desired behavior is passed into the kernel and to a specific device by |
| calling ioctl(SIOCSHWTSTAMP) with a pointer to a struct ifreq whose |
| ifr_data points to a struct hwtstamp_config. The tx_type and |
| rx_filter are hints to the driver about what it is expected to do. If |
| the requested fine-grained filtering for incoming packets is not |
| supported, the driver may time stamp more than just the requested types |
| of packets. |
| |
| Drivers are free to use a more permissive configuration than the requested |
| configuration. Drivers are expected to implement directly only the most |
| generic mode that the hardware can support. For example, if the hardware can |
| support HWTSTAMP_FILTER_PTP_V2_EVENT, then the driver should generally upscale |
| HWTSTAMP_FILTER_PTP_V2_L2_SYNC, and so forth, as HWTSTAMP_FILTER_PTP_V2_EVENT |
| is more generic (and more useful to applications). |
| |
| A driver which supports hardware time stamping shall update the struct |
| with the actual, possibly more permissive configuration. If the |
| requested packets cannot be time stamped, then nothing should be |
| changed and ERANGE shall be returned (in contrast to EINVAL, which |
| indicates that SIOCSHWTSTAMP is not supported at all). |
| |
| Only processes with admin rights may change the configuration. User |
| space is responsible to ensure that multiple processes don't interfere |
| with each other and that the settings are reset. |
| |
| Any process can read the actual configuration by passing this |
| structure to ioctl(SIOCGHWTSTAMP) in the same way. However, this has |
| not been implemented in all drivers. |
| |
| /* possible values for hwtstamp_config->tx_type */ |
| enum { |
| /* |
| * no outgoing packet will need hardware time stamping; |
| * should a packet arrive which asks for it, no hardware |
| * time stamping will be done |
| */ |
| HWTSTAMP_TX_OFF, |
| |
| /* |
| * enables hardware time stamping for outgoing packets; |
| * the sender of the packet decides which are to be |
| * time stamped by setting SOF_TIMESTAMPING_TX_HARDWARE |
| * before sending the packet |
| */ |
| HWTSTAMP_TX_ON, |
| }; |
| |
| /* possible values for hwtstamp_config->rx_filter */ |
| enum { |
| /* time stamp no incoming packet at all */ |
| HWTSTAMP_FILTER_NONE, |
| |
| /* time stamp any incoming packet */ |
| HWTSTAMP_FILTER_ALL, |
| |
| /* return value: time stamp all packets requested plus some others */ |
| HWTSTAMP_FILTER_SOME, |
| |
| /* PTP v1, UDP, any kind of event packet */ |
| HWTSTAMP_FILTER_PTP_V1_L4_EVENT, |
| |
| /* for the complete list of values, please check |
| * the include file /include/linux/net_tstamp.h |
| */ |
| }; |
| |
| 3.1 Hardware Timestamping Implementation: Device Drivers |
| |
| A driver which supports hardware time stamping must support the |
| SIOCSHWTSTAMP ioctl and update the supplied struct hwtstamp_config with |
| the actual values as described in the section on SIOCSHWTSTAMP. It |
| should also support SIOCGHWTSTAMP. |
| |
| Time stamps for received packets must be stored in the skb. To get a pointer |
| to the shared time stamp structure of the skb call skb_hwtstamps(). Then |
| set the time stamps in the structure: |
| |
| struct skb_shared_hwtstamps { |
| /* hardware time stamp transformed into duration |
| * since arbitrary point in time |
| */ |
| ktime_t hwtstamp; |
| }; |
| |
| Time stamps for outgoing packets are to be generated as follows: |
| - In hard_start_xmit(), check whether |
| (skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP) is non-zero. If so, the |
| driver is expected to do hardware time stamping. |
| - If this is possible for the skb and requested, then declare |
| that the driver is doing the time stamping by setting the flag |
| SKBTX_IN_PROGRESS in skb_shinfo(skb)->tx_flags, e.g. with |
| |
| skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS; |
| |
| You might want to keep a pointer to the associated skb for the next step |
| and not free the skb. A driver not supporting hardware time stamping doesn't |
| do that. A driver must never touch sk_buff::tstamp! It is used to store |
| software generated time stamps by the network subsystem. |
| - Driver should call skb_tx_timestamp() as close to passing sk_buff to hardware |
| as possible. skb_tx_timestamp() provides a software time stamp if requested |
| and hardware timestamping is not possible (SKBTX_IN_PROGRESS not set). |
| - As soon as the driver has sent the packet and/or obtained a |
| hardware time stamp for it, it passes the time stamp back by |
| calling skb_tstamp_tx() with the original skb and the raw |
| hardware time stamp. skb_tstamp_tx() clones the original skb and |
| adds the timestamps, therefore the original skb has to be freed now. |
| If obtaining the hardware time stamp somehow fails, then the driver |
| should not fall back to software time stamping. The rationale is that |
| this would occur at a later time in the processing pipeline than other |
| software time stamping and therefore could lead to unexpected deltas |
| between time stamps. |