| Ethernet switch device driver model (switchdev) |
| =============================================== |
| Copyright (c) 2014 Jiri Pirko <jiri@resnulli.us> |
| Copyright (c) 2014-2015 Scott Feldman <sfeldma@gmail.com> |
| |
| |
| The Ethernet switch device driver model (switchdev) is an in-kernel driver |
| model for switch devices which offload the forwarding (data) plane from the |
| kernel. |
| |
| Figure 1 is a block diagram showing the components of the switchdev model for |
| an example setup using a data-center-class switch ASIC chip. Other setups |
| with SR-IOV or soft switches, such as OVS, are possible. |
| |
| |
| User-space tools |
| |
| user space | |
| +-------------------------------------------------------------------+ |
| kernel | Netlink |
| | |
| +--------------+-------------------------------+ |
| | Network stack | |
| | (Linux) | |
| | | |
| +----------------------------------------------+ |
| |
| sw1p2 sw1p4 sw1p6 |
| sw1p1 + sw1p3 + sw1p5 + eth1 |
| + | + | + | + |
| | | | | | | | |
| +--+----+----+----+-+--+----+---+ +-----+-----+ |
| | Switch driver | | mgmt | |
| | (this document) | | driver | |
| | | | | |
| +--------------+----------------+ +-----------+ |
| | |
| kernel | HW bus (eg PCI) |
| +-------------------------------------------------------------------+ |
| hardware | |
| +--------------+---+------------+ |
| | Switch device (sw1) | |
| | +----+ +--------+ |
| | | v offloaded data path | mgmt port |
| | | | | |
| +--|----|----+----+----+----+---+ |
| | | | | | | |
| + + + + + + |
| p1 p2 p3 p4 p5 p6 |
| |
| front-panel ports |
| |
| |
| Fig 1. |
| |
| |
| Include Files |
| ------------- |
| |
| #include <linux/netdevice.h> |
| #include <net/switchdev.h> |
| |
| |
| Configuration |
| ------------- |
| |
| Use "depends NET_SWITCHDEV" in driver's Kconfig to ensure switchdev model |
| support is built for driver. |
| |
| |
| Switch Ports |
| ------------ |
| |
| On switchdev driver initialization, the driver will allocate and register a |
| struct net_device (using register_netdev()) for each enumerated physical switch |
| port, called the port netdev. A port netdev is the software representation of |
| the physical port and provides a conduit for control traffic to/from the |
| controller (the kernel) and the network, as well as an anchor point for higher |
| level constructs such as bridges, bonds, VLANs, tunnels, and L3 routers. Using |
| standard netdev tools (iproute2, ethtool, etc), the port netdev can also |
| provide to the user access to the physical properties of the switch port such |
| as PHY link state and I/O statistics. |
| |
| There is (currently) no higher-level kernel object for the switch beyond the |
| port netdevs. All of the switchdev driver ops are netdev ops or switchdev ops. |
| |
| A switch management port is outside the scope of the switchdev driver model. |
| Typically, the management port is not participating in offloaded data plane and |
| is loaded with a different driver, such as a NIC driver, on the management port |
| device. |
| |
| Port Netdev Naming |
| ^^^^^^^^^^^^^^^^^^ |
| |
| Udev rules should be used for port netdev naming, using some unique attribute |
| of the port as a key, for example the port MAC address or the port PHYS name. |
| Hard-coding of kernel netdev names within the driver is discouraged; let the |
| kernel pick the default netdev name, and let udev set the final name based on a |
| port attribute. |
| |
| Using port PHYS name (ndo_get_phys_port_name) for the key is particularly |
| useful for dynamically-named ports where the device names its ports based on |
| external configuration. For example, if a physical 40G port is split logically |
| into 4 10G ports, resulting in 4 port netdevs, the device can give a unique |
| name for each port using port PHYS name. The udev rule would be: |
| |
| SUBSYSTEM=="net", ACTION=="add", DRIVER="<driver>", ATTR{phys_port_name}!="", \ |
| NAME="$attr{phys_port_name}" |
| |
| Suggested naming convention is "swXpYsZ", where X is the switch name or ID, Y |
| is the port name or ID, and Z is the sub-port name or ID. For example, sw1p1s0 |
| would be sub-port 0 on port 1 on switch 1. |
| |
| Switch ID |
| ^^^^^^^^^ |
| |
| The switchdev driver must implement the switchdev op switchdev_port_attr_get |
| for SWITCHDEV_ATTR_ID_PORT_PARENT_ID for each port netdev, returning the same |
| physical ID for each port of a switch. The ID must be unique between switches |
| on the same system. The ID does not need to be unique between switches on |
| different systems. |
| |
| The switch ID is used to locate ports on a switch and to know if aggregated |
| ports belong to the same switch. |
| |
| Port Features |
| ^^^^^^^^^^^^^ |
| |
| NETIF_F_NETNS_LOCAL |
| |
| If the switchdev driver (and device) only supports offloading of the default |
| network namespace (netns), the driver should set this feature flag to prevent |
| the port netdev from being moved out of the default netns. A netns-aware |
| driver/device would not set this flag and be responsible for partitioning |
| hardware to preserve netns containment. This means hardware cannot forward |
| traffic from a port in one namespace to another port in another namespace. |
| |
| Port Topology |
| ^^^^^^^^^^^^^ |
| |
| The port netdevs representing the physical switch ports can be organized into |
| higher-level switching constructs. The default construct is a standalone |
| router port, used to offload L3 forwarding. Two or more ports can be bonded |
| together to form a LAG. Two or more ports (or LAGs) can be bridged to bridge |
| L2 networks. VLANs can be applied to sub-divide L2 networks. L2-over-L3 |
| tunnels can be built on ports. These constructs are built using standard Linux |
| tools such as the bridge driver, the bonding/team drivers, and netlink-based |
| tools such as iproute2. |
| |
| The switchdev driver can know a particular port's position in the topology by |
| monitoring NETDEV_CHANGEUPPER notifications. For example, a port moved into a |
| bond will see it's upper master change. If that bond is moved into a bridge, |
| the bond's upper master will change. And so on. The driver will track such |
| movements to know what position a port is in in the overall topology by |
| registering for netdevice events and acting on NETDEV_CHANGEUPPER. |
| |
| L2 Forwarding Offload |
| --------------------- |
| |
| The idea is to offload the L2 data forwarding (switching) path from the kernel |
| to the switchdev device by mirroring bridge FDB entries down to the device. An |
| FDB entry is the {port, MAC, VLAN} tuple forwarding destination. |
| |
| To offloading L2 bridging, the switchdev driver/device should support: |
| |
| - Static FDB entries installed on a bridge port |
| - Notification of learned/forgotten src mac/vlans from device |
| - STP state changes on the port |
| - VLAN flooding of multicast/broadcast and unknown unicast packets |
| |
| Static FDB Entries |
| ^^^^^^^^^^^^^^^^^^ |
| |
| The switchdev driver should implement ndo_fdb_add, ndo_fdb_del and ndo_fdb_dump |
| to support static FDB entries installed to the device. Static bridge FDB |
| entries are installed, for example, using iproute2 bridge cmd: |
| |
| bridge fdb add ADDR dev DEV [vlan VID] [self] |
| |
| The driver should use the helper switchdev_port_fdb_xxx ops for ndo_fdb_xxx |
| ops, and handle add/delete/dump of SWITCHDEV_OBJ_ID_PORT_FDB object using |
| switchdev_port_obj_xxx ops. |
| |
| XXX: what should be done if offloading this rule to hardware fails (for |
| example, due to full capacity in hardware tables) ? |
| |
| Note: by default, the bridge does not filter on VLAN and only bridges untagged |
| traffic. To enable VLAN support, turn on VLAN filtering: |
| |
| echo 1 >/sys/class/net/<bridge>/bridge/vlan_filtering |
| |
| Notification of Learned/Forgotten Source MAC/VLANs |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| The switch device will learn/forget source MAC address/VLAN on ingress packets |
| and notify the switch driver of the mac/vlan/port tuples. The switch driver, |
| in turn, will notify the bridge driver using the switchdev notifier call: |
| |
| err = call_switchdev_notifiers(val, dev, info); |
| |
| Where val is SWITCHDEV_FDB_ADD when learning and SWITCHDEV_FDB_DEL when |
| forgetting, and info points to a struct switchdev_notifier_fdb_info. On |
| SWITCHDEV_FDB_ADD, the bridge driver will install the FDB entry into the |
| bridge's FDB and mark the entry as NTF_EXT_LEARNED. The iproute2 bridge |
| command will label these entries "offload": |
| |
| $ bridge fdb |
| 52:54:00:12:35:01 dev sw1p1 master br0 permanent |
| 00:02:00:00:02:00 dev sw1p1 master br0 offload |
| 00:02:00:00:02:00 dev sw1p1 self |
| 52:54:00:12:35:02 dev sw1p2 master br0 permanent |
| 00:02:00:00:03:00 dev sw1p2 master br0 offload |
| 00:02:00:00:03:00 dev sw1p2 self |
| 33:33:00:00:00:01 dev eth0 self permanent |
| 01:00:5e:00:00:01 dev eth0 self permanent |
| 33:33:ff:00:00:00 dev eth0 self permanent |
| 01:80:c2:00:00:0e dev eth0 self permanent |
| 33:33:00:00:00:01 dev br0 self permanent |
| 01:00:5e:00:00:01 dev br0 self permanent |
| 33:33:ff:12:35:01 dev br0 self permanent |
| |
| Learning on the port should be disabled on the bridge using the bridge command: |
| |
| bridge link set dev DEV learning off |
| |
| Learning on the device port should be enabled, as well as learning_sync: |
| |
| bridge link set dev DEV learning on self |
| bridge link set dev DEV learning_sync on self |
| |
| Learning_sync attribute enables syncing of the learned/forgotton FDB entry to |
| the bridge's FDB. It's possible, but not optimal, to enable learning on the |
| device port and on the bridge port, and disable learning_sync. |
| |
| To support learning and learning_sync port attributes, the driver implements |
| switchdev op switchdev_port_attr_get/set for |
| SWITCHDEV_ATTR_PORT_ID_BRIDGE_FLAGS. The driver should initialize the attributes |
| to the hardware defaults. |
| |
| FDB Ageing |
| ^^^^^^^^^^ |
| |
| The bridge will skip ageing FDB entries marked with NTF_EXT_LEARNED and it is |
| the responsibility of the port driver/device to age out these entries. If the |
| port device supports ageing, when the FDB entry expires, it will notify the |
| driver which in turn will notify the bridge with SWITCHDEV_FDB_DEL. If the |
| device does not support ageing, the driver can simulate ageing using a |
| garbage collection timer to monitor FBD entries. Expired entries will be |
| notified to the bridge using SWITCHDEV_FDB_DEL. See rocker driver for |
| example of driver running ageing timer. |
| |
| To keep an NTF_EXT_LEARNED entry "alive", the driver should refresh the FDB |
| entry by calling call_switchdev_notifiers(SWITCHDEV_FDB_ADD, ...). The |
| notification will reset the FDB entry's last-used time to now. The driver |
| should rate limit refresh notifications, for example, no more than once a |
| second. (The last-used time is visible using the bridge -s fdb option). |
| |
| STP State Change on Port |
| ^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| Internally or with a third-party STP protocol implementation (e.g. mstpd), the |
| bridge driver maintains the STP state for ports, and will notify the switch |
| driver of STP state change on a port using the switchdev op |
| switchdev_attr_port_set for SWITCHDEV_ATTR_PORT_ID_STP_UPDATE. |
| |
| State is one of BR_STATE_*. The switch driver can use STP state updates to |
| update ingress packet filter list for the port. For example, if port is |
| DISABLED, no packets should pass, but if port moves to BLOCKED, then STP BPDUs |
| and other IEEE 01:80:c2:xx:xx:xx link-local multicast packets can pass. |
| |
| Note that STP BDPUs are untagged and STP state applies to all VLANs on the port |
| so packet filters should be applied consistently across untagged and tagged |
| VLANs on the port. |
| |
| Flooding L2 domain |
| ^^^^^^^^^^^^^^^^^^ |
| |
| For a given L2 VLAN domain, the switch device should flood multicast/broadcast |
| and unknown unicast packets to all ports in domain, if allowed by port's |
| current STP state. The switch driver, knowing which ports are within which |
| vlan L2 domain, can program the switch device for flooding. The packet may |
| be sent to the port netdev for processing by the bridge driver. The |
| bridge should not reflood the packet to the same ports the device flooded, |
| otherwise there will be duplicate packets on the wire. |
| |
| To avoid duplicate packets, the device/driver should mark a packet as already |
| forwarded using skb->offload_fwd_mark. The same mark is set on the device |
| ports in the domain using dev->offload_fwd_mark. If the skb->offload_fwd_mark |
| is non-zero and matches the forwarding egress port's dev->skb_mark, the kernel |
| will drop the skb right before transmit on the egress port, with the |
| understanding that the device already forwarded the packet on same egress port. |
| The driver can use switchdev_port_fwd_mark_set() to set a globally unique mark |
| for port's dev->offload_fwd_mark, based on the port's parent ID (switch ID) and |
| a group ifindex. |
| |
| It is possible for the switch device to not handle flooding and push the |
| packets up to the bridge driver for flooding. This is not ideal as the number |
| of ports scale in the L2 domain as the device is much more efficient at |
| flooding packets that software. |
| |
| If supported by the device, flood control can be offloaded to it, preventing |
| certain netdevs from flooding unicast traffic for which there is no FDB entry. |
| |
| IGMP Snooping |
| ^^^^^^^^^^^^^ |
| |
| In order to support IGMP snooping, the port netdevs should trap to the bridge |
| driver all IGMP join and leave messages. |
| The bridge multicast module will notify port netdevs on every multicast group |
| changed whether it is static configured or dynamically joined/leave. |
| The hardware implementation should be forwarding all registered multicast |
| traffic groups only to the configured ports. |
| |
| L3 Routing Offload |
| ------------------ |
| |
| Offloading L3 routing requires that device be programmed with FIB entries from |
| the kernel, with the device doing the FIB lookup and forwarding. The device |
| does a longest prefix match (LPM) on FIB entries matching route prefix and |
| forwards the packet to the matching FIB entry's nexthop(s) egress ports. |
| |
| To program the device, the driver implements support for |
| SWITCHDEV_OBJ_IPV[4|6]_FIB object using switchdev_port_obj_xxx ops. |
| switchdev_port_obj_add is used for both adding a new FIB entry to the device, |
| or modifying an existing entry on the device. |
| |
| XXX: Currently, only SWITCHDEV_OBJ_ID_IPV4_FIB objects are supported. |
| |
| SWITCHDEV_OBJ_ID_IPV4_FIB object passes: |
| |
| struct switchdev_obj_ipv4_fib { /* IPV4_FIB */ |
| u32 dst; |
| int dst_len; |
| struct fib_info *fi; |
| u8 tos; |
| u8 type; |
| u32 nlflags; |
| u32 tb_id; |
| } ipv4_fib; |
| |
| to add/modify/delete IPv4 dst/dest_len prefix on table tb_id. The *fi |
| structure holds details on the route and route's nexthops. *dev is one of the |
| port netdevs mentioned in the routes next hop list. If the output port netdevs |
| referenced in the route's nexthop list don't all have the same switch ID, the |
| driver is not called to add/modify/delete the FIB entry. |
| |
| Routes offloaded to the device are labeled with "offload" in the ip route |
| listing: |
| |
| $ ip route show |
| default via 192.168.0.2 dev eth0 |
| 11.0.0.0/30 dev sw1p1 proto kernel scope link src 11.0.0.2 offload |
| 11.0.0.4/30 via 11.0.0.1 dev sw1p1 proto zebra metric 20 offload |
| 11.0.0.8/30 dev sw1p2 proto kernel scope link src 11.0.0.10 offload |
| 11.0.0.12/30 via 11.0.0.9 dev sw1p2 proto zebra metric 20 offload |
| 12.0.0.2 proto zebra metric 30 offload |
| nexthop via 11.0.0.1 dev sw1p1 weight 1 |
| nexthop via 11.0.0.9 dev sw1p2 weight 1 |
| 12.0.0.3 via 11.0.0.1 dev sw1p1 proto zebra metric 20 offload |
| 12.0.0.4 via 11.0.0.9 dev sw1p2 proto zebra metric 20 offload |
| 192.168.0.0/24 dev eth0 proto kernel scope link src 192.168.0.15 |
| |
| XXX: add/mod/del IPv6 FIB API |
| |
| Nexthop Resolution |
| ^^^^^^^^^^^^^^^^^^ |
| |
| The FIB entry's nexthop list contains the nexthop tuple (gateway, dev), but for |
| the switch device to forward the packet with the correct dst mac address, the |
| nexthop gateways must be resolved to the neighbor's mac address. Neighbor mac |
| address discovery comes via the ARP (or ND) process and is available via the |
| arp_tbl neighbor table. To resolve the routes nexthop gateways, the driver |
| should trigger the kernel's neighbor resolution process. See the rocker |
| driver's rocker_port_ipv4_resolve() for an example. |
| |
| The driver can monitor for updates to arp_tbl using the netevent notifier |
| NETEVENT_NEIGH_UPDATE. The device can be programmed with resolved nexthops |
| for the routes as arp_tbl updates. The driver implements ndo_neigh_destroy |
| to know when arp_tbl neighbor entries are purged from the port. |
| |
| Transaction item queue |
| ^^^^^^^^^^^^^^^^^^^^^^ |
| |
| For switchdev ops attr_set and obj_add, there is a 2 phase transaction model |
| used. First phase is to "prepare" anything needed, including various checks, |
| memory allocation, etc. The goal is to handle the stuff that is not unlikely |
| to fail here. The second phase is to "commit" the actual changes. |
| |
| Switchdev provides an infrastructure for sharing items (for example memory |
| allocations) between the two phases. |
| |
| The object created by a driver in "prepare" phase and it is queued up by: |
| switchdev_trans_item_enqueue() |
| During the "commit" phase, the driver gets the object by: |
| switchdev_trans_item_dequeue() |
| |
| If a transaction is aborted during "prepare" phase, switchdev code will handle |
| cleanup of the queued-up objects. |