blob: 925d78667d2d1a5b46939580837eadac640f0835 [file] [log] [blame]
README - Myricom 10GbE driver for Linux
Contents:
I. Installation
II. Performance Tuning
III. Troubleshooting
IV. Compiling against another kernel
V. Compile-time options
VI. Load-time options
This Myricom 10GbE driver for Myri-10G NICs is intended for use only with
Linux kernel version 2.6 or later. It has been tested with Red Hat
Enterprise Linux 4, and several kernel.org kernels.
I. Installation
===============
To build the driver, type
% cd myri10ge/linux
% make clean
% make
% su root
# make install-only
To compile against a kernel that is different than the current running
kernel, see the "Compiling against another kernel" section below.
To load the Myricom 10GbE driver, type the command
# modprobe myri10ge
A new ethernet interface, having a MAC address beginning with 00:60:DD,
should now appear in the output of ifconfig -a . For example:
# ifconfig -a | grep 00:60:DD
eth2 Link encap:Ethernet HWaddr 00:60:DD:47:E5:31
In the examples below, we will assume our device is named "eth2".
If the driver fails to load, refer to the "Troubleshooting" section
below. If an error occurs during the installation procedure or at run-time,
please send the output of myri10ge_bugreport.sh to help@myri.com.
II. Performance Tuning
======================
In addition to the suggestions below, please see
http://www.myri.com/cgi-bin/fom?file=511#linux for additional performance
tuning recommendations.
A. Jumbo Frames
------------------
Using jumbo frames can greatly reduce the host overhead for 10 Gigabit
ethernet. To use jumbo frames, you must ensure all layer-2 equipment
(switches, bridges, etc) on your local LAN are configured to also
use jumbo frames. Jumbo frames can be enabled in any of 3 ways:
- Compile time: Build and install the driver as follows, and it
will use jumbo frames by default:
$ make MYRI10GE_JUMBO=1
$ su
# make install-only
- Load time: Use the myri10ge_initial_mtu module parameter to enable
jumbo frames:
modprobe myri10ge myri10ge_initial_mtu=9000
- Distribution network configuration files: You can also enable jumbo
frames in a distribution dependent way by editing the appropriate
configuration file. This will vary by Linux distribution.
Red Hat based distributions (RHEL, Centos, and Fedora Core) typically
configure the MTU using a line in the appropriate
/etc/sysconfig/network-script/ifcfg-ethX file. Assuming the myri10ge
interface is named eth2, you would edit
/etc/sysconfig/network-scripts/ifcfg-eth2 file, and add the line:
MTU=9000
and then reboot the system.
We suggest you consult your Linux vendor's documentation for specifics
on enabling jumbo frames in your specific Linux version.
B. Write Combining
------------------
Enabling Write Combining (WC) on the device's memory range will
improve performance.
Running the command
ethtool -S eth2 | grep WC
will indicate if the driver was able to enable WC.
If WC is disabled, please see http://www.myri.com/cgi-bin/fom?file=416
for tips on how to allow the driver to enable it.
C. Network Buffer Sizes
-----------------------
For best performance, we recommend increasing several network buffer
sizes from their default values. Add the following lines to
/etc/sysctl.conf and execute the command "sysctl -p /etc/sysctl.conf".
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.core.netdev_max_backlog = 250000
For best performance with a 1500 byte MTU on a LAN, we suggest
disabling TCP timestamps by adding the following line to
/etc/sysctl.conf and executing the command "sysctl -p /etc/sysctl.conf".
net.ipv4.tcp_timestamps = 0
D. Interrupt Coalescing
-----------------------
This driver is ethtool compliant, and the interrupt coalescing parameter
can be adjusted via "ethtool -C $DEVNAME rx-usecs $VALUE".
The default setting is a compromise between latency and cpu overhead.
You may wish to reduce rx-usecs if latency is more important and you are
using a low-latency switch or a point-to-point connection. Similarly,
you may wish to increase rx-usecs if you are interested in reducing
CPU overhead for large transfers. Note that rx-usecs controls both
transmit and receive coalescing.
If you are using a kernel prior to 2.6.15, and notice that increasing
rx-usecs results in a sharp decline in TCP performance, you may want
to increase the TSO window divisor by adding the following line to
/etc/sysctl.conf and executing the command "sysctl -p /etc/sysctl.conf".
net.ipv4.tcp_tso_win_divisor = 32
For example, for the best performance on opterons, you should set
rx-usecs to at least 75us (ethtool -C ethX rx-usecs 75). Also, we've
found that disabling TCP timestamps is very important on opterons.
Try sysctl net.ipv4.tcp_timestamps=0.
E. MSI versus Legacy Interrupts
-------------------------------
Enabling MSI interrupts will lower interrupt latency and can improve
performance under some workloads. Our driver will only request MSI
interrupts on chipsets it has confidence will work with MSI interrupts.
To use MSI interrupts, the Linux kernel must be compiled with MSI support
(CONFIG_PCI_MSI=y). To see if MSI interrupts were enabled, check
ethtool for the myri10ge device:
# ethtool -S eth2 | grep MSI
MSI: 1
A non-zero value indicates that an MSI interrupt is being used by our
device. However, if the value is 0 and dmesg shows a message like the
following, it means that the Linux kernel did not allow our device to
use MSI interrupts:
myri10ge: Error setting up MSI on device 0000:05:00.0, falling back to
xPIC
If you would like to force the use of MSI interrupts, you should load
the driver using:
# modprobe myri10ge myri10ge_msi=1
If MSI interrupts are still not enabled even when setting
myri10ge_msi=1, this may mean your Linux distribution disables MSI by
default on a global basis. Recent Ubuntu and Fedora Core versions are
known to do this. To enable MSI, you must add pci=msi to the kernel
parameters and reboot.
Note that if MSI interrupts were forced to be enabled, but the
interface now fails to pass traffic, you should revert to using xPIC
interrupts by reloading the driver without using myri10ge_msi=1,
and remove pci=msi from your kernel parameters.
If it's not possible to enable MSI interrupts with the specific Linux
release that you're using, you can make xPIC interrupts less expensive
by loading the driver with:
# modprobe myri10ge myri10ge_deassert_wait=0
or set it at runtime via
'echo 0 > /sys/module/myri10ge/myri10ge_deassert_wait'
Not using MSI or myri10ge_deassert_wait=0 costs about 500Mb/s in our
performance measurements for a single stream.
F. Module compilation
---------------------
If you are using Linux kernel version 2.6.16 or higher, you will see
improved receive performance if you change the definition of
MYRI10GE_ALLOC_ORDER to 2 or more. This will cause the driver to
allocate receive buffers from 2^MYRI10GE_ALLOC_ORDER contiguous pages.
This reduces the number of allocations that the driver will make, as
well as potentially reducing the number of IOMMU manipulations, at the
cost of making each allocation more expensive. Please note that if
the system is under heavy memory load, you will have an increased
likelihood of allocation failures because it is harder for the kernel
to provide contiguous pages.
To change the this parameter, rebuild by:
% make clean
% make MYRI10GE_ALLOC_ORDER=$ORDER
% su root
# make install-only
# rmmod myri10ge
# modprobe myri10ge
Where $ORDER ranges in value from 1..3.
A good value to choose is MYRI10GE_ALLOC_ORDER=2, as it results in
16KB allocations. This is the same size allocation as a driver which
does not use PAGE_SIZE buffers, and simply allocates 9KB jumbo
frames. You may want to experiment with making MYRI10GE_ALLOC_ORDER=3,
but this is a bit more likely to fail under heavy memory pressure.
G. Packet forwarding
---------------------
If your workload is primarily traffic forwarding or traffic analysis,
you should build the driver using the MYRI10GE_RX_SKBS=1 compile
option. This causes the driver to receive into standard skbufs,
rather than into pages attached to an skbuf. Using this option is
critical for forwarding standard MTU frames at line rate, and for
forwarding frames to interfaces whose drivers do not support
scatter-gather DMA. However, this option is incompatible with LRO,
and should therefor not be used on an endstation.
To change this parameter, rebuild by:
% make clean
% make MYRI10GE_RX_SKBS=1 MYRI10GE_LRO=0
% su root
# make install-only
# rmmod myri10ge
# modprobe myri10ge
H. MSI-X interrupts and Multiple Receive Queues
-------------------------------------------------
If your kernel, motherboard and NIC are MSI-X capable, you can take
advantage of hardware steering of incoming IPv4 traffic into multiple
sets of receive queues.
To check if your NIC is capable of MSIX, use lspci:
# lspci -v -d 14c1: | grep 'MSI-X'
If your NIC is MSI-X capable, you should see:
Capabilities: [d0] MSI-X: Enable- Mask- TabSize=128
To enable multiple sets of receive queues (called slices), load the
driver with the module parameter myri10ge_max_slices set to the number
of slices you want to use, or to -1, which will pick the optimal
number. Note that the driver will use a number of slices which is the
largest power of two equal to or below myri10ge_max_slices that it can
successfully configure. After loading the driver, you should see a
line mentioning the number of MSI-X IRQs in use in your kernel
messages log:
myri10ge 0000:05:00.0: 4 MSI-X IRQs, tx bndry 4096, fw myri10ge_rss_eth_z8e.dat, WC Enabled
Most Linux kernel versions with which we have tested will deliver all
MSI-X interrupts to the same CPU by default, which defeats the purpose
of steering packets into multiple queues. To achieve optimal results
you should manually bind the interrupt from each slice to a different
CPU core. We include an example script, msixbind.sh, which will bind
slice 0 to CPU 0, slice 1 to CPU 1, etc.
I. Intel Direct Cache Access (DCA)
----------------------------------
If you have a recent Intel server or workstation chipset, you may be
able to take advantage of DCA. DCA causes DMA writes posted by the
NIC to be prefetched into a CPU's cache, thereby reducing cache misses
and increasing CPU efficiency for received network traffic.
Support for DCA is provided by the Intel ioatdma driver. DCA support
was added to the ioatdma in the 2.6.24 series. Please make sure you
have both ioatdma and dca configured. Note that we have also found
that using ioatdma's copy offload for TCP actually degrades
performance at 10GbE speeds, so make sure to avoid configuring that
(or disable it at runtime via sysctl net.ipv4.tcp_dma_copybreak=2147483647)
Here is a snippet from a 2.6.24 .config file showing an optimal
configuration:
#
# DMA Devices
#
CONFIG_INTEL_IOATDMA=m
CONFIG_DMA_ENGINE=y
#
# DMA Clients
#
# CONFIG_NET_DMA is not set
CONFIG_DCA=m
If you have an older kernel or prefer not to recompile your 2.6.24 or
newer kernel, you may download the ioatdma driver from Intel (version
2.15 or newer). Build and install the ioatdma driver prior to
installing the myri10ge driver. Save the ioatdma source directory,
and point the Myri10GE build at it when building the Myri10GE driver
via the DCA_FLAGS argument to make.
# make DCA_FLAGS="-DCONFIG_DCA -I/path/to/ioatdma-<vers>/include"
# make install-only
After loading the driver, you can confirm that DCA is enabled via:
# ethtool -S eth2 | grep dca
dca_capable: 1
dca_enabled: 1
III. Troubleshooting
====================
If the recommendations below do not resolve the problem you have
encountered, please send a full description, along with the output of
myri10ge_bugreport.sh, to help@myri.com.
Large Receive Offload (LRO) is enabled by default. This will
interfere with forwarding TCP traffic. If you plan to forward TCP
traffic (using the host with the Myri10GE NIC as a router or bridge),
you must disable LRO. To disable LRO, load the myri10ge driver
with myri10ge_lro set to 0:
# modprobe myri10ge myri10ge_lro=0
Alternatively, you can disable LRO at runtime by disabling
receive checksum offloading via ethtool:
# ethtool -K eth2 rx off
The ability to saturate a 10GbE link depends on having sufficient
PCI-Express bandwidth. When loaded, our driver calculates the
available bus bandwidth (read DMA, write DMA, and simultaneous read
and write DMA) and stores it so that ethtool may retrieve it later. To
view the bus bandwidth, use the following command:
# ethtool -S eth2 | grep dma
Note that the reported bandwidth is measured in megabytes per second,
not megabits. This means that 10Gb/s corresponds to 1280MB/s.
This driver uses the Linux hotplug facility to load its firmware by
default. It will look in /lib/firmware (Redhat), or
/usr/lib/hotplug/firmware (SuSE) for a firmware image. The firmware
images are copied there at install time. If there is a problem
locating the firmware, the driver will fail to load, and you will see
a message like this on the console:
Myricom MYRI10GE driver 0000:05:00.0: Unable to load myri10ge_eth_z8e.dat
firmware image, status = -2
This may be caused by your distribution using a different location for
firmware. Please contact help@myri.com if you have a problem loading
firmware.
If the driver fails to load because of the unknown symbols
"release_firmware" and "request_firmware", this means that you need to
install the firmware loading module via "modprobe firmware_class".
Also, make sure your kernel is built with CONFIG_FW_LOADER= 'y' or 'm'.
As a workaround, you may wish to build the firmware into the myri10ge
kernel module itself. To do this, build the module using
MYRI10GE_BUILTIN_FW=1
# make MYRI10GE_BUILTIN_FW=1
If the driver fails to load because of the unknown symbols
"zlib_inflate", "zlib_inflateInit2", and "zlib_inflate_workspacesize",
this means you need to install the zlib module via
# modprobe zlib_inflate
If MSI interrupts were automatically enabled, but the interface fails
to pass traffic, you should revert to using xPIC interrupts by
reloading the driver using:
# modprobe myri10ge myri10ge_msi=0
If you are using 802.1q VLANs, and you see an error message in the
kernel log which looks like:
hw tcp v4 csum failed
you need to adjust the myri10ge_vlan_csum_fixup parameter. This
tunable parameter controls whether or not the driver corrects the
hardware checksum of received 802.1q VLAN tagged frames to account for
the extra 4 bytes of VLAN header. In kernel.org kernels 2.6.14 and
later, the Linux 802.1q VLAN module automatically does this
correction, so our driver does not need to. In earlier Linux kernels
(2.6.13 and earlier), however, the correction is not included, so our
driver needs to perform this modification. Thus, the
myri10ge_vlan_csum_fixup parameter defaults to true (non-zero) on
kernel versions prior to 2.6.14, and to false (zero) on newer kernel
versions. To enable the correction in the Myri10GE driver, reload the
driver using:
# modprobe myri10ge myri10ge_vlan_csum_fixup=1
Or you can adjust this at runtime using:
# echo 1 > /sys/module/myri10ge/myri10ge_vlan_csum_fixup
Or (depending on your kernel version):
# echo 1 >
/sys/module/myri10ge/parameters/myri10ge_vlan_csum_fixup
Similarly, replace "0" with "1" above to disable the correction.
TSO can potentially overwhelm the receiver and lead to packet loss and
retransmissions. If you see an increase in bandwidth after disabling
TSO, check your switch counters and settings to ensure flow control is
enabled. TSO can be disabled as follows:
# ethtool -K eth2 tso off
If you are using another vendor's driver which also sets up PAT write
combining, and that vendor's driver uses a different PAT index than
our driver, the other vendor's driver may note a conflict and run in
reduced performance mode. Examples of other drivers which use PAT
include the Nvidia graphics driver, and various Infiniband drivers.
One way to work around this PAT conflict is to change the PAT index
used by the Myri10GE driver by rebooting, and loading the module using
the myri10ge_pat_idx modparam to specify a different PAT index. We
currently default to 6, while some other vendors use 1. Good values
to try are 1, 4, 5, and 7.
IV. Compiling against another kernel
====================================
To build for kernel different than the installed kernel, assuming its `uname
-r` is 2.6.12-1-686 and its modules have been installed into
/lib/modules/2.6.12-1-686,
type
% make clean
% make KVER=2.6.12-1-686
...
To build against a kernel that has not been installed yet, but whose sources
are in <src> and have been built in <build> (possibly the same directory),
type
% make clean
% make KSRC=<src> KDIR=<build>
...
Be sure to always 'make clean' before compiling against another kernel since
the myri10ge_checks.h has to be regenerated according to the right kernel
headers before compiling.
V. Compile-time options
=======================
To rebuild the module in a non-default manner, simply type:
% make OPTION=value
% su root
# make install-only
# rmmod myri10ge
# modprobe myri10ge
where the following OPTIONs are available:
Option Values Default Meaning
------------ ------ ------- ---------
MYRI10GE_LRO 0, 1 1 Enable or disable LRO
MYRI10GE_BUILTIN_FW 0, 1 0 Build in firmware? (see above)
MYRI10GE_ALLOC_ORDER 0..3 0 Allocate pages of this "order", see
explanation above.
MYRI10GE_RX_SKBS 0, 1 0 Receive into skbufs? (see above)
MYRI10GE_THROTTLE 0, 1 0 Throttle transmit bandwidth
MYRI10GE_JUMBO 0, 1 0 Default to using a 9000 byte MTU
After rebuilding and re-installing the module, you can confirm the
module was built correctly by checking the compile options using
ethtool -S. The presence of LRO flushed indicates LRO is compiled
in, for all others, simply look for the option name in lower case
without the leading MYRI10GE_. For example, to confirm a driver is
compiled with MYRI10GE_ALLOC_ORDER=3, do the following:
# ethtool -S eth2 | grep alloc_order
alloc_order: 3
And to confirm the driver is using LRO:
# ethtool -S eth1 | grep 'LRO flushed'
LRO flushed: 11320658
VI. Load-time options
=====================
When loading the myri10ge module, you may change a variety of options
by appending them to the modprobe line:
# modprobe myri10ge OPTION=value
Option Values Default Meaning
------------ ------ ------- ---------
myri10ge_force_firmware 0, 1 0 Force firmware to assume that the
host provides aligned PCIe
completions.
myri10ge_fw_name string [a] Name of firmware image to load via
hotplug.
myri10ge_fw_names string [d] Comma separated list of names
of firmware images to load via
hotplug.
myri10ge_ecrc_enable 0,1 1 Enable ECRC on parent bridge if
needed.
myri10ge_msi -1,0,1 -1 Enable use of MSI interrupts.
myri10ge_force_nvidia_msi 0,1 0 Forcibly enable MSI on Nvidia chipsets.
myri10ge_intr_coal_delay 0..N [b] Initial interrupt coalescing delay
in usecs.
myri10ge_flow_control 0, 1 1 Enable flow control.
myri10ge_deassert_wait 0, 1 1 Wait for xPIC interrupt
deassertion before exiting
interrupt handler.
myri10ge_initial_mtu 128..9000 9000 Initial default MTU.
myri10ge_vlan_csum_fixup 0,1 [c] Do VLAN Checksum fixup for
received frames.
myri10ge_max_slices -1,1..N 1 How many sets of receive
queues to use per NIC.
myri10ge_pat_idx 1,4..7 6 Which PAT idx to use to setup WC
myri10ge_big_skb_limit 0,2-512 0 Limit number of skbs allocated
to big rx ring. 0 means unlimited.
myri10ge_lro 0,1 1 Use TCP LRO (where available)
myri10ge_gro 0,1 1 Use GRO (where available)
a: This defaults to myri10ge_eth_z8e.dat or myri10ge_ethp_z8e.dat depending
on the host bridge chip in your machine.
b: This defaults to 25us for kernels older than 2.6.15, and 75us for
newer kernels.
c: This defaults to 1 for kernels older than 2.6.14
d: Firmware images are specified as myri10ge_fw_names=image1.dat,image2.dat
In this example, image1.dat is loaded on the first myri10ge NIC found by
the driver, and image2.dat is loaded on the next myri10ge NIC
found. Note that this option overrides both the defaults [a] and
any global default specified by myri10ge_fw_name.