In this day and age, where almost everything is connected to the Internet, the
demands on networking are mushrooming. In the developed world it's common to
get 20 megabit per second connections on our mobile devices and 50 megabit per
second connections at home. By extension, the demands on enterprise data centers
are even higher (by at least three to four orders of magnitude), as these central
hubs are where traffic from the aforementioned individual end nodes converges.
Consider the act of flipping through a series of cloud-hosted HD photos on a
mobile device: this can easily result in billions of packets being transferred in
fractions of a second.
The good news is that our networking interfaces are getting bigger and faster. 40
gigabit per second Ethernet is currently being deployed, and work to finalize 100
gigabit per second endpoint interfaces is currently underway.
As one might imagine, high-throughput interfaces also call for link aggregation,
whether in active-backup mode or in active-active mode, depending on the
application. Link aggregation, for those who may be new to the concept, means making
two physical links look like one logical link at layer 2.
Red Hat Enterprise Linux has, for some time, provided users with a bonding driver to
achieve link aggregation, and bonding works well for most applications. That said,
the bonding driver's architecture is such that the control, management, and data paths
are all handled in kernel space, limiting its flexibility.
So where am I headed with this? Well, you may have heard that Red Hat Enterprise
Linux 7 has introduced a team driver.
The team driver does not try to replicate or mimic the bonding driver; it has been
designed to solve the same problem(s) using a wholly different design and a
different approach, one where special attention was paid to flexibility and
efficiency. The best part is that configuration, management, and monitoring of the
team driver are significantly improved, with no compromise on performance, features,
or throughput.
Coming full circle (you read the title, right?), the team driver can pretty much be
summarized by this sentence: if you like bonding, you will love teaming.
Side by Side
The team driver supports all of the most commonly used features of the bonding
driver, plus many more. The following table facilitates an easy side-by-side
comparison.
Feature                                           Bonding   Team
broadcast TX policy                               Yes       Yes
round-robin TX policy                             Yes       Yes
active-backup TX policy                           Yes       Yes
hash-based TX policy                              Yes       Yes
TX load-balancing support (TLB)                   Yes       Yes
VLAN support                                      Yes       Yes
D-Bus interface                                   No        Yes
MQ interface                                      No        Yes
logic in user-space                               No        Yes
modular design                                    No        Yes
extensibility                                     Hard      Easy
performance overhead                              Low       Very low
RX load-balancing support (ALB)                   Yes       Planned
RX load-balancing support (ALB) in bridge or OVS  No        Planned
Interested in giving it a shot? It's not that difficult to migrate from bonding to teaming.
Migration
To facilitate migration from the bonding driver to the team driver, we have created a
robust migration script called bond2team. Please see the bond2team manual page (man 1
bond2team) for available options. In essence, this script allows existing deployments of
bonded interfaces to be moved to teamed interfaces seamlessly.
Demos
Curious to see a demo before you pull the trigger? A link to the more technical
details associated with the team driver can be found here, and you can see the team
driver in action here.
Performance

             64-byte packets   1KB packets       64KB packets      Average latency
eth0         1664.00 Mb/s      8053.53 Mb/s      9414.99 Mb/s      54.7 usec
             (27.48% CPU)      (30.71% CPU)      (17.08% CPU)
eth1         1577.44 Mb/s      7728.04 Mb/s      9329.05 Mb/s      49.3 usec
             (26.91% CPU)      (32.23% CPU)      (19.38% CPU)
bonded       1510.13 Mb/s      7277.48 Mb/s      9414.97 Mb/s      55.5 usec
(eth0+eth1)  (27.65% CPU)      (30.07% CPU)      (15.62% CPU)
teamed       1550.15 Mb/s      7435.76 Mb/s      9413.8 Mb/s       55.5 usec
(eth0+eth1)  (26.81% CPU)      (29.56% CPU)      (17.63% CPU)
Before I sign off, I also wanted to share the table above. In short, team driver
performance is largely equal to or better than the respective bonding driver
performance when all other variables are held constant.
A very important part is the Team kernel driver (referred to as "driver" in the text).
The driver has been part of the Linux kernel since version 3.3. Although it is designed
to be very slim, it has a key role in the project. Its purpose is to implement all the
things "which should be done fast", mainly the packet (skb) transmit and receive
flows. The motto is: "If something can be done in userspace, do it in userspace".
To avoid confusion, I want to emphasize that skbs do not leave kernel
space; they are not copied to userspace. It would make no sense to do so, and it
would also dramatically slow down skb processing. The Team driver also provides
the Team Netlink API. This interface is implemented using Generic Netlink and
gives userspace the ability to set up or change the driver's behaviour and to
get events (such as port state changes) from the driver. The driver on its own has
no logic; it does not decide its behaviour on its own. That is the job of the
userspace application.
Lib
Another very important part is the Libteam lib (referred to as "lib" in the text). This lib is part of
the Libteam project. It uses libnl, and its primary purpose is to wrap Team Netlink communication
for userspace. The user does not have to know anything about the Netlink API; the user only calls
the provided functions, and the lib takes care of constructing and parsing Netlink messages. All
messages coming from the driver are cached, so when the user requests some data, there is no need to
send and receive any messages.
The lib also wraps up RT Netlink messages (such as newlink, dellink, etc.). These are exported to the
user to make it possible to create or delete a Team driver instance, to add or remove a port, to
get or set hardware addresses of network interfaces, and so on.
Note that it is not absolutely necessary to use this lib. An application can implement the client
side of the Team Netlink API on its own, but that is not recommended: an application which wants to
communicate with the Team driver should use this lib.
teamd
teamd stands for "Team daemon". teamd is an application which is part of the
Libteam project and uses the Libteam lib. It runs as a daemon, and one instance
of teamd works with one instance of the Team driver (one team netdev, for
example team0). Its purpose is to implement the various kinds of Team behaviour
logic, from the most basic ones such as round-robin to more complex ones such as
active-backup and load-balancing. The logic is implemented in teamd parts called
"runners". More about runners can be found later in the text.
teamd tries to be universal, providing various kinds of behaviour logic (which can
be extended by custom runners). Note that teamd is an optional part; users can
write their own application that uses the lib and implements the logic.
To download the Libteam repository (containing the Libteam lib and teamd), run:
$ git clone git://github.com/jpirko/libteam.git
Team Netlink
This section describes the Team Netlink API; it is probably better to understand how the
driver and the lib communicate with each other before they are described in detail.
There are two types of messages: the port list and the option list.
Port list messages travel only in the driver-to-lib direction (the data are read-only).
The schema of the message looks like this:
port item
    interface index (unsigned 32-bit value)
    changed flag (boolean) - tells whether any of this port's status values changed
    removed flag (boolean) - tells whether this port was removed, so userspace
    can react to that properly
    link up (boolean) - link status taken directly from ethtool
    speed (unsigned 32-bit value) - speed in Mbps taken directly from ethtool
    duplex (unsigned 8-bit value, 0 or 1) - duplex taken directly from ethtool
port item
    ....
port item
    ....
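To make the schema concrete, here is a small Python model of a port item. This is illustrative only: the real messages are Netlink attributes handled in C, and the class name below is invented for this sketch; the field names come from the schema above.

```python
# Hypothetical Python model of one "port item" from a Team Netlink
# port list message (field names taken from the schema above).
from dataclasses import dataclass

@dataclass
class PortItem:
    ifindex: int   # interface index (unsigned 32-bit value)
    changed: bool  # any status value changed since the last message
    removed: bool  # port was removed; userspace should react to that
    linkup: bool   # link status, taken directly from ethtool
    speed: int     # speed in Mbps, taken directly from ethtool
    duplex: int    # 0 = half, 1 = full, taken directly from ethtool

# A port list message is then simply a sequence of such items:
ports = [
    PortItem(ifindex=3, changed=True, removed=False,
             linkup=True, speed=1000, duplex=1),
    PortItem(ifindex=4, changed=False, removed=False,
             linkup=False, speed=0, duplex=0),
]
```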
Option list messages travel in both directions. In the driver-to-lib direction,
driver options are exposed and transferred to userspace. In the lib-to-driver
direction, the message is used to inform the driver which option values should be
changed and how. The schema of the message looks like this:
option item
    name (string) - name of the option
    changed flag (boolean) - tells whether this option's value changed
    removed flag (boolean) - tells whether this option was removed, so userspace
    can react to that properly
    type (unsigned 8-bit value) - determines the type of the following data
    field; can be an unsigned 32-bit value, a string, a boolean, or binary data
    data (dynamic type) - actual option value, of a type determined by the
    type field
    port interface index (optional, unsigned 32-bit value) - if the option is
    "per-port", this field is set to the appropriate port interface index;
    more about per-port options later in the text
    array index (optional, unsigned 32-bit value) - if the option is an
    "array", this field is set to the appropriate array index; more about
    array options later in the text
option item
    ....
option item
    ....
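The option item can be modeled the same way. Again a hypothetical Python sketch, not the real Netlink encoding; the optional per-port and array fields are simply absent (None) when the option is neither per-port nor an array.

```python
# Hypothetical Python model of one "option item" from a Team Netlink
# option list message. The `data` field carries one of the four value
# types named in the schema (u32, string, boolean, or binary data).
from dataclasses import dataclass
from typing import Optional, Union

OptionValue = Union[int, str, bool, bytes]

@dataclass
class OptionItem:
    name: str
    changed: bool
    removed: bool
    data: OptionValue
    port_ifindex: Optional[int] = None  # set only for "per-port" options
    array_index: Optional[int] = None   # set only for "array" options

# A global option and a per-port option:
mode_opt = OptionItem(name="mode", changed=False, removed=False,
                      data="activebackup")
enabled_opt = OptionItem(name="enabled", changed=True, removed=False,
                         data=True, port_ifindex=3)
```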
Messages are sent in the following cases:
1. lib requests the port list - the message contains all present ports.
2. lib requests the option list - the message contains all present options.
3. lib requests an option value change - the message contains one or more option
items for those options whose values should be changed in the driver.
4. driver port status change event - the driver is notified by the net core that one
of its port devices changed status. In this case the message contains only the port
which changed or was removed.
5. driver option value change event - happens either asynchronously or as a reaction
to point 3. In this case the message contains only the options whose values changed
or which were removed.
In cases 4 and 5, the Netlink multicast facility is used, so multiple userspace
applications can get the message. Applications can therefore monitor changes that
other applications are performing.
teamnl
teamnl uses libteam and provides a wrapper for Team device Netlink
communication.
For a detailed list and descriptions of teamnl command line parameters, please
see the appropriate manpage (man 8 teamnl).
Driver
The driver allows its instances to be created via RTNL, so one can easily create or
delete a team instance using the ip utility like this:
# ip link add name team0 type team
# ip link delete team0
A random hardware address is generated for a newly created device unless the
user explicitly specifies it.
Also, the ip utility can be used to add or remove ports like this:
# ip link set eth0 master team0
# ip link set eth0 nomaster
The Team driver uses the netdevice notification facility to catch events happening
on its ports (such as port link changes, port disappearance, and so on) in order to
perform the necessary actions in reaction to them. When an instance
of Team is created (call it, for example, team0), it looks like any other network
device, so one can perform any action on it as if it were an ordinary network
device (assign IP addresses, use it in iptables rules, etc.). The difference is that
it is not by itself able to receive or transmit skbs. It uses other network devices
(typically ones representing real NICs) to do that.
Notice what happens when the net core asks a Team driver instance (called
teamdev in the text) to transmit an skb. In an ordinary driver for a real
NIC, the skb is pushed into the hardware and the hardware does the transmit. In the
Team driver, the selected port-selection algorithm (see Team modes later in the
text) picks the port best suited to transmit that skb. That port's netdevice is
then asked to transmit the skb. This process is transparent to the net core and
to the port netdevice as well.
On the receive side, the Team driver uses the rx_handler hook in the net core RX procedure to
intercept incoming skbs. The same hook is used by bonding, bridging, macvlan and Open vSwitch.
Using this hook, the Team driver changes each skb so it looks as if it came from the teamdev (the
one that owns the originating port netdevice). Net core code executed after that point treats the
skb as if it actually came from the teamdev.
Great emphasis is put on achieving the best possible performance. The fast paths (the skb transmit
and receive paths) are therefore lockless (using only RCU). Also, data structures are laid out in
memory to achieve maximum locality and a minimum of pointer dereferences.
Options
An option infrastructure is necessary for modular and transparent work with driver
options. It separates the code so that the actual option holder does not have to know
anything about Netlink, and the Team Netlink core does not have to know anything
about option values. The main part is the option object. This object contains the name
of the option, its type, a function for getting the option value (getter), a function
for setting the option value (setter), etc. This object is registered with the option
facility. The facility then creates an option instance with a reference to the
originating option object. This instance is subsequently used by the Team Netlink
core. In Team Netlink messages, each option instance has its own option item.
When a Netlink message of the option list type arrives with a request for an option
value to be set, the Netlink core parses the message and searches through all existing
instances. Once the desired instance is found, the Netlink core calls the setter
function of the option object to change the option value.
Likewise, when the option list is requested by the lib, the Netlink core composes the
message by iterating over all option instances, calling their option objects' getter
functions, and using the returned values.
There are two special kinds of options. The first is the per-port option. In this
case a per-port flag is set on the option object, and during registration the option
facility creates multiple option instances, one for each port in the teamdev. These
options are used, for example, for enabling and disabling ports (the enable option).
The second is the array option. This is similar to the per-port option, except that
the number of instances is set on the option object before registration. Each option
instance is indexed as if it were in an array.
Modes
Modes are implemented as separate modules. The Team core exposes a mode string
option through which a userspace application can select the desired mode by name.
Modes implement handlers which are called from the Team core. These handlers define
the behaviour:
init - Called after the mode is selected. A typical place for allocating
memory and registering mode-specific options.
exit - Called after the mode is deselected.
receive - Called from the receive hook. This handler allows the mode to look at
incoming skbs and modify them if needed.
transmit - Called from the Team core transmit function. This is the place where
transmit port selection takes place.
port_enter - Called whenever a port is added to the teamdev.
port_leave - Called whenever a port is removed from the teamdev.
port_change_mac - Called when a hardware address change is detected.
There are five modes defined:
broadcast - Basic mode in which all packets are sent via all available ports.
roundrobin - Basic mode with a very simple transmit port-selection algorithm
based on looping around the port list. This is the only mode able to run on
its own, without userspace interaction.
random - Basic mode similar to the previous one, except that the transmit port
is selected randomly for each outgoing skb.
activebackup - In this mode, only one port is active at a time and able to
transmit and receive skbs. The rest of the ports are backup ports.
The mode exposes an activeport option through which a userspace application can
specify the active port.
loadbalance - A more complex mode used, for example, for LACP and for
userspace-controlled transmit and receive load balancing. The LACP protocol is
part of the 802.3ad standard and is very common on smart switches.
Hashing is used to identify similar skbs. The hash is 8 bits long, so there are
256 hash buckets (multiple different skbs can fall into one bucket). The
hash computation mechanism is BPF (Berkeley Packet Filter) based. This
mode exposes the following options:
bpf_hash_func - A binary option containing the BPF function that computes the hash
from an skb. Userspace can assemble and set up a function which computes the hash
from the desired parts of the skb, for example source and destination IP addresses,
hardware addresses, etc.
lb_hash_stats and lb_port_stats - Read-only array options that expose internal skb
transmit and receive counters for each of the 256 hashes and for each port. A
userspace application can use these counters to determine the load on each port and
possibly rebalance hashes to different ports.
lb_stats_refresh_interval - Tells the driver how often it should refresh the
previously described stats options. If the stats of any port or hash change, this is
communicated to userspace (similar to a port status change event).
lb_tx_hash_to_port_mapping - An array option used for hash-to-port mapping.
It allows a userspace application to tell the driver which port should be used for
transmitting an skb belonging to a certain hash.
lb_tx_method - A string option with two possible values. The hash value
tells the driver to use the computed hash directly to get the transmit port.
The hash_to_port_mapping value turns on hash-to-port mapping and
takes the previous array option values into account when selecting a port to
transmit the skb.
If users need another mode, because the modes listed above do not cover a
special case they have, they can easily implement their own. The mode API is very
well defined and can easily be enhanced to suit other needs.
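As an illustration of the transmit handler, here is a simplified Python sketch of port selection for the roundrobin and loadbalance modes. The real implementations live in the kernel in C; the class names below are invented for this sketch, and loadbalance is shown in its simplest form (hash used directly, no port mapping).

```python
# Sketch of transmit port selection for two Team modes.

class RoundRobinMode:
    """Loops around the port list, one port per outgoing skb."""
    def __init__(self, ports):
        self.ports = ports
        self._next = 0

    def transmit_port(self):
        port = self.ports[self._next % len(self.ports)]
        self._next += 1
        return port

class LoadBalanceMode:
    """Maps the 8-bit skb hash (256 buckets) onto the available ports."""
    def __init__(self, ports):
        self.ports = ports

    def transmit_port(self, skb_hash):
        return self.ports[(skb_hash & 0xFF) % len(self.ports)]

rr = RoundRobinMode(["eth0", "eth1"])
print([rr.transmit_port() for _ in range(4)])  # ['eth0', 'eth1', 'eth0', 'eth1']
```

Packets with the same hash always leave through the same port, which keeps the frames of one flow in order.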
teamd
This is the major part of the project. Whenever possible, it is preferred to
implement features in teamd rather than in the Team driver. teamd can be thought
of as the "puppeteer", with teamdev as its "puppet".
For a detailed list and descriptions of teamd command line parameters, please see
the appropriate manpage (man 8 teamd).
teamd takes care of teamdev (Team driver instance) creation and deletion. The
teamdev is created during teamd start-up and destroyed before teamd terminates.
The name of the teamdev (for example team0) is specified in the config file. More
about the config will be said later in the text.
Link-watchers
Link-watchers serve link monitoring purposes. Depending on the particular type, they use different
methods to find out whether a port is capable of data transfers, in other words whether "the link
is up".
There are currently the following types:
ethtool - Uses the Libteam lib to get port ethtool state changes.
arp_ping - ARP requests are sent through a port. If an ARP reply is received, the link is
considered to be up. The target IP address, interval and other options can be set up in the teamd
config.
nsna_ping - Similar to the previous one, only it uses the IPv6 Neighbour Solicitation and
Neighbour Advertisement mechanism. This is an alternative to arp_ping and comes in handy in
pure-IPv6 environments.
Either one link-watch is set for all ports, or each port can have its own link-watch. This allows
users to build more complex setups; for example, port eth0 can use the ethtool link-watch while
eth1 uses arp_ping. Users can also specify multiple link-watchers to be used at the same time. In
that case, the link is up if any of the link-watchers reports the link up.
Runners
Runners determine the behaviour of the Team device. Each runner puts the kernel
Team driver into the mode it needs. Runners watch for port link state changes
(propagated by the selected link-watch) and react to them. They may implement
other functionality as well.
The following runners can be used (Team driver modes are stated in parentheses):
broadcast (broadcast) - Does almost nothing; it only puts the teamdev into
broadcast mode.
roundrobin (roundrobin) - Does almost nothing; it only puts the teamdev into
roundrobin mode.
random (random) - Does almost nothing; it only puts the teamdev into
random mode.
activebackup (activebackup) - Watches for link changes and selects the active port
to be used for data transfers. Each port can be configured with a priority and
marked "sticky" or not. Being "sticky" here means not being deactivated even when
a port with a higher priority regains its link.
loadbalance (loadbalance) - For passive load balancing, the runner only sets up
the BPF hash function which determines the transmit port for each skb. For active
load balancing, the runner moves hashes among the available ports, trying to reach
a perfect balance. The lb_hash_stats array option is used to get statistics, and
the lb_tx_hash_to_port_mapping array option is used to map hashes to TX ports.
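To illustrate what an active loadbalance runner might do with the lb_hash_stats counters, here is a hedged Python sketch of one greedy rebalancing step. The actual teamd algorithm may differ; the function name and the greedy strategy are choices made for this example.

```python
# Sketch of an active-rebalancing step: given observed traffic per hash,
# assign the heaviest hashes first, each to the currently least loaded
# port, producing a new hash-to-port mapping (as would be written back
# via the lb_tx_hash_to_port_mapping array option).
def rebalance(hash_bytes, ports):
    """hash_bytes: dict mapping 8-bit hash value -> observed byte count."""
    load = {p: 0 for p in ports}
    mapping = {}
    for h, nbytes in sorted(hash_bytes.items(), key=lambda kv: -kv[1]):
        port = min(load, key=load.get)  # least loaded port so far
        mapping[h] = port
        load[port] += nbytes
    return mapping

mapping = rebalance({0: 900, 1: 500, 2: 400}, ["eth0", "eth1"])
# The heavy hash lands alone on one port; the two lighter hashes share
# the other, so both ports end up carrying roughly 900 bytes.
```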
Config
teamd is configured using a JSON config string. This can be passed to teamd either
on the command line or in a file. The JSON format was chosen because it makes it
easy to specify (and parse) hierarchical configurations.
Example teamd config (teamd1.conf):
{
    "device": "team0",
    "hwaddr": "10:22:33:44:55:66",
    "runner": {"name": "activebackup"},
    "link_watch": {
        "name": "nsna_ping",
        "interval": 200,
        "missed_max": 15,
        "target_host": "fe80::210:18ff:feaa:bbcc"
    },
    "ports": {
        "eth0": {
            "prio": -10,
            "sticky": true
        },
        "eth1": {
            "prio": 100,
            "link_watch": {"name": "ethtool"}
        },
        "eth2": {
            "link_watch": {
                "name": "arp_ping",
                "interval": 100,
                "missed_max": 30,
                "source_host": "192.168.23.2",
                "target_host": "192.168.23.1"
            }
        },
        "eth3": {}
    }
}
The config is pretty much self-explanatory; only the link_watch sections might look
confusing. In this example, the default link-watch is nsna_ping, set up to send
Neighbour Solicitations every 200 milliseconds. The maximum number of missed
replies (Neighbour Advertisements) is 15; if more replies are missed, the link is
considered down. Ports eth1 and eth2 specify their own link-watchers.
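Because the teamd config is plain JSON, any JSON library can load and inspect it. The following Python snippet parses the example config above and reads back a few values:

```python
# Parse the teamd1.conf example with Python's standard json module.
import json

conf = json.loads("""
{
    "device": "team0",
    "hwaddr": "10:22:33:44:55:66",
    "runner": {"name": "activebackup"},
    "link_watch": {
        "name": "nsna_ping",
        "interval": 200,
        "missed_max": 15,
        "target_host": "fe80::210:18ff:feaa:bbcc"
    },
    "ports": {
        "eth0": {"prio": -10, "sticky": true},
        "eth1": {"prio": 100, "link_watch": {"name": "ethtool"}},
        "eth2": {"link_watch": {"name": "arp_ping", "interval": 100,
                                "missed_max": 30,
                                "source_host": "192.168.23.2",
                                "target_host": "192.168.23.1"}},
        "eth3": {}
    }
}
""")

print(conf["runner"]["name"])           # activebackup
print(conf["ports"]["eth0"]["sticky"])  # True
```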
Another example of teamd config (teamd2.conf):
{
"device": "team0",
"runner": {
"name": "lacp",
"active": true,
"fast_rate": true,
"tx_hash": ["eth", "ipv4", "ipv6"]
},
"link_watch": {"name": "ethtool"},
"ports": {"eth1": {}, "eth2": {}}
}
In this example, the tx_hash section is worth mentioning: it specifies which parts
of the skb should be used to compute the hash.
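A simplified Python sketch of the idea behind tx_hash: concatenate the chosen header fields and hash them, so that all packets of one flow map to the same 8-bit bucket. This models the concept only; the real hashing is done by a BPF program in the kernel, and the function below is invented for illustration.

```python
# Model of fragment-driven hashing: the tx_hash list selects which
# header fields feed the hash; the result is reduced to 8 bits,
# giving the 256 buckets the loadbalance mode works with.
import hashlib

def skb_hash(skb, fragments):
    """skb: dict of header fields; fragments: tx_hash-style field list."""
    data = b"".join(str(skb[f]).encode("ascii")
                    for f in fragments if f in skb)
    return hashlib.sha256(data).digest()[0]  # first byte: 0..255

skb = {"eth": "52:54:00:b2:a7:f1", "ipv4": "192.168.23.2"}
h1 = skb_hash(skb, ["eth", "ipv4"])
h2 = skb_hash(skb, ["eth", "ipv4"])
assert h1 == h2 and 0 <= h1 <= 255  # deterministic per flow
```

Because the hash depends only on the selected fields, every packet of a given flow hits the same bucket and thus the same transmit port, which avoids reordering within the flow.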
For a detailed list and descriptions of available config options, please see the
appropriate manpage (man 8 teamd.conf).
Example
# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN mode DEFAULT
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode
# teamd -f teamd2.conf -k
# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN mode DEFAULT
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode
DEFAULT qlen 1000
link/ether 52:54:00:b2:a7:f1 brd ff:ff:ff:ff:ff:ff
3: eth3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT qlen
1000
link/ether 00:07:e9:11:22:33 brd ff:ff:ff:ff:ff:ff
4: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc pfifo_fast state DOWN mode DEFAULT
qlen 1000
link/ether 52:54:00:3d:c7:6d brd ff:ff:ff:ff:ff:ff
5: eth2: <BROADCAST,MULTICAST> mtu 1500 qdisc pfifo_fast state DOWN mode DEFAULT
qlen 1000
link/ether 52:54:00:73:15:c2 brd ff:ff:ff:ff:ff:ff