[ubuntu] 802.3ad bonding



MrMakealotofsmoke
August 2nd, 2011, 03:05 PM
Hey all, I have a Dell 5448 switch with 802.3ad/LACP support. I'm trying to trunk the two onboard Realtek gigabit ports on my file server, which is running Ubuntu 10.10.

The issue I'm having is that the bond seems to use only one NIC at a time (I'm viewing the trunk usage from the switch). My Windows 2008 server with dual Broadcom NetXtreme NICs uses about 50% on each NIC under iperf/jperf, while the Ubuntu box uses 100% on one and 0% on the other. I figured it was because it uses a layer 2 XOR hash and I couldn't get load balancing from a single MAC, yet Windows seems to handle it fine. Is this the normal behaviour in Linux?

cat /proc/net/bonding/bond0

Ethernet Channel Bonding Driver: v3.6.0 (September 26, 2009)

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer3+4 (1)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

802.3ad info
LACP rate: fast
Aggregator selection policy (ad_select): stable
Active Aggregator Info:
Aggregator ID: 1
Number of ports: 2
Actor Key: 17
Partner Key: 49
Partner Mac Address: 00:21:9b:bb:28:68

Slave Interface: eth0
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:1f:d0:2d:c5:cd
Aggregator ID: 1

Slave Interface: eth1
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:1f:d0:2d:c5:bd
Aggregator ID: 1


/etc/network/interfaces

auto lo
iface lo inet loopback

auto bond0
iface bond0 inet static
address 192.168.0.11
netmask 255.255.255.0
gateway 192.168.0.1
up /sbin/ifenslave bond0 eth0 eth1
down /sbin/ifenslave -d bond0 eth0 eth1
dns-nameservers 192.168.0.1

/etc/modprobe.d/aliases.conf

alias bond0 bonding
options bonding mode=4 miimon=100 lacp_rate=1 xmit_hash_policy=1
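
To double-check what the driver actually loaded with (rather than trusting the modprobe.d file), the bonding settings can also be read back from sysfs; something like this should echo the active mode and hash policy:

cat /sys/class/net/bond0/bonding/mode
cat /sys/class/net/bond0/bonding/xmit_hash_policy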


ifconfig


bond0 Link encap:Ethernet HWaddr 00:1f:d0:2d:c5:cd
inet addr:192.168.0.11 Bcast:192.168.0.255 Mask:255.255.255.0
inet6 addr: fe80::21f:d0ff:fe2d:c5cd/64 Scope:Link
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
RX packets:16491761 errors:0 dropped:0 overruns:0 frame:0
TX packets:9722475 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:1370044450 (1.3 GB) TX bytes:2766596650 (2.7 GB)

eth0 Link encap:Ethernet HWaddr 00:1f:d0:2d:c5:cd
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:417354 errors:0 dropped:0 overruns:0 frame:0
TX packets:3804753 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:39167816 (39.1 MB) TX bytes:206393544 (206.3 MB)
Interrupt:45 Base address:0xa000

eth1 Link encap:Ethernet HWaddr 00:1f:d0:2d:c5:cd
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:16074407 errors:0 dropped:0 overruns:0 frame:0
TX packets:5917722 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:1330876634 (1.3 GB) TX bytes:2560203106 (2.5 GB)
Interrupt:46 Base address:0xe000

jmoorse
August 2nd, 2011, 07:20 PM
Hi, good question! Just to make sure I understand: from the viewpoint of the Ubuntu server, ingress traffic (from the switch) is balanced, but egress traffic is only using one link, correct?

It is possible this is causing the issue:



xmit_hash_policy=1


After doing a little research on the hash policy here: http://www.centos.org/docs/5/html/5.2/Deployment_Guide/s3-modules-bonding-directives.html

It appears layer3+4 may not work with all LACP implementations. The bug report here: https://bugzilla.redhat.com/show_bug.cgi?id=586557 suggests using the value:



xmit_hash_policy=layer2+3


The bug report also has some syntax suggestions you may want to review.
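
If you want to try it, here is a minimal sketch of the change against the config posted above: either edit the module options and reload the bonding module, or write the policy through sysfs (older kernels may refuse the sysfs write while the bond is up, so bond0 may need to be taken down first):

# /etc/modprobe.d/aliases.conf
options bonding mode=4 miimon=100 lacp_rate=1 xmit_hash_policy=layer2+3

# or, without a module reload:
echo layer2+3 | sudo tee /sys/class/net/bond0/bonding/xmit_hash_policy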

MrMakealotofsmoke
August 3rd, 2011, 04:45 AM
Tried that, still the same.

Interesting: using the Linux box as the server and the Windows box as the client results in the Windows box using both NICs and the Linux box using one.
http://image.nizzlebix.com/files/1/bonding1.png

If I switch it around (Linux as client, Windows as server), they both use only one NIC each.
http://image.nizzlebix.com/files/1/bonding2.png

g1+g2 is the Linux box
g3+g4 is the Windows box

jmoorse
August 3rd, 2011, 05:38 AM
Please share the iperf commands used to generate the flows.

MrMakealotofsmoke
August 3rd, 2011, 03:34 PM
iperf -sD
iperf -sD -p 5002

iperf -c x.x.x.x -t 100 -i 1 -P 2
iperf -c x.x.x.x -t 100 -i 1 -P 2 -p 5002
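
A simple way to see which slave each flow actually lands on while those run is to poll the per-interface byte counters on the Linux side (iptraf shows the same thing interactively):

watch -n1 "grep -E 'eth0|eth1' /proc/net/dev"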

psusi
August 3rd, 2011, 04:47 PM
From the kernel documentation on bonding:



802.3ad or 4

IEEE 802.3ad Dynamic link aggregation. Creates
aggregation groups that share the same speed and
duplex settings. Utilizes all slaves in the active
aggregator according to the 802.3ad specification.

Slave selection for outgoing traffic is done according
to the transmit hash policy, which may be changed from
the default simple XOR policy via the xmit_hash_policy
option, documented below. Note that not all transmit
policies may be 802.3ad compliant, particularly in
regards to the packet mis-ordering requirements of
section 43.2.4 of the 802.3ad standard. Differing
peer implementations will have varying tolerances for
noncompliance.


The hash policy uses the destination address to decide which interface to send on, which has the effect of sending all traffic to a given host over the same interface. It seems that the bonding driver is brain damaged. It should not be using a hash at all in 802.3ad mode; it should be using whichever interface has the shortest tx queue at the time, regardless of the packet's destination address.

MrMakealotofsmoke
August 4th, 2011, 02:29 AM
But it does say

which may be changed from
the default simple XOR policy via the xmit_hash_policy
option

which you'd think would fix that?

psusi
August 4th, 2011, 03:10 AM
But it does say
which you'd think would fix that?

None of them would. The whole idea of using any hash policy is wrong.

jmoorse
August 7th, 2011, 03:15 AM
The hash policy uses the destination address to decide which interface to send on, which has the effect of sending all traffic to a given host over the same interface. It seems that the bonding driver is brain damaged. It should not be using a hash at all in 802.3ad mode; it should be using whichever interface has the shortest tx queue at the time, regardless of the packet's destination address.

I respectfully disagree. With layer2, layer3, or layer2+3 hashing, all traffic to the same host (given the above iperf commands, and assuming the client and server are on the same subnet) will use the same slave regardless of tx queue length.




layer2+3

This policy uses a combination of layer2 and layer3
protocol information to generate the hash.

Uses XOR of hardware MAC addresses and IP addresses to
generate the hash. The formula is

(((source IP XOR dest IP) AND 0xffff) XOR
( source MAC XOR destination MAC ))
modulo slave count

This algorithm will place all traffic to a particular
network peer on the same slave. For non-IP traffic,
the formula is the same as for the layer2 transmit
hash policy.

This policy is intended to provide a more balanced
distribution of traffic than layer2 alone, especially
in environments where a layer3 gateway device is
required to reach most destinations.

This algorithm is 802.3ad compliant.
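
To make that concrete, here is a rough worked version of that formula in bash arithmetic, using the bond's address from this thread plus a purely hypothetical peer at 192.168.0.20 (the MACs are folded down to their last octet, which is roughly what the driver does). Note that nothing from layer 4 enters the hash:

src_ip=0xC0A8000B   # 192.168.0.11 (the bond)
dst_ip=0xC0A80014   # 192.168.0.20 (hypothetical peer)
src_mac=0xcd        # last octet of 00:1f:d0:2d:c5:cd
dst_mac=0x68        # last octet of the peer's MAC (made up)
slaves=2
echo $(( ( ((src_ip ^ dst_ip) & 0xffff) ^ (src_mac ^ dst_mac) ) % slaves ))
# Every frame between this pair of hosts gets the same hash value, i.e.
# the same slave, no matter how many parallel TCP streams are open.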


I realize this doesn't help the OP, so let's try one more thing. Change the hash mode back to layer3+4 (I'm not too worried about out-of-order segments on a LAN) and run iperf with a bunch of parallel threads. When I tested at home, each thread used a single client source port and incremented by 1 for the next parallel instance. You may have hit a scenario where, with only 2 threads, both flows were hashed onto the same link by the layer3+4 XOR.

It also may not hurt to run a second server daemon on a different port.

Here is what I am suggesting you try; note this may thrash your network, so run at your own risk.



Server:
iperf -sD -p 5002
iperf -sD -p 15003

Client:
iperf -c x.x.x.x -t 100 -i 1 -P 15 -p 5002
iperf -c x.x.x.x -t 100 -i 1 -P 15 -p 15003


See if iptraf shows usage of both slaves. Good luck, thanks

psusi
August 7th, 2011, 04:24 AM
I respectfully disagree. With layer2, layer3, or layer2+3 hashing, all traffic to the same host (given the above iperf commands, and assuming the client and server are on the same subnet) will use the same slave regardless of tx queue length.

That is exactly what I said. When using 802.3ad mode, the destination address (layer 2, 3, or a combination of them) should not matter; the packet *SHOULD* be sent to the interface with the shortest queue.

jmoorse
August 7th, 2011, 04:54 AM
OK, I misunderstood what you meant the first time. You are saying that omitting the hashing would effectively do per-packet balancing across the slaves, right?

That would be nice and worth a test. I am concerned, though, because further documentation on the Linux bonding site says:



The default value is layer2. This option was added in bonding
version 2.6.3. In earlier versions of bonding, this parameter
does not exist, and the layer2 policy is the only policy. The
layer2+3 value was added for bonding version 3.2.2.


And we don't want that...

psusi
August 7th, 2011, 08:34 PM
OK, I misunderstood what you meant the first time. You are saying that omitting the hashing would effectively do per-packet balancing across the slaves, right?

No, I am saying that the kernel's bonding driver is broken here, and apparently cannot do the right thing.

hadenough
August 9th, 2011, 10:49 AM
No, I am saying that the kernel's bonding driver is broken here, and apparently cannot do the right thing.
I'm pretty sure that the Linux driver is actually doing the right thing here. The problem is that 802.3ad/802.1ax requires all "conversations" to be transmitted in order. The spec specifically requires all of a conversation to be transmitted through a single port to meet this requirement (you can change the port if necessary, but that's not relevant here). The implementer needs a distribution algorithm to choose the outgoing port, and the spec suggests a hash over various things. So, hashing is not a problem. Choosing a port based on Tx queue length would guarantee out-of-order conversations, so it can't be done. balance-rr mode will stripe across ports, but isn't 802.3ad.
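
For completeness, if striping a single flow really is the goal and 802.3ad compliance isn't, balance-rr is the usual escape hatch; a minimal sketch of the module options, with the caveats that the switch side then needs a static trunk rather than an LACP one, and some packet reordering is expected:

# /etc/modprobe.d/bonding.conf -- alternative sketch, NOT 802.3ad
alias bond0 bonding
options bonding mode=balance-rr miimon=100
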
This leads to the obvious issue of how to actually use the driver to increase link speed, rather than to provide redundancy. I don't know the answer, and I'm just trying to figure it out myself. I don't know what a "conversation" is. I think, if you know what your protocol is, you hash such that a given "conversation" goes to a given port, and you get your throughput increase by running multiple conversations simultaneously. This isn't going to help with the Linux driver because you only get the layer 2 or the 2+3 policies.
Why does the Windows driver apparently "work"? No idea. Either they're non-compliant, or they're smarter with their understanding of conversations.
I'm just writing a (non-Linux) 802.3ad driver myself, so I need to figure this out, and I need to do interop testing with Linux. If anyone needs any specific help/testing/assistance/whatever to get this going, mail me privately.

psusi
August 9th, 2011, 02:54 PM
From the conversation I started on the netdev mailing list, it seems it is indeed a problem with the standard. It does not require the packets to go out over a single link, though. What it requires is that they arrive in the same order they were sent in. Doing that means you either need some very carefully controlled cut-through switching, or you have to send all packets in the conversation over the same link. Since the kernel has to use store-and-forward switching, it has to use the latter method.

Windows does not have a generic bonding driver at all. There are some dual-ported cards, though, and they have a driver that supports bonding, probably by configuring the hardware to use a single tx queue so that the hardware preserves the ordering.

hadenough
August 9th, 2011, 04:23 PM
It does not require the packets to go out over a single link, though. What it requires is that they arrive in the same order they were sent in.

I think this is a misreading of the standard. It's not possible to guarantee frame ordering using any mechanism that is external to the distributor; it just wouldn't work in the vast majority of cases. The standard explicitly states, or implies, that the distributor must use a single link for a conversation in various places:

- 5.2.1, p13: "Frame ordering must be maintained for certain sequences of frame exchanges between MAC Clients (known as conversations, see Clause 3). The Distributor ensures that all frames of a given conversation are passed to a single port".

- 5.2.3, p14, states that the distributor is responsible for maintaining frame ordering requirements

- 5.2.4, p15: "The above requirement to maintain frame ordering is met by ensuring that all frames that compose a given conversation are transmitted on a single link in the order that they are generated by the MAC Client"

- the whole of Annex A would be redundant if the distributor could do this. In particular, everything on the dynamic reallocation of conversations to different ports would be pointless

- the whole marker protocol would be redundant if the distributor could choose multiple ports for a conversation

- and so on.

psusi
August 9th, 2011, 06:37 PM
I believe you are over-reading the provided example of a method you CAN use to meet the REQUIREMENT that ordering be preserved. It allows the distributor to switch back and forth between which link is used for a given conversation, so if you do that fast enough (faster than it takes to send a single packet), then you are effectively using both links.

hadenough
August 10th, 2011, 10:56 AM
I was trying to be polite. I have read the spec many times; I am implementing it. There is absolutely no question that your interpretation is wrong. I am not quoting a "provided example"; I am quoting normative text, and some informative text. It is possible to change ports during a conversation, as I pointed out. However, this is *not* a striping mechanism. It is a slow management mechanism that requires the use of a packet-level handshake or a timeout. The circumstances in which you might use it are listed in A.3, p106.

If you disagree with my reading of the text, then you need to provide a concrete example of the normative text, instead of asking us to believe you, when you clearly have little knowledge of the standard.

psusi
August 10th, 2011, 02:24 PM
If you disagree with my reading of the text, then you need to provide a concrete example of the normative text, instead of asking us to believe you, when you clearly have little knowledge of the standard.

Can you tell me where I can get a copy of it? I have not read it, but the people who have, and who used it to implement the kernel bonding driver, seem to think that it only requires that the ordering be maintained.

You understand the difference between normative and informative, right? The informative text is not part of the requirements. Also, I don't see how it can require any kind of handshake or warning to the other end that you are about to switch a given conversation to the other link, when the method of distribution (which hash algorithm you are using, in the case of the Linux kernel) is entirely up to the sender, can be changed at any time, and is something the receiver does not know or care about.

jmoorse
August 12th, 2011, 06:40 AM
Off topic, both of you. I don't feel like this is helping the issue at hand.

OP, did you ever have a chance to run the other iperf commands I posted a while back to see if that helped? Thanks

psusi
August 12th, 2011, 07:53 PM
Off topic, both of you. I don't feel like this is helping the issue at hand.

OP, did you ever have a chance to run the other iperf commands I posted a while back to see if that helped? Thanks

I guess you didn't follow the conversation. Running iperf or anything else won't help, because you can't split a single stream across both links. The best you can do is get one TCP connection to use one link while another connection uses the other link, and even doing that is iffy.

jmoorse
August 14th, 2011, 03:04 AM
As I suggested earlier, changing the hashing back to layer 3+4 and trying multiple iperf instances with different ports will create different flows.

The OP's original problem, as I read it, was lack of load sharing. If we can prove with iperf that this is possible, then we can investigate other ways to get his applications to utilize both links.

MrMakealotofsmoke
June 19th, 2012, 07:12 AM
Only 1 year later :p

I gave up on bonding until recently, when I set up an ESXi box and found out that trunking works fine with it.

I upgraded from 10.10 to 11.04 in the process. I got it almost working with 10.10 and it's the same with 11.04.

Current interfaces:


auto bond0
iface bond0 inet static
address 192.168.0.11
netmask 255.255.255.0
network 192.168.0.0
broadcast 192.168.0.255
gateway 192.168.0.1
dns-nameservers 192.168.0.1
dns-search home.lan
#BOND, JAMES BOND
bond-lacp_rate 1
up /sbin/ifenslave bond0 eth3 eth4
down /sbin/ifenslave -d bond0 eth3 eth4

auto eth3
iface eth3 inet manual
bond-master bond0

auto eth4
iface eth4 inet manual
bond-master bond0


/etc/modprobe.d/bonding.conf


alias bond0 bonding
options bonding mode=4 miimon=100 downdelay=200 updelay=200


What I'm getting now is this:

http://image.nizzlebix.com/files/1/stupidubuntubonding.PNG

That's with 3 iperfs running from 3 different systems (this box being the client). It now load balances between the NICs, but the overall speed doesn't go above gigabit :/

Switch is set to Layer 2/3 mode

MrMakealotofsmoke
June 19th, 2012, 08:49 AM
Changing the hashing to layer3+4 fixed it. Wewt :guitar:
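
For anyone finding this thread later: note that the 2012 modprobe config above no longer sets xmit_hash_policy at all. If the host side also needs the layer3+4 policy, the module options would look roughly like this (a sketch, not necessarily exactly what was changed here, since the switch's hash mode matters too):

# /etc/modprobe.d/bonding.conf
alias bond0 bonding
options bonding mode=4 miimon=100 downdelay=200 updelay=200 xmit_hash_policy=layer3+4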