Message-ID: <5344.1312998372@death>
Date:	Wed, 10 Aug 2011 10:46:12 -0700
From:	Jay Vosburgh <fubar@...ibm.com>
To:	Tom Brown <sa212+glibc@...onix.com>
cc:	netdev <netdev@...r.kernel.org>
Subject: Re: Use of 802.3ad bonding for increasing link throughput

Tom Brown <sa212+glibc@...onix.com> wrote:

>[couldn't thread with '802.3ad bonding brain damaged', as I've just signed
>up]
>
>So, under what circumstances would a user actually use 802.3ad mode to
>"increase" link throughput, rather than just for redundancy? Are there any
>circumstances in which a single file, for example, could be transferred at
>multiple-NIC speed? 

	Network load balancing, by and large, increases throughput in
aggregate, not for individual connections.

>[...] The 3 hashing options are:
>
>- layer 2: presumably this always puts traffic on the same NIC, even in a
>LAG with multiple NICs? Should layer 2 ever be used?

	Perhaps the network is such that the destinations are not
bonded, and can't handle more than 1 interface's worth of throughput.
Having the "server" end bonded still permits the clients to deal with a
single IP address, handle failures of devices on the server, etc.
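
	For reference, a minimal 802.3ad setup might look something like
this (interface names and addresses are placeholders; the options are
described in bonding.txt):

	modprobe bonding mode=802.3ad miimon=100 xmit_hash_policy=layer2
	ifconfig bond0 192.0.2.1 netmask 255.255.255.0 up
	ifenslave bond0 eth0 eth1

The switch ports on the other end must be configured for LACP, or no
aggregator will form.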

>- layer2+3: can't be used for a single file, since it still hashes to the
>same NIC, and can't be used for load-balancing, since different IP
>endpoints go unintelligently to different NICs
>
>- layer3+4: seems to have exactly the same issue as layer2+3, as well as
>being non-compliant
>
>I guess my problem is in understanding whether the 802.3/802.1AX spec has
>any use at all beyond redundancy. Given the requirement to maintain frame
>order at the distributor, I can't immediately see how having a bonded
>group of, say, 3 NICs is any better than having 3 separate NICs. Have I
>missed something obvious?

	Others have answered this part already (that it permits larger
aggregate throughput to/from the host, but not single-stream throughput
greater than one interface's worth).  This is by design, to prevent out
of order delivery of packets.

	An aggregate of N devices can be better than N individual
devices in that it will gracefully handle failure of one of the devices
in the aggregate, and permits sharing of the bandwidth in aggregate
without the peers having to be hard-coded to specific destinations.
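
	To see why a single connection can't go faster than one slave,
consider a sketch of a layer3+4-style hash (illustrative shell
arithmetic patterned on the bonding.txt description, not the kernel's
exact code):

	# fixed endpoints -> fixed hash -> same slave for every packet
	src_port=33000; dst_port=80
	src_ip=$((0x0a000001)); dst_ip=$((0x0a000002))
	slaves=2
	hash=$(( (src_port ^ dst_port) ^ ((src_ip ^ dst_ip) & 0xffff) ))
	echo $(( hash % slaves ))

Hold the addresses and ports constant and the flow never moves; only
distinct flows spread across slaves.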

>And, having said that, the redundancy features seem limited. For hot
>standby, when the main link fails, you have to wait for both ends to
>timeout, and re-negotiate via LACP, and hopefully pick up the same
>lower-priority NIC, and then rely on a higher layer to request
>retransmission of the missing frame. Do any of you have any experience of
>using 802.1AX for anything useful and non-trivial?

	In the Linux implementation, as soon as the link goes down, that
port is removed from the aggregator and a new aggregator is selected
(which may be the same aggregator, depending on the option and
configuration).  Language in 802.1AX section 5.3.13 permits us to
immediately remove a failed port from an aggregator without waiting for
LACP to time out.
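
	You can watch this happen from userspace; the per-slave link
state and aggregator ID are visible in procfs:

	cat /proc/net/bonding/bond0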

>So, to get multiple-NIC speed, are we stuck with balance-rr? But
>presumably this only works if the other end of the link is also running
>the bonding driver?

	Striping a single connection across multiple network interfaces
is very difficult to do without causing packets to be delivered out of
order.

	Now, that said, if you want to have one TCP connection utilize
more than one interface's worth of throughput, then yes, balance-rr is
the only mode that may do that.  The other end doesn't have to run
bonding, but it must have sufficient aggregate bandwidth to accommodate
the aggregate rate (e.g., N slower devices feeding into one faster
device).

	Running balance-rr itself can be tricky to configure.  An
unmanaged switch may not handle multiple ports with the same MAC address
very well (e.g., sending everything to one port, or sending everything
to all the ports).  A managed switch must have the relevant ports
configured for etherchannel ("static link aggregation" in some
documentation), and the switch may balance the traffic when it leaves
the switch using its transmit algorithm.  I'm not aware of any switches
that have a round-robin balance policy, so the switch may end up hashing
your traffic anyway (which will probably drop some of your packets,
because you're feeding them in faster than the switch can send them out
after they're hashed to one switch port).

	It's possible to play games on managed switches and, e.g., put
each pair of ports (one at each end) into a separate VLAN, but schemes
like that will fail badly if a link goes down somewhere.

	If each member of the bond goes through a different unmanaged
switch, with the switches not interconnected, that may avoid those
issues (and this was
a common configuration back in the 10 Mb/sec days; it's described in
bonding.txt in more detail).  That configuration still has issues if a
link fails.  Connecting systems directly, back-to-back, should also
avoid those issues.
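
	For completeness, switching a bond to balance-rr is just a mode
change; a sketch using the sysfs interface from bonding.txt (bond and
interface names are placeholders, and the mode must be set while the
bond is down with no slaves attached):

	echo balance-rr > /sys/class/net/bond0/bonding/mode
	echo +eth0 > /sys/class/net/bond0/bonding/slaves
	echo +eth1 > /sys/class/net/bond0/bonding/slaves
	ifconfig bond0 192.0.2.1 netmask 255.255.255.0 up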

	Lastly, balance-rr will deliver traffic out of order.  Even the
best case, N slow links feeding one faster link, delivers some small
percentage out of order (in the low single digits).

	On Linux, the tcp_reordering sysctl value can be raised to
compensate, but that still increases packet overhead, is not likely to
be very efficient, and doesn't help with anything that's not TCP/IP.  I
have not tested balance-rr in a few years now, but
my recollection is that, as a best case, throughput of one TCP
connection could reach about 1.5x with 2 slaves, or about 2.5x with 4
slaves (where the multipliers are in units of "bandwidth of one slave").
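
	For example (the default is 3; the useful value depends on how
much reordering your set of links actually produces, so treat the
number as a starting point):

	sysctl -w net.ipv4.tcp_reordering=10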

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@...ibm.com