Date:	Tue, 18 Jan 2011 12:24:28 -0800
From:	Jay Vosburgh <fubar@...ibm.com>
To:	Nicolas de Pesloüan <nicolas.2p.debian@...il.com>
cc:	"Oleg V. Ukhno" <olegu@...dex-team.ru>,
	John Fastabend <john.r.fastabend@...el.com>,
	"David S. Miller" <davem@...emloft.net>,
	"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
	Sébastien Barré <sebastien.barre@...ouvain.be>,
	Christophe Paasch <christoph.paasch@...ouvain.be>
Subject: Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing

Nicolas de Pesloüan <nicolas.2p.debian@...il.com> wrote:

>On 18/01/2011 16:28, Oleg V. Ukhno wrote:
>> On 01/18/2011 05:54 PM, Nicolas de Pesloüan wrote:
>>> I remember a topology (described by Jay, for as far as I remember),
>>> where two hosts were connected through two distinct VLANs. In such
>>> topology:
>>> - it is possible to detect path failure using arp monitoring instead of
>>> miimon.

	I don't think this is true, at least not for the case of
balance-rr.  Using ARP monitoring with any sort of load balance scheme
is problematic, because the replies may be balanced to a different slave
than the sender.

>>> - changing the destination MAC address of egress packets is not
>>> necessary, because egress path selection forces ingress path selection
>>> due to the VLAN.

	This is true, with one comment: Oleg's proposal, which we're
discussing, changes the source MAC address of outgoing packets, not the
destination, the purpose being to manipulate the src-mac balancing
algorithm on the switch when the packets are hashed at the egress port
channel group.  The packets (for a particular destination) all bear the
same destination MAC, but (as I understand it) are manually assigned
tailored source MAC addresses that hash to sequential values.
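
	To make the trick concrete, below is a minimal sketch in C of
the kind of src-mac hash a switch might use to pick a port channel
member.  The particular algorithm (XOR of the low-order source and
destination octets, modulo the member count) is an illustrative
assumption, not any specific vendor's implementation:

#include <stdio.h>
#include <stdint.h>

/*
 * Hypothetical switch-side hash selecting a port channel member.
 * The XOR-of-low-octets scheme is an assumption for illustration.
 */
static unsigned int egress_member(const uint8_t *src_mac,
				  const uint8_t *dst_mac,
				  unsigned int n_members)
{
	return (src_mac[5] ^ dst_mac[5]) % n_members;
}

int main(void)
{
	/* One fixed destination MAC, as in Oleg's setup. */
	const uint8_t dst[6] = { 0x00, 0x16, 0x3e, 0x00, 0x00, 0x10 };
	unsigned int i;

	/*
	 * Tailored source MACs differing only in the last octet land
	 * on sequential members of a four-port channel group.
	 */
	for (i = 0; i < 4; i++) {
		uint8_t src[6] = { 0x02, 0x00, 0x00, 0x00, 0x00,
				   (uint8_t)i };

		printf("src ...:%02x -> member port %u\n",
		       src[5], egress_member(src, dst, 4));
	}
	return 0;
}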

>> In the case with two VLANs - yes, this shouldn't be necessary (but it
>> needs to be tested, I am not sure), but within one VLAN it is essential
>> for correct rx load striping.
>
>Changing the destination MAC address is definitely not required if you
>segregate each path in a distinct VLAN.
>
>            +-------------------+     +-------------------+
>    +-------|switch 1 - vlan 100|-----|switch 2 - vlan 100|-------+
>    |       +-------------------+     +-------------------+       |
>+------+              |                         |              +------+
>|host A|              |                         |              |host B|
>+------+              |                         |              +------+
>    |       +-------------------+     +-------------------+       |
>    +-------|switch 3 - vlan 200|-----|switch 4 - vlan 200|-------+
>            +-------------------+     +-------------------+
>
>Even in the presence of an ISL between some switches, packets sent through
>host A's interface connected to vlan 100 will only enter host B through
>the interface connected to vlan 100. So every slave of the bonding
>interface can use the same MAC address.

	That's true.  The big problem with the "VLAN tunnel" approach is
that it's not tolerant of link failures.

>Of course, changing the destination address would be required in order to
>achieve ingress load balancing on a *single* LAN. But, as Jay noted at the
>beginning of this thread, this would violate 802.3ad.
>
>>> I think the only point is whether we need a new xmit_hash_policy for
>>> mode=802.3ad or whether mode=balance-rr could be enough.
>> Maybe, but it seems fair enough not to restrict this feature only to
>> non-LACP aggregate links; dynamic aggregation may be useful (it
>> sometimes helps to avoid switch misconfiguration (misconfigured slaves
>> on the switch side) without loss of service).
>
>You are right, but such a LAN setup needs to be carefully designed and
>built. I'm not sure that an automatic channel aggregation system is the
>right way to do it. Hence the reason why I suggest using balance-rr with
>VLANs.

	The "VLAN tunnel" approach is a derivative of an actual switch
topology that balance-rr was originally intended for, many moons ago.
This is described in the current bonding.txt; I'll cut & paste a bit
here:

12.2 Maximum Throughput in a Multiple Switch Topology
-----------------------------------------------------

        Multiple switches may be utilized to optimize for throughput
when they are configured in parallel as part of an isolated network
between two or more systems, for example:

                       +-----------+
                       |  Host A   | 
                       +-+---+---+-+
                         |   |   |
                +--------+   |   +---------+
                |            |             |
         +------+---+  +-----+----+  +-----+----+
         | Switch A |  | Switch B |  | Switch C |
         +------+---+  +-----+----+  +-----+----+
                |            |             |
                +--------+   |   +---------+
                         |   |   |
                       +-+---+---+-+
                       |  Host B   | 
                       +-----------+

        In this configuration, the switches are isolated from one
another.  One reason to employ a topology such as this is for an
isolated network with many hosts (a cluster configured for high
performance, for example); using multiple smaller switches can be more
cost effective than a single larger switch, e.g., on a network with 24
hosts, three 24 port switches can be significantly less expensive than
a single 72 port switch.

        If access beyond the network is required, an individual host
can be equipped with an additional network device connected to an
external network; this host then additionally acts as a gateway.

	[end of cut]

	This was described to me some time ago as an early usage model
for balance-rr using multiple 10 Mb/sec switches.  It has the same link
monitoring problems as the "VLAN tunnel" approach, although modern
switches with "trunk failover" type of functionality may be able to
mitigate the problem.

>>> Oleg, would you mind trying the above "two VLAN" topology with
>>> mode=balance-rr and reporting any results?  For high-availability
>>> purposes, it's obviously necessary to set up those VLANs on distinct
>>> switches.
>> I'll do it, but it will take some time to set up the test environment,
>> maybe several days.
>
>Thanks. For testing purposes, it is enough to set up those VLANs on a
>single switch if that is easier for you.
>
>> You mean the following topology:
>
>See above.
>
>> (I'm sure it will work as desired if each host is connected to each
>> switch with only one slave link; if there are more slaves per switch,
>> I am unsure)?
>
>If you want to use more than 2 slaves per host, then you need more than 2
>VLANs. You also need the exact same number of slaves on all hosts, as
>egress path selection causes ingress path selection at the other side.
>
>            +-------------------+     +-------------------+
>    +-------|switch 1 - vlan 100|-----|switch 2 - vlan 100|-------+
>    |       +-------------------+     +-------------------+       |
>+------+              |                         |              +------+
>|host A|              |                         |              |host B|
>+------+              |                         |              +------+
>  | |       +-------------------+     +-------------------+       | |
>  | +-------|switch 3 - vlan 200|-----|switch 4 - vlan 200|-------+ |
>  |         +-------------------+     +-------------------+         |
>  |                   |                         |                   |
>  |                   |                         |                   |
>  |         +-------------------+     +-------------------+         |
>  +---------|switch 5 - vlan 300|-----|switch 6 - vlan 300|---------+
>            +-------------------+     +-------------------+
>
>Of course, you can add other hosts to vlan 100, 200 and 300, with the
>exact same configuration as host A or host B.

	This is essentially the same thing as the diagram I pasted in up
above, except with VLANs and an additional layer of switches between the
hosts.  The multiple VLANs take the place of multiple discrete switches.
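
	To make the path coupling concrete, here is a minimal
userspace sketch of round-robin slave selection (a simplified model of
balance-rr's per-packet counter, not the actual driver code).  With
each slave segregated into its own VLAN, slave i on host A can only
reach slave i on host B, which is why both ends need the same number
of slaves:

#include <stdio.h>

/* Simplified model of balance-rr transmit slave selection. */
struct bond {
	unsigned int tx_counter;
	unsigned int slave_cnt;
};

static unsigned int rr_select_slave(struct bond *bond)
{
	return bond->tx_counter++ % bond->slave_cnt;
}

int main(void)
{
	struct bond host_a = { .tx_counter = 0, .slave_cnt = 3 };
	unsigned int pkt;

	for (pkt = 1; pkt <= 6; pkt++) {
		unsigned int s = rr_select_slave(&host_a);

		/* Slave s sits in VLAN (s + 1) * 100 in the diagram. */
		printf("packet %u -> slave %u (VLAN %u)\n",
		       pkt, s, (s + 1) * 100);
	}
	return 0;
}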

	This could also be accomplished via bridge groups (in
Cisco-speak).  For example, VLAN 100 becomes bridge group X, VLAN 200
becomes bridge group Y, and so on.

	Neither the VLAN nor the bridge group methods handle link
failures very well; if, in the above diagram, the link from "switch 2
vlan 100" to "host B" fails, there's no way for host A to know to stop
sending to "switch 1 vlan 100," and there's no backup path for VLAN 100
to "host B."

	One item I'd like to see some more data on is the level of
reordering at the receiver in Oleg's system.

	One of the reasons round robin isn't as useful as it once was
is the rise of NAPI and interrupt coalescing.  In the old days, it was
one interrupt, one packet.  Now, it's one interrupt or NAPI poll, many
packets.  With the packets evenly striped across interfaces, this
tends to increase reordering at the receiver.  E.g.,

	slave 1		slave 2		slave 3
	Packet 1	P2		P3
	P4		P5		P6
	P7		P8		P9

	and so on.  A poll of slave 1 will get packets 1, 4 and 7 (and
probably several more), then a poll of slave 2 will get 2, 5 and 8, etc.
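
	To illustrate, here is a small userspace toy model of that
effect (purely an illustration of batched polling over striped
packets, not kernel code):

#include <stdio.h>

#define NSLAVES 3
#define NPKTS   9

/*
 * Packets 1..NPKTS are striped round-robin across NSLAVES interfaces;
 * each "poll" then drains one interface's whole backlog before moving
 * to the next, roughly what NAPI or interrupt coalescing does when
 * many packets are pending.
 */
int main(void)
{
	int slave, pkt;

	printf("delivery order to the stack:");
	for (slave = 0; slave < NSLAVES; slave++)
		for (pkt = slave + 1; pkt <= NPKTS; pkt += NSLAVES)
			printf(" P%d", pkt);
	printf("\n");
	/* Prints: P1 P4 P7 P2 P5 P8 P3 P6 P9 */
	return 0;
}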

	I haven't done much testing with this lately, but I suspect this
behavior hasn't really changed.  Raising the tcp_reordering sysctl value
can mitigate the impact somewhat (by making TCP more tolerant of
out-of-order delivery), but that doesn't help non-TCP protocols.

	Barring evidence to the contrary, I presume that Oleg's system
delivers out of order at the receiver.  That's not automatically a
reason to reject it, but this entire proposal is sufficiently complex to
configure that very explicit documentation will be necessary.

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@...ibm.com