Date:	Tue, 18 Jan 2011 17:45:45 -0800
From:	Jay Vosburgh <fubar@...ibm.com>
To:	Nicolas de Pesloüan <nicolas.2p.debian@...il.com>
cc:	"Oleg V. Ukhno" <olegu@...dex-team.ru>,
	John Fastabend <john.r.fastabend@...el.com>,
	"David S. Miller" <davem@...emloft.net>,
	"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
	Sébastien Barré <sebastien.barre@...ouvain.be>,
	Christophe Paasch <christoph.paasch@...ouvain.be>
Subject: Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for single TCP session balancing

Nicolas de Pesloüan <nicolas.2p.debian@...il.com> wrote:

>On 18/01/2011 21:24, Jay Vosburgh wrote:
>> Nicolas de Pesloüan <nicolas.2p.debian@...il.com> wrote:
>
>>>>> - it is possible to detect path failure using arp monitoring instead of
>>>>> miimon.
>>
>> 	I don't think this is true, at least not for the case of
>> balance-rr.  Using ARP monitoring with any sort of load balance scheme
>> is problematic, because the replies may be balanced to a different slave
>> than the sender.
>
>Can't we achieve the expected arp monitoring by using the exact same
>artifice that Oleg suggested: using a different source MAC per slave for
>arp monitoring, so that the return path matches the sending path?

	It's not as simple with ARP, because it's a control protocol
that has side effects.

	First, the MAC level broadcast ARP probes from bonding would
have to be round robined in such a manner that they regularly arrive at
every possible slave.  A single broadcast won't be sent to more than one
member of the channel group by the switch.  We can't do multiple unicast
ARPs with different destination MAC addresses, because we'd have to
track all of those MACs somewhere (keep track of the MAC of every slave
on each peer we're monitoring).  I suspect that snooping switches will
get all whiny about port flapping and the like.

	We could have a separate IP address per slave, used only for
link monitoring, but that's a huge headache.  Actually, it's a lot like
the multi-link stuff I've been working on (and posted RFC of in
December), but that doesn't use ARP (it segregates slaves by IP subnet,
and balances at the IP layer).  Basically, you need an active overlay
protocol to handle the map of which slave goes where (which multi-link
has).

	So, maybe we have the ARP replies massaged such that the
Ethernet header source and the ARP sender hardware address don't match.

	So the probes from bonding currently look like this:

MAC-A > ff:ff:ff:ff:ff:ff Request who-has 10.0.4.2 tell 10.0.1.1

	Where MAC-A is the bond's MAC address.  And the replies now look
like this:

MAC-B > MAC-A, Reply 10.0.4.2 is-at MAC-B

	Where MAC-B is the MAC of the peer's bond.  The massaged replies
would be of the form:

MAC-C > MAC-A, Reply 10.0.4.2 is-at MAC-B

	where MAC-C is the slave "permanent" address (which is really a
fake address to manipulate the switch's hash), and MAC-B is whatever the
real MAC of the bond is.  I don't think we can mess with MAC-B in the
reply (the "is-at" part), because that would update ARP tables and such.
If we change MAC-A in the reply, they're liable to be filtered out.  I
really don't know if putting MAC-C in there as the source would confuse
snooping switches or not.
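
	Just to make the layout concrete, here is a minimal userspace
sketch (purely illustrative, not bonding code; the struct and function
names are made up) of what such a massaged reply would carry: the
Ethernet source is the per-slave fake MAC (MAC-C), while the ARP sender
hardware address stays the bond's real MAC (MAC-B).

#include <string.h>
#include <stdint.h>
#include <net/ethernet.h>	/* struct ether_header, ETHERTYPE_*     */
#include <net/if_arp.h>		/* struct arphdr, ARPHRD_*, ARPOP_REPLY */
#include <netinet/in.h>		/* htons()                              */

struct arp_eth_reply {
	struct ether_header eth;
	struct arphdr       arp;
	uint8_t  ar_sha[ETH_ALEN];	/* sender hw address (MAC-B) */
	uint8_t  ar_sip[4];		/* sender IP (10.0.4.2)      */
	uint8_t  ar_tha[ETH_ALEN];	/* target hw address (MAC-A) */
	uint8_t  ar_tip[4];		/* target IP (10.0.1.1)      */
} __attribute__((packed));

static void build_massaged_reply(struct arp_eth_reply *r,
				 const uint8_t *mac_a,	/* probing bond  */
				 const uint8_t *mac_b,	/* replying bond */
				 const uint8_t *mac_c,	/* slave "fake"  */
				 const uint8_t *sip, const uint8_t *tip)
{
	memcpy(r->eth.ether_dhost, mac_a, ETH_ALEN);
	memcpy(r->eth.ether_shost, mac_c, ETH_ALEN);	/* steers the switch hash */
	r->eth.ether_type = htons(ETHERTYPE_ARP);

	r->arp.ar_hrd = htons(ARPHRD_ETHER);
	r->arp.ar_pro = htons(ETHERTYPE_IP);
	r->arp.ar_hln = ETH_ALEN;
	r->arp.ar_pln = 4;
	r->arp.ar_op  = htons(ARPOP_REPLY);

	memcpy(r->ar_sha, mac_b, ETH_ALEN);	/* "is-at" stays the real MAC */
	memcpy(r->ar_sip, sip, 4);
	memcpy(r->ar_tha, mac_a, ETH_ALEN);
	memcpy(r->ar_tip, tip, 4);
}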

	One other thought I had while chewing on this is to run the LACP
protocol exchange between the bonding peers directly, instead of between
each bond and each switch.  I have no idea if this would work or not,
but the theory would look something like the "VLAN tunnel" topology for
the switches, but the bonds at the ends are configured for 802.3ad.  To
make this work, bonding would have to be able to run multiple LACP
instances (one for each bonding peer on the network) over a single
aggregator (or permit slaves to belong to multiple active aggregators).
This would basically be the same as the multi-link business, except
using LACP for the active protocol to build the map.

	A distinguished correspondent (who may confess if he so chooses)
also suggested 802.2 LLC XID or TEST frames, which have been discussed
in the past.  Those don't have side effects, but I'm not sure if either
is technically feasible, or if we really want bonding to have a
dependency on llc.  They would also only interop with hosts that respond
to the XID or TEST.  I haven't thought about this in detail for a number
of years, but I think the LLC DSAP / SSAP space is pretty small.

>>>>> - changing the destination MAC address of egress packets is not
>>>>> necessary, because egress path selection forces ingress path selection
>>>>> due to the VLAN.
>>
>> 	This is true, with one comment: Oleg's proposal we're discussing
>> changes the source MAC address of outgoing packets, not the destination.
>> The purpose being to manipulate the src-mac balancing algorithm on the
>> switch when the packets are hashed at the egress port channel group.
>> The packets (for a particular destination) all bear the same destination
>> MAC, but (as I understand it) are manually assigned tailored source MAC
>> addresses that hash to sequential values.
>
>Yes, you're right.
>
>> 	That's true.  The big problem with the "VLAN tunnel" approach is
>> that it's not tolerant of link failures.
>
>Yes, except if we find a way to make arp monitoring reliable in a load-balancing situation.
>
>[snip]
>
>> 	This is essentially the same thing as the diagram I pasted in up
>> above, except with VLANs and an additional layer of switches between the
>> hosts.  The multiple VLANs take the place of multiple discrete switches.
>>
>> 	This could also be accomplished via bridge groups (in
>> Cisco-speak).  For example, instead of VLAN 100, that could be bridge
>> group X, VLAN 200 is bridge group Y, and so on.
>>
>> 	Neither the VLAN nor the bridge group methods handle link
>> failures very well; if, in the above diagram, the link from "switch 2
>> vlan 100" to "host B" fails, there's no way for host A to know to stop
>> sending to "switch 1 vlan 100," and there's no backup path for VLAN 100
>> to "host B."
>
>Couldn't we "arp monitor" the destination MAC address of host B,
>on both paths? That way, host A would know that a given path is down,
>because the return path would be the same as the sending path. The target
>host should send the reply on the slave on which it received the request,
>which is the normal way to reply to an ARP request.

	I think you can only get away with this if each slave set (where
a "set" is one slave from each bond that's attending our little load
balancing party) is on a separate switch domain, and the switch domains
are not bridged together.  Otherwise the switches will flap their MAC
tables as they update from each probe that they see.

	As for the reply going out the same slave, to do that, bonding
would have to intercept the ARP traffic (because ARPs arriving on slaves
are normally assigned to the bond itself, not the slave) and track and
tweak them.

	Lastly, bonding would again have to maintain a map, showing
which destinations are reachable via which set of slaves.  All peer
systems (needing to have per-slave link monitoring) would have to be ARP
targets.
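
	Just to sketch what that map would have to carry (nothing like
this exists in bonding today; the names and the fixed bound are purely
illustrative, and __be32 comes from linux/types.h in a kernel build):

/* Hypothetical only: one entry per monitored peer, recording which of
 * our slaves can currently reach it and when we last heard from it on
 * each slave.
 */
#define MONITOR_MAX_SLAVES	8	/* illustrative bound, not a kernel constant */

struct peer_path_map {
	__be32		peer_ip;			/* ARP target IP                 */
	unsigned long	reachable;			/* bitmap of working slaves      */
	unsigned long	last_rx[MONITOR_MAX_SLAVES];	/* per-slave last reply, jiffies */
};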

>> 	One item I'd like to see some more data on is the level of
>> reordering at the receiver in Oleg's system.
>
>This is exactly the reason why I asked Oleg to do some tests with
>balance-rr. I cannot find a good reason for a possible new
>xmit_hash_policy to provide better throughput than the current
>balance-rr. If the throughput increases by, let's say, less than 20%,
>whatever the tcp_reordering value, then it is probably a dead end.

	Well, the point of making a round robin xmit_hash_policy isn't
that the throughput will be better than the existing round robin, it's
to make round-robin accessible to the 802.3ad mode.
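
	For reference, the whole thing boils down to an xmit hash helper
along these lines (a sketch only, not Oleg's patch; the function name
and the bare static counter are illustrative, and it would live in
bond_main.c where the needed headers are already pulled in):

/* Round-robin "hash": ignore the packet contents entirely and hand out
 * slave indexes in sequence, so 802.3ad can stripe a single flow across
 * the aggregator the way balance-rr does.
 */
static int bond_xmit_hash_policy_rr(struct sk_buff *skb, int count)
{
	static atomic_t rr_counter = ATOMIC_INIT(0);

	/* skb deliberately unused; every packet advances the counter */
	return (unsigned int)atomic_inc_return(&rr_counter) % count;
}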

>> 	One of the reasons round robin isn't as useful as it once was is
>> due to the rise of NAPI and interrupt coalescing, both of which will
>> tend to increase the reordering of packets at the receiver when the
>> packets are evenly striped.  In the old days, it was one interrupt, one
>> packet.  Now, it's one interrupt or NAPI poll, many packets.  With the
>> packets striped across interfaces, this will tend to increase
>> reordering.  E.g.,
>>
>> 	slave 1		slave 2		slave 3
>> 	Packet 1	P2		P3
>> 	P4		P5		P6
>> 	P7		P8		P9
>>
>> 	and so on.  A poll of slave 1 will get packets 1, 4 and 7 (and
>> probably several more), then a poll of slave 2 will get 2, 5 and 8, etc.
>
>Any chance to receive P1, P2, P3 on slave 1, P4, P5, P6 on slave 2 and
>P7, P8, P9 on slave 3, possibly by sending grouped packets, changing the
>sending slave every N packets instead of every packet? I think we already
>discussed this possibility a few months or years ago on the bonding-devel
>ML. As far as I remember, the idea was not developed because it was
>not easy to find the right number of packets to send through the same
>slave. Anyway, this might help reduce out-of-order delivery.

	Yes, this came up several years ago, and, basically, there's no
way to do it perfectly.  An interesting experiment would be to see if
sending groups (perhaps close to the NAPI weight of the receiver) would
reduce reordering.
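
	Something along these lines, purely as a sketch (rr_burst would
be a new, hypothetical tunable; nothing like it exists today):

/* Round-robin in bursts: keep rr_burst consecutive packets on the same
 * slave before moving to the next one, so consecutive packets tend to
 * arrive on one interface and, with luck, within one NAPI poll.
 */
static unsigned int rr_burst = 8;	/* hypothetical tunable */

static int bond_xmit_hash_policy_rr_burst(struct sk_buff *skb, int count)
{
	static atomic_t pkt_counter = ATOMIC_INIT(0);
	unsigned int seq = (unsigned int)atomic_inc_return(&pkt_counter);

	return (seq / rr_burst) % count;	/* same slave for rr_burst packets */
}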

>> 	Barring evidence to the contrary, I presume that Oleg's system
>> delivers out of order at the receiver.  That's not automatically a
>> reason to reject it, but this entire proposal is sufficiently complex to
>> configure that very explicit documentation will be necessary.
>
>Yes, and this is already true for some bonding modes and in particular for balance-rr.

	I don't think any modes other than balance-rr will deliver out
of order normally.  It can happen during edge cases, e.g., alb
rebalance, or the layer3+4 hash with IP fragments, but I'd expect those
to be at a much lower rate than what round robin causes.
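
	For the fragment case, the reason is visible in the shape of the
layer3+4 policy; roughly (paraphrased from memory, not an exact quote
of bond_main.c):

/* Ports are only mixed in for unfragmented TCP/UDP, so later fragments
 * of a flow fall back to the L3-only value and may pick a different
 * slave than the first fragment did.
 */
static int l34_hash_sketch(struct sk_buff *skb, int count)
{
	struct iphdr *iph = ip_hdr(skb);
	int layer4_xor = 0;

	if (skb->protocol == htons(ETH_P_IP)) {
		if (!(iph->frag_off & htons(IP_MF | IP_OFFSET)) &&
		    (iph->protocol == IPPROTO_TCP ||
		     iph->protocol == IPPROTO_UDP)) {
			__be16 *l4 = (__be16 *)((u32 *)iph + iph->ihl);

			layer4_xor = ntohs(l4[0] ^ l4[1]);	/* src port ^ dst port */
		}
		return (layer4_xor ^
			(ntohl(iph->saddr ^ iph->daddr) & 0xffff)) % count;
	}

	return 0;	/* non-IP handled elsewhere */
}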

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@...ibm.com
