Message-ID: <4D360408.1080104@gmail.com>
Date: Tue, 18 Jan 2011 22:20:08 +0100
From: Nicolas de Pesloüan <nicolas.2p.debian@...il.com>
To: Jay Vosburgh <fubar@...ibm.com>
CC: "Oleg V. Ukhno" <olegu@...dex-team.ru>,
John Fastabend <john.r.fastabend@...el.com>,
"David S. Miller" <davem@...emloft.net>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
Sébastien Barré <sebastien.barre@...ouvain.be>,
Christophe Paasch <christoph.paasch@...ouvain.be>
Subject: Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for
single TCP session balancing
On 18/01/2011 21:24, Jay Vosburgh wrote:
> Nicolas de Pesloüan <nicolas.2p.debian@...il.com> wrote:
>>>> - it is possible to detect path failure using arp monitoring instead of
>>>> miimon.
>
> I don't think this is true, at least not for the case of
> balance-rr. Using ARP monitoring with any sort of load balance scheme
> is problematic, because the replies may be balanced to a different slave
> than the sender.
Can't we achieve the expected ARP monitoring by using the exact same artifice that Oleg suggested:
a different source MAC per slave for the ARP probes, so that the return path matches the sending path?
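For what it's worth, a minimal user-space sketch of such a probe (this is not the bonding driver's
arp monitor; it assumes raw AF_PACKET access, and the interface name and addresses a caller would
pass in are hypothetical):

    /* Send an ARP request whose Ethernet *and* ARP source MAC are the slave's
     * own address, so that a src-mac hashing switch should return the reply
     * on the same slave.  Sketch only; error handling is minimal.
     */
    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>
    #include <arpa/inet.h>
    #include <net/if.h>
    #include <netinet/if_ether.h>
    #include <linux/if_packet.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>

    int send_arp_probe(const char *ifname, const unsigned char src_mac[6],
                       uint32_t src_ip, uint32_t target_ip)
    {
        int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ARP));
        if (fd < 0)
            return -1;

        struct ifreq ifr;
        memset(&ifr, 0, sizeof(ifr));
        strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
        if (ioctl(fd, SIOCGIFINDEX, &ifr) < 0) {
            close(fd);
            return -1;
        }

        unsigned char frame[42];              /* 14 bytes Ethernet + 28 bytes ARP */
        struct ether_header *eh = (struct ether_header *)frame;
        struct ether_arp *arp = (struct ether_arp *)(frame + 14);

        memset(eh->ether_dhost, 0xff, 6);     /* broadcast */
        memcpy(eh->ether_shost, src_mac, 6);  /* per-slave source MAC */
        eh->ether_type = htons(ETH_P_ARP);

        arp->ea_hdr.ar_hrd = htons(ARPHRD_ETHER);
        arp->ea_hdr.ar_pro = htons(ETH_P_IP);
        arp->ea_hdr.ar_hln = 6;
        arp->ea_hdr.ar_pln = 4;
        arp->ea_hdr.ar_op  = htons(ARPOP_REQUEST);
        memcpy(arp->arp_sha, src_mac, 6);     /* reply will target this MAC */
        memcpy(arp->arp_spa, &src_ip, 4);
        memset(arp->arp_tha, 0, 6);
        memcpy(arp->arp_tpa, &target_ip, 4);

        struct sockaddr_ll addr;
        memset(&addr, 0, sizeof(addr));
        addr.sll_family   = AF_PACKET;
        addr.sll_protocol = htons(ETH_P_ARP);
        addr.sll_ifindex  = ifr.ifr_ifindex;
        addr.sll_halen    = 6;
        memset(addr.sll_addr, 0xff, 6);

        ssize_t n = sendto(fd, frame, sizeof(frame), 0,
                           (struct sockaddr *)&addr, sizeof(addr));
        close(fd);
        return n == (ssize_t)sizeof(frame) ? 0 : -1;
    }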
>>>> - changing the destination MAC address of egress packets is not
>>>> necessary, because egress path selection forces ingress path selection
>>>> due to the VLAN.
>
> This is true, with one comment: Oleg's proposal we're discussing
> changes the source MAC address of outgoing packets, not the destination.
> The purpose being to manipulate the src-mac balancing algorithm on the
> switch when the packets are hashed at the egress port channel group.
> The packets (for a particular destination) all bear the same destination
> MAC, but (as I understand it) are manually assigned tailored source MAC
> addresses that hash to sequential values.
Yes, you're right.
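For illustration, a toy model of the switch side. The real hash is vendor-specific; a
low-order-bits hash on the source MAC (as Cisco documents for src-mac EtherChannel balancing,
for example) is assumed here:

    /* Toy model of a src-mac hashing switch: tailoring the last byte of each
     * flow's source MAC makes the flows land on sequential channel members.
     */
    #include <stdio.h>

    unsigned egress_member(const unsigned char src_mac[6], unsigned n_links)
    {
        return src_mac[5] % n_links;          /* assumed low-order-bits hash */
    }

    int main(void)
    {
        unsigned char mac[6] = { 0x02, 0x00, 0x00, 0x00, 0x00, 0x00 };
        for (unsigned flow = 0; flow < 4; flow++) {
            mac[5] = flow;                    /* tailored per-slave source MAC */
            printf("src MAC ...:%02x -> member %u\n",
                   mac[5], egress_member(mac, 4));
        }
        return 0;
    }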
> That's true. The big problem with the "VLAN tunnel" approach is
> that it's not tolerant of link failures.
Yes, unless we find a way to make ARP monitoring reliable in a load-balancing situation.
[snip]
> This is essentially the same thing as the diagram I pasted in up
> above, except with VLANs and an additional layer of switches between the
> hosts. The multiple VLANs take the place of multiple discrete switches.
>
> This could also be accomplished via bridge groups (in
> Cisco-speak). For example, instead of VLAN 100, that could be bridge
> group X, VLAN 200 is bridge group Y, and so on.
>
> Neither the VLAN nor the bridge group methods handle link
> failures very well; if, in the above diagram, the link from "switch 2
> vlan 100" to "host B" fails, there's no way for host A to know to stop
> sending to "switch 1 vlan 100," and there's no backup path for VLAN 100
> to "host B."
Can't we imagine "arp monitoring" the destination MAC address of host B on both paths? That way,
host A would know that a given path is down, because the return path would be the same as the
sending path. The target host should send the reply on the slave on which it received the request,
which is the normal way to reply to an ARP request.
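A sketch of what the per-path bookkeeping could look like (illustrative only, not the bonding
driver's code; the names and thresholds are made up):

    /* Track the last ARP reply seen on each slave and declare the path down
     * once too many probe intervals pass without one.
     */
    #include <stdbool.h>
    #include <stdio.h>
    #include <time.h>

    #define ARP_INTERVAL_MS 100               /* hypothetical probe interval */
    #define MISSED_MAX      3                 /* missed probes before "down" */

    struct path_monitor {
        long last_reply_ms;                   /* last ARP reply on this slave */
    };

    long now_ms(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1000L + ts.tv_nsec / 1000000L;
    }

    /* Call when an ARP reply arrives on the slave. */
    void path_reply_seen(struct path_monitor *pm)
    {
        pm->last_reply_ms = now_ms();
    }

    /* Call once per probe interval, after sending the next probe. */
    bool path_is_up(const struct path_monitor *pm)
    {
        return now_ms() - pm->last_reply_ms < (long)ARP_INTERVAL_MS * MISSED_MAX;
    }

    int main(void)
    {
        struct path_monitor pm;
        path_reply_seen(&pm);                 /* a reply just arrived */
        printf("path up: %d\n", path_is_up(&pm));
        return 0;
    }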
> One item I'd like to see some more data on is the level of
> reordering at the receiver in Oleg's system.
This is exactly why I asked Oleg to run some tests with balance-rr. I cannot find a good reason
for a possible new xmit_hash_policy to provide better throughput than the current balance-rr. If
the throughput increases by, let's say, less than 20%, whatever the tcp_reordering value, then it
is probably a dead end.
> One of the reasons round robin isn't as useful as it once was is
> due to the rise of NAPI and interrupt coalescing, both of which will
> tend to increase the reordering of packets at the receiver when the
> packets are evenly striped. In the old days, it was one interrupt, one
> packet. Now, it's one interrupt or NAPI poll, many packets. With the
> packets striped across interfaces, this will tend to increase
> reordering. E.g.,
>
>    slave 1      slave 2      slave 3
>    Packet 1     P2           P3
>    P4           P5           P6
>    P7           P8           P9
>
> and so on. A poll of slave 1 will get packets 1, 4 and 7 (and
> probably several more), then a poll of slave 2 will get 2, 5 and 8, etc.
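To make the effect concrete, a toy simulation of the above (the burst size of three is arbitrary;
real NAPI budgets are much larger):

    /* Packets striped 1:1 across three slaves arrive in per-interface bursts
     * when each interface is polled, so the receiver sees 1,4,7,... instead
     * of 1,2,3,...
     */
    #include <stdio.h>

    #define SLAVES 3
    #define PKTS   9

    int main(void)
    {
        int queue[SLAVES][PKTS], depth[SLAVES] = { 0 };

        /* Transmit side: strict round robin, one packet per slave. */
        for (int p = 1; p <= PKTS; p++) {
            int s = (p - 1) % SLAVES;
            queue[s][depth[s]++] = p;
        }

        /* Receive side: one poll drains a whole burst from each slave. */
        printf("delivery order:");
        for (int s = 0; s < SLAVES; s++)
            for (int i = 0; i < depth[s]; i++)
                printf(" %d", queue[s][i]);
        printf("\n");                         /* 1 4 7 2 5 8 3 6 9 */
        return 0;
    }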
Any chance to receive P1, P2, P3 on slave 1, P4, P5, P6 on slave 2 and P7, P8, P9 on slave 3,
possibly by sending grouped packets, changing the sending slave every N packets instead of every
packet? I think we already discussed this possibility a few months or years ago on the
bonding-devel ML. As far as I remember, the idea was not pursued because it was not easy to find
the right number of packets to send through the same slave. Anyway, this might help reduce
out-of-order delivery; see the sketch below.
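A sketch of that idea; choosing N is the open question, and the value below is arbitrary:

    /* Pick the transmit slave from a packet sequence number, changing slaves
     * every N packets instead of every packet.
     */
    #include <stdio.h>

    unsigned pick_slave(unsigned long seq, unsigned n_slaves, unsigned n_per_slave)
    {
        return (seq / n_per_slave) % n_slaves;
    }

    int main(void)
    {
        for (unsigned long seq = 0; seq < 9; seq++)   /* 3 slaves, N = 3 */
            printf("packet %lu -> slave %u\n", seq, pick_slave(seq, 3, 3));
        return 0;
    }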
> Barring evidence to the contrary, I presume that Oleg's system
> delivers out of order at the receiver. That's not automatically a
> reason to reject it, but this entire proposal is sufficiently complex to
> configure that very explicit documentation will be necessary.
Yes, and this is already true for some bonding modes and in particular for balance-rr.
Nicolas.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html