Message-ID: <24104.1343162975@death.nxdomain>
Date: Tue, 24 Jul 2012 13:49:35 -0700
From: Jay Vosburgh <fubar@...ibm.com>
To: Chris Friesen <chris.friesen@...band.com>
cc: Jiri Pirko <jiri@...nulli.us>, netdev <netdev@...r.kernel.org>,
andy@...yhouse.net
Subject: Re: bonding and SR-IOV -- do we need arp_validation for loadbalancing too?
Chris Friesen <chris.friesen@...band.com> wrote:
>On 07/24/2012 12:13 PM, Jay Vosburgh wrote:
>> Jiri Pirko <jiri@...nulli.us> wrote:
>>
>>> Tue, Jul 24, 2012 at 05:57:03PM CEST, chris.friesen@...band.com wrote:
>>>> Hi all,
>>>>
>>>> We've been starting to look at bonding VFs from separate physical
>>>> devices in a guest, but we've run into a problem.
>>>>
>>>> The host is bonding the corresponding PFs, and it uses arp
>>>> monitoring. What we have found is that any broadcast traffic from
>>>> the guest (if they enable arp monitoring, for example) will be seen
>>>> by the internal L2 switch of the NIC and sent up into the host, where
>>>> the bonding driver will count it as incoming traffic and use it to
>>>> mark the link as good.
>>>>
>>>> The only solutions I've been able to come up with are:
>>>> 1) add arp validation for load balancing modes as well as active-backup.
>>> This is my favourite... No reason not to turn arp validation on.
>>> TEAM device (teamd arpping linkwatch) does arp or NSNA validation
>>> always.
>> How does that operate for a load balancing mode?
>>
>> For arp validate to function (as it's implemented in bonding),
>> the arp requests (broadcasts) or the arp replies (unicasts) must be seen
>> by each slave at regular intervals. Most load balance systems
>> (etherchannel or 802.3ad, for example) don't flood the broadcast
>> requests to all members of a channel group, and the unicast replies only
>> go to one member.
>>
>> This generally results in either only one slave staying up, or
>> slaves going up and down at odd intervals. The arp monitor for the load
>> balance modes is already dependent upon there being a steady stream of
>> traffic to all slaves, and can be unreliable in low traffic conditions
>> (because not all slaves receive traffic with sufficient frequency).
>
>In loadbalance mode wouldn't it just work similarly to active-backup? If
>it's a reply then verify that it came from the arp target; if it's a
>request then check whether it came from one of the other slaves.
The problem isn't verifying the requests or replies; it's that
the ARP packets are not distributed across all slaves (because the
switch ports are in a channel group / aggregator), so some slaves do
not receive any ARPs.
The bond sends the ARP request as a broadcast. For
active-backup, this ends up at the inactive slaves because the switch
sends the broadcast to all ports. For a loadbalance mode, the switch
won't send the broadcast ARP to the other slaves, because all the slaves
are in a channel group or lacp aggregator, which is treated by the
switch as effectively a single switch port for this case.
Similarly, the ARP replies are unicast, and the switch will send
those unicast replies to only one member of the channel group or
aggregator. The choice there is usually a hash of some kind, so
generally only one slave will receive the replies.
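	For illustration only (the exact hash is up to the switch
vendor), the member selection is typically something like the layer 2
hash below, the same idea as bonding's own xmit_hash_policy=layer2:

/*
 * XOR the last octet of the source and destination MACs and take it
 * modulo the number of members.  Every reply for a given MAC pair
 * hashes to the same member, so only that one slave sees the replies.
 */
static unsigned int l2_hash_member(const unsigned char *src_mac,
				   const unsigned char *dst_mac,
				   unsigned int nmembers)
{
	return (src_mac[5] ^ dst_mac[5]) % nmembers;
}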
>In our case we have control over the L2 switches involved so we ensure
>that the broadcast arp request is sent to all the other slaves, while the
>reply comes back to the sender. I think we still have a window where you
>could have a device with a faulty tx but functional rx and never detect
>the problem in the monitor.
You can set up -xor or -rr mode against a switch without setting
up a channel group on the switch, but that has the downside that any
incoming broadcast or multicast packet may be received multiple times
(one copy per slave). Some switches will also disable ports (due to MAC
flapping) or complain about seeing the same MAC address on multiple
ports for this case. This also will not load balance incoming traffic
to the bond very well.
>On 07/24/2012 02:18 PM, Chris Friesen wrote:
>> A more general solution might be to have the device driver also track
>> the time of the last incoming packet that came from the external network
>> (rather than a VF) and have the bond driver ignore those packets for
>> the purpose of link health. Doing this efficiently would likely require
>> some kind of hardware support though--as an example the 82599 seems to
>> support this with the "LB" bit in the rx descriptor.
>
>That should of course be reversed. We want the bond driver to only use
>the packets from the external network for the purpose of link health.
>
>Does anyone other than bonding actually care about dev->last_rx? If not
>then we could just change the drivers to only set it for external packets.
I believe bonding is the main user of last_rx (a search shows a
couple of drivers using it internally). For bonding use, in current
mainline last_rx is set by bonding itself, not in the network device
driver.
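	If a driver did want to expose what you describe, I'd expect it
to look roughly like this sketch (the names here are invented for
illustration; on the 82599 the distinction would come from the LB bit
in the rx descriptor status):

struct rx_desc_status {
	unsigned int loopback:1;	/* set for frames switched back from a VF */
};

struct dev_rx_times {
	unsigned long last_rx;		/* any frame, what is tracked today */
	unsigned long last_external_rx;	/* hypothetical: wire traffic only */
};

/* Called from the driver's rx clean path for each received frame. */
static void note_rx(struct dev_rx_times *t, const struct rx_desc_status *st,
		    unsigned long now)
{
	t->last_rx = now;
	if (!st->loopback)
		t->last_external_rx = now;	/* only count external traffic */
}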
-J
---
-Jay Vosburgh, IBM Linux Technology Center, fubar@...ibm.com