linux-kernel - Re: [PATCH] net/bonding: send arp in interval if no active slave

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <55E97AC6.6030106@redknee.com>
Date:	Fri, 4 Sep 2015 13:04:38 +0200
From:	Uwe Koziolek <uwe.koziolek@...knee.com>
To:	Jay Vosburgh <jay.vosburgh@...onical.com>
CC:	Andy Gospodarek <gospo@...ulusnetworks.com>,
	Jarod Wilson <jarod@...hat.com>,
	Veaceslav Falico <vfalico@...il.com>,
	<linux-kernel@...r.kernel.org>, <netdev@...r.kernel.org>
Subject: Re: [PATCH] net/bonding: send arp in interval if no active slave

Am 03.09.2015 um 17:05 schrieb Jay Vosburgh:
> Uwe Koziolek <uwe.koziolek@...knee.com> wrote:
>
>> On Tue, Sep 01, 2015 at 05:41 PM +0200, Andy Gospodarek wrote:
>>> On Mon, Aug 17, 2015 at 10:51:27PM +0200, Uwe Koziolek wrote:
>>>> On Mon, Aug 17, 2015 at 09:14PM +0200, Jay Vosburgh wrote:
>>>>> Uwe Koziolek <uwe.koziolek@...knee.com> wrote:
>>>>>
>>>>>> On2015-08-17 07:12 PM,Jarod Wilson wrote:
>>>>>>> On 2015-08-17 12:55 PM, Veaceslav Falico wrote:
>>>>>>>> On Mon, Aug 17, 2015 at 12:23:03PM -0400, Jarod Wilson wrote:
>>>>>>>>> From: Uwe Koziolek <uwe.koziolek@...knee.com>
>>>>>>>>>
>>>>>>>>> With some very finicky switch hardware, active backup bonding can get
>>>>>>>>> into
>>>>>>>>> a situation where we play ping-pong between interfaces, trying to get
>>>>>>>>> one
>>>>>>>>> to come up as the active slave. There seems to be an issue with the
>>>>>>>>> switch's arp replies either taking too long, or simply getting lost,
>>>>>>>>> so we
>>>>>>>>> wind up unable to get any interface up and active. Sometimes, the issue
>>>>>>>>> sorts itself out after a while, sometimes it doesn't.
>>>>>>>>>
>>>>>>>>> Testing with num_grat_arp has proven fruitless, but sending an
>>>>>>>>> additional
>>>>>>>>> arp on curr_arp_slave if we're still in the arp_interval timeslice in
>>>>>>>>> bond_ab_arp_probe(), has shown to produce 100% reliability in testing
>>>>>>>>> with
>>>>>>>>> this hardware combination.
>>>>>>>> Sorry, I don't understand the logic of why it works, and what exactly
>>>>>>>> are
>>>>>>>> we fixiing here.
>>>>>>>>
>>>>>>>> It also breaks completely the logic for link state management in case
>>>>>>>> of no
>>>>>>>> current active slave for 2*arp_interval.
>>>>>>>>
>>>>>>>> Could you please elaborate what exactly is fixed here, and how it
>>>>>>>> works? :)
>>>>>>> I can either duplicate some information from the bug, or Uwe can, to
>>>>>>> illustrate the exact nature of the problem.
>>>>>>>
>>>>>>>> p.s. num_grat_arp maybe could help?
>>>>>>> That was my thought as well, but as I understand it, that route was
>>>>>>> explored, and it didn't help any. I don't actually have a reproducer
>>>>>>> setup of my own, unfortunately, so I'm kind of caught in the middle
>>>>>>> here...
>>>>>>>
>>>>>>> Uwe, can you perhaps further enlighten us as to what num_grat_arp
>>>>>>> settings were tried that didn't help? I'm still of the mind that if
>>>>>>> num_grat_arp *didn't* help, we probably need to do something keyed off
>>>>>>> num_grat_arp.
>>>>>> The bonding slaves are connected to high available switches, each of the
>>>>>> slaves is connected to a different switch. If the bond is starting, only
>>>>>> the selected slave sends one arp-request. If a matching arp_response was
>>>>>> received, this slave and the bond is going into state up, sending the
>>>>>> gratitious arps...
>>>>>> But if you got no arp reply the next slave was selected.
>>>>>> With most of the newer switches, not overloaded, or with other software
>>>>>> bugs, or with a single switch configuration, you would get a arp response
>>>>>> on the first arp request.
>>>>>> But in case of high availability configuration with non perfect switches
>>>>>> like HP ProCurve 54xx, also with some Cisco models, you may not get a
>>>>>> response on the first arp request.
>>>>>>
>>>>>> I have seen network snoops, there the switches are not responding to the
>>>>>> first arp request on slave 1, the second arp request was sent on slave 2
>>>>>> but the response was received on slave one,  and all following arp
>>>>>> requests are anwsered on the wrong slave for a longer time.
>>>>> 	Could you elaborate on the exact "high availability
>>>>> configuration" here, including the model(s) of switch(es) involved?
>>>>>
>>>>> 	Is this some kind of race between the switch or switches
>>>>> updating the forwarding tables and the bond flip flopping between the
>>>>> slaves?  E.g., source MAC from ARP sent on slave 1 is used to populate
>>>>> the forwarding table, but (for whatever reason) there is no reply.  ARP
>>>>> on slave 2 is sent (using the same source MAC, unless you set
>>>>> fail_over_mac), but forwarding tables still send that MAC to slave 1, so
>>>>> reply is sent there.
>>>> High availability:
>>>> 2 managed switches with routing capabilities have an interconnect.
>>>> One slave of a bonding interface is connected to the first switch, the
>>>> second slave is connected to the other switch.
>>>> The switch models are HP ProCurve 5406 and HP ProCurve 5412. As far as i
>>>> remember also HP E 3500 and  E 3800 are also
>>>> affected, for the affected Cisco models I can't answer today.
>>>> Affected single switch configurations was not seen.
>>>>
>>>> Yes, race conditions with delayed upgrades of the forwarding tables is a
>>>> well matching explanation for the problem.
>>>>
>>>>>> The proposed change sents up to 3 arp requests on a down bond using the
>>>>>> same slave, delayed by arp_interval.
>>>>>> Using problematic switches i have seen the the arp response on the right
>>>>>> slave at latest on the second arp request. So the bond is going into state
>>>>>> up.
>>>>>>
>>>>>> How does it works:
>>>>>> The bonds in up state are handled on the beginning of bond_ab_arp_probe
>>>>>> procedure, the other part of this procedure is handling the slave change.
>>>>>> The proposed change is bypassing the slave change for 2 additional calls
>>>>>> of bond_ab_arp_probe.
>>>>>> Now the retries are not only for an up bond available, they are also
>>>>>> implemented for a down bond.
>>>>> 	Does this delay failover or bringup on switches that are not
>>>>> "problematic"?  I.e., if arp_interval is, say, 1000 (1 second), will
>>>>> this impact failover / recovery times?
>>>>>
>>>>> 	-J
>>>> It depends.
>>>> failover times are not impacted, this is handled different.
>>>> Only the transition from a down bonding interface (bond and all slaves are
>>>> down) to the state up can be increased by up to 2 times arp_interval,
>>>> If the selected interface did not came up .If well working switches are
>>>> used, and everything other is also ok, there are no impacts.
>>> So I'm not a huge fan of workarounds like these, but I also understand
>>> from a practical standpoint that this is useful.  My only issue with the
>>> patch would be to please include a small comment (1-2 lines) in the code
>>> that describes the behavior.  I know we have the changelog entries for
>>> this, but I would feel better about having an exception like this in the
>>> code for those reading it and wondering:
>>>
>>> "Why would we wait 2 intervals before failing over to the next interface
>>> when there are no active interfaces?"
>>>
>>
>> diff -up a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>> --- a/drivers/net/bonding/bond_main.c   2015-08-30 20:34:09.000000000 +0200
>> +++ b/drivers/net/bonding/bond_main.c   2015-09-02 00:39:10.000298202 +0200
>> @@ -2795,6 +2795,16 @@ static bool bond_ab_arp_probe(struct bon
>>                         return should_notify_rtnl;
>>         }
>>
>> +       /* sometimes the forwarding tables of the switches are not updated fast enough
>> +        * the first arp response after a slave change is received on the wrong slave.
>> +        * the arp requests will be retried 2 times on the same slave
>> +        */
>> +
>> +       if (bond_time_in_interval(bond, curr_arp_slave->last_link_up, 2)) {
>> +               bond_arp_send_all(bond, curr_arp_slave);
>> +               return should_notify_rtnl;
>> +       }
>> +
>
> 	I probably should have asked this in the beginning, but at what
> range of arp_interval values does the problem manifest?  If it's a race
> condition with the switch update, I'd expect that only very small
> arp_interval values would be affected.
>
> 	Also, your proposed comment wraps past 80 columns.
>
> 	-J
>
Only 500 msecs arp interval is used, no other values are checked.
Wraps in patch are now removed.

diff -up ./drivers/net/bonding/bond_main.c.orig ./drivers/net/bonding/bond_main.c
--- ./drivers/net/bonding/bond_main.c.orig	2015-08-30 20:34:09.000000000 +0200
+++ ./drivers/net/bonding/bond_main.c	2015-09-04 11:59:05.755897182 +0200
@@ -2795,6 +2795,17 @@ static bool bond_ab_arp_probe(struct bon
  			return should_notify_rtnl;
  	}

+	/* sometimes the forwarding tables of the switches are not updated
+	 * fast enough. the first arp response after a slave change is received
+	 * on the wrong slave.
+	 * the arp requests will be retried 2 times on the same slave
+	 */
+
+	if (bond_time_in_interval(bond, curr_arp_slave->last_link_up, 2)) {
+		bond_arp_send_all(bond, curr_arp_slave);
+		return should_notify_rtnl;
+	}
+
  	bond_set_slave_inactive_flags(curr_arp_slave, BOND_SLAVE_NOTIFY_LATER);

  	bond_for_each_slave_rcu(bond, slave, iter) {

>
>>         bond_set_slave_inactive_flags(curr_arp_slave, BOND_SLAVE_NOTIFY_LATER);
>>
>>         bond_for_each_slave_rcu(bond, slave, iter) {
>>
>>>>>> The num_grat_arp has no chance to solve the problem. The num_grat_arp is
>>>>>> only used, if a different slave is going active.
>>>>>> But in our case, the bonding slaves are not going into the state active
>>>>>> for a longer time.
>>>>>>>>> [jarod: manufacturing of changelog]
>>>>>>>>> CC: Jay Vosburgh <j.vosburgh@...il.com>
>>>>>>>>> CC: Veaceslav Falico <vfalico@...il.com>
>>>>>>>>> CC: Andy Gospodarek <gospo@...ulusnetworks.com>
>>>>>>>>> CC: netdev@...r.kernel.org
>>>>>>>>> Signed-off-by: Uwe Koziolek <uwe.koziolek@...knee.com>
>>>>>>>>> Signed-off-by: Jarod Wilson <jarod@...hat.com>
>>>>>>>>> ---
>>>>>>>>> drivers/net/bonding/bond_main.c | 5 +++++
>>>>>>>>> 1 file changed, 5 insertions(+)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/net/bonding/bond_main.c
>>>>>>>>> b/drivers/net/bonding/bond_main.c
>>>>>>>>> index 0c627b4..60b9483 100644
>>>>>>>>> --- a/drivers/net/bonding/bond_main.c
>>>>>>>>> +++ b/drivers/net/bonding/bond_main.c
>>>>>>>>> @@ -2794,6 +2794,11 @@ static bool bond_ab_arp_probe(struct bonding
>>>>>>>>> *bond)
>>>>>>>>>               return should_notify_rtnl;
>>>>>>>>>       }
>>>>>>>>>
>>>>>>>>> +    if (bond_time_in_interval(bond, curr_arp_slave->last_link_up, 2))
>>>>>>>>> {
>>>>>>>>> +        bond_arp_send_all(bond, curr_arp_slave);
>>>>>>>>> +        return should_notify_rtnl;
>>>>>>>>> +    }
>>>>>>>>> +
>>>>>>>>>       bond_set_slave_inactive_flags(curr_arp_slave,
>>>>>>>>> BOND_SLAVE_NOTIFY_LATER);
>>>>>>>>>
>>>>>>>>>       bond_for_each_slave_rcu(bond, slave, iter) {
>>>>>>>>> --
>>>>>>>>> 1.8.3.1
>
> ---
> 	-Jay Vosburgh, jay.vosburgh@...onical.com
>

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/