Message-ID: <CAKdSkDVXe5jRgWU8WAvx43KvEuZvS4ak+a9qd9H4C3zy+OvkPQ@mail.gmail.com>
Date: Tue, 28 Jun 2016 14:57:37 +0300
From: Veli-Matti Lintu <veli-matti.lintu@...nsys.fi>
To: Jay Vosburgh <jay.vosburgh@...onical.com>
Cc: netdev <netdev@...r.kernel.org>,
Veaceslav Falico <vfalico@...il.com>,
Andy Gospodarek <gospo@...ulusnetworks.com>,
zhuyj <zyjzyj2000@...il.com>,
"David S. Miller" <davem@...emloft.net>
Subject: Re: [PATCH net] bonding: fix 802.3ad aggregator reselection
2016-06-24 0:20 GMT+03:00 Jay Vosburgh <jay.vosburgh@...onical.com>:
>
> Since commit 7bb11dc9f59d ("bonding: unify all places where
> actor-oper key needs to be updated."), the logic in bonding to handle
> selection between multiple aggregators has not functioned.
>
> This affects only configurations wherein the bonding slaves
> connect to two discrete aggregators (e.g., two independent switches, each
> with LACP enabled), thus creating two separate aggregation groups within a
> single bond.
>
> The cause is a change in 7bb11dc9f59d to no longer set
> AD_PORT_BEGIN on a port after a link state change, which would cause the
> port to be reselected for attachment to an aggregator as if it were newly
> added to the bond. We cannot restore the prior behavior, as it
> contradicts IEEE 802.1AX 5.4.12, which requires ports that "become
> inoperable" (lose carrier, setting port_enabled=false as per 802.1AX
> 5.4.7) to remain selected (i.e., assigned to the aggregator). As the port
> now remains selected, the aggregator selection logic is not invoked.
>
> A side effect of this change is that aggregators in bonding will
> now contain ports that are link down. The aggregator selection logic
> does not currently handle this situation correctly, causing incorrect
> aggregator selection.
>
> This patch makes two changes to repair the aggregator selection
> logic in bonding to function as documented and within the confines of the
> standard:
>
> First, the aggregator selection and related logic now utilizes the
> number of active ports per aggregator, not the number of selected ports
> (as some selected ports may be down). The ad_select "bandwidth" and
> "count" options only consider ports that are link up.
>
> Second, on any carrier state change of any slave, the aggregator
> selection logic is explicitly called to ensure the correct aggregator is
> active.
>
> Reported-by: Veli-Matti Lintu <veli-matti.lintu@...nsys.fi>
> Fixes: 7bb11dc9f59d ("bonding: unify all places where actor-oper key needs to be updated.")
> Signed-off-by: Jay Vosburgh <jay.vosburgh@...onical.com>
Hi,

Thanks for the patch. I have now been testing it, and reselection
seems to work in most cases, but I hit one case that consistently
fails in my test environment.

I've been doing most of my testing with ad_select=count, and this
happens with it. I haven't yet done extensive testing with
ad_select=stable/bandwidth.

The sequence to trigger the failure seems to be:

  Switch A (Agg ID 2)         Switch B (Agg ID 1)
enp5s0f0  ens5f0  ens6f0    enp5s0f1  ens5f1  ens6f1
   X        X       -          X        -       -     Connection works (Agg ID 2 active)
   X        -       -          X        -       -     Connection works (Agg ID 1 active)
   X        -       -          -        -       -     No connection (Agg ID 2 active)

I'm also wondering why a link-down event causes a change of aggregator
when the active aggregator has the same number of active links as
the new aggregator.
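
To make the question concrete, here is a minimal sketch of the tie
handling I would have expected from a count-style policy. It is not
taken from bond_3ad.c; struct agg_model and pick_by_count() are
hypothetical names for illustration only, and "keep the current
aggregator on a tie" is my assumption about the intended behaviour.

```c
#include <stddef.h>

/* Hypothetical model of an aggregator; not the kernel's struct aggregator. */
struct agg_model {
	int id;           /* aggregator ID as shown in /proc/net/bonding/bond0 */
	int active_ports; /* ports that are selected and have link up */
};

/*
 * Choose the aggregator that should be active under a count-style policy.
 * Assumption: the current aggregator is kept unless a candidate has
 * strictly more active ports, or the current one has no active ports left.
 * Returns NULL when no aggregator has any active port.
 */
static const struct agg_model *
pick_by_count(const struct agg_model *aggs, size_t n,
	      const struct agg_model *current)
{
	const struct agg_model *best = NULL;
	size_t i;

	if (current && current->active_ports > 0)
		best = current;

	for (i = 0; i < n; i++) {
		if (aggs[i].active_ports == 0)
			continue; /* an aggregator with no link up is never chosen */
		if (!best || aggs[i].active_ports > best->active_ports)
			best = &aggs[i];
	}
	return best;
}
```

With one active port on each switch and Agg ID 2 currently active, this
sketch keeps Agg ID 2, which is what I expected to see in the second
step of the sequence above.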

The situation here clears once a port comes up. I also once hit
problems without disabling all ports on the active switch:

  Switch A (Agg ID 2)         Switch B (Agg ID 1)
enp5s0f0  ens5f0  ens6f0    enp5s0f1  ens5f1  ens6f1
   X        X       -          X        -       -     Connection works (Agg ID 2 active)
   X        -       -          X        -       -     No connection (Agg ID 1 active)

The active switch does not seem to matter either when disabling ports:

  Switch A (Agg ID 2)         Switch B (Agg ID 1)
enp5s0f0  ens5f0  ens6f0    enp5s0f1  ens5f1  ens6f1
   X        -       -          X        X       X     Connection works (Agg ID 1 active)
   X        -       -          X        -       -     Connection works (Agg ID 2 active)
   -        -       -          X        -       -     No connection (Agg ID 1 active)

All testing was done with upstream version 4.6.2 with the patch
applied. When there is no connection, /proc/net/bonding/bond0 still
shows an active aggregator that has a link up, but for some reason no
traffic passes through. I added some debugging information in
bond_procfs.c, and the number of active ports seems to match the
enabled ports on the switches.
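
For anyone who wants to cross-check this without patching the kernel,
the following is a rough userspace sketch that counts link-up slaves per
aggregator from the text of /proc/net/bonding/bond0. It assumes the
usual 802.3ad layout, where each "Slave Interface:" section contains an
"MII Status:" line followed by an "Aggregator ID:" line. The function
name count_up_ports() and the MAX_AGG_ID bound are made up for this
example.

```c
#include <stdio.h>
#include <string.h>

#define MAX_AGG_ID 64 /* arbitrary bound chosen for this sketch */

/* Return non-zero if string s begins with the given prefix. */
static int starts_with(const char *s, const char *prefix)
{
	return strncmp(s, prefix, strlen(prefix)) == 0;
}

/*
 * Count slaves reporting "MII Status: up" per aggregator ID.
 * text:   contents of /proc/net/bonding/bond0 (NUL-terminated)
 * counts: caller-provided array of MAX_AGG_ID ints, zeroed here
 * Lines before the first "Slave Interface:" section (bond-level status,
 * "Active Aggregator Info") are ignored.
 */
static void count_up_ports(const char *text, int *counts)
{
	int in_slave = 0, link_up = 0;

	memset(counts, 0, sizeof(int) * MAX_AGG_ID);

	while (*text) {
		char line[256];
		size_t len = strcspn(text, "\n");
		size_t n = len < sizeof(line) - 1 ? len : sizeof(line) - 1;
		const char *p;
		int id;

		memcpy(line, text, n);
		line[n] = '\0';
		p = line + strspn(line, " \t"); /* skip indentation */

		if (starts_with(p, "Slave Interface:")) {
			in_slave = 1;
			link_up = 0;
		} else if (in_slave && starts_with(p, "MII Status:")) {
			link_up = starts_with(p, "MII Status: up");
		} else if (in_slave &&
			   sscanf(p, "Aggregator ID: %d", &id) == 1) {
			if (link_up && id >= 0 && id < MAX_AGG_ID)
				counts[id]++;
		}

		text += len;
		if (*text == '\n')
			text++;
	}
}
```

Reading the file into a buffer and passing it to count_up_ports() gives
a per-aggregator count that can be compared against the ports enabled
on each switch.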
I'll continue doing tests with different scenarios and I can also test
specific cases if needed.
Veli-Matti