netdev - Re: [PATCH net] bonding: fix 802.3ad aggregator reselection

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAKdSkDVXe5jRgWU8WAvx43KvEuZvS4ak+a9qd9H4C3zy+OvkPQ@mail.gmail.com>
Date:	Tue, 28 Jun 2016 14:57:37 +0300
From:	Veli-Matti Lintu <veli-matti.lintu@...nsys.fi>
To:	Jay Vosburgh <jay.vosburgh@...onical.com>
Cc:	netdev <netdev@...r.kernel.org>,
	Veaceslav Falico <vfalico@...il.com>,
	Andy Gospodarek <gospo@...ulusnetworks.com>,
	zhuyj <zyjzyj2000@...il.com>,
	"David S. Miller" <davem@...emloft.net>
Subject: Re: [PATCH net] bonding: fix 802.3ad aggregator reselection

2016-06-24 0:20 GMT+03:00 Jay Vosburgh <jay.vosburgh@...onical.com>:
>
>         Since commit 7bb11dc9f59d ("bonding: unify all places where
> actor-oper key needs to be updated."), the logic in bonding to handle
> selection between multiple aggregators has not functioned.
>
>         This affects only configurations wherein the bonding slaves
> connect to two discrete aggregators (e.g., two independent switches, each
> with LACP enabled), thus creating two separate aggregation groups within a
> single bond.
>
>         The cause is a change in 7bb11dc9f59d to no longer set
> AD_PORT_BEGIN on a port after a link state change, which would cause the
> port to be reselected for attachment to an aggregator as if were newly
> added to the bond.  We cannot restore the prior behavior, as it
> contradicts IEEE 802.1AX 5.4.12, which requires ports that "become
> inoperable" (lose carrier, setting port_enabled=false as per 802.1AX
> 5.4.7) to remain selected (i.e., assigned to the aggregator).  As the port
> now remains selected, the aggregator selection logic is not invoked.
>
>         A side effect of this change is that aggregators in bonding will
> now contain ports that are link down.  The aggregator selection logic
> does not currently handle this situation correctly, causing incorrect
> aggregator selection.
>
>         This patch makes two changes to repair the aggregator selection
> logic in bonding to function as documented and within the confines of the
> standard:
>
>         First, the aggregator selection and related logic now utilizes the
> number of active ports per aggregator, not the number of selected ports
> (as some selected ports may be down).  The ad_select "bandwidth" and
> "count" options only consider ports that are link up.
>
>         Second, on any carrier state change of any slave, the aggregator
> selection logic is explicitly called to insure the correct aggregator is
> active.
>
> Reported-by: Veli-Matti Lintu <veli-matti.lintu@...nsys.fi>
> Fixes: 7bb11dc9f59d ("bonding: unify all places where actor-oper key needs to be updated.")
> Signed-off-by: Jay Vosburgh <jay.vosburgh@...onical.com>


Hi,

Thanks for the patch. I have been now testing it and the reselection
seems to be working now in most cases, but I hit one case that seems
to consistently fail in my test environment.

I've been doing most of testing with ad_select=count and this happens
with it. I haven't yet done extensive testing with
ad_select=stable/bandwidth.

The sequence to trigger the failure seems to be:

  Switch A (Agg ID 2)       Switch B (Agg ID 1)
enp5s0f0 ens5f0 ens6f0    enp5s0f1 ens5f1 ens6f1
    X       X      -           X      -       -     Connection works
(Agg ID 2 active)
    X       -      -           X      -       -     Connection works
(Agg ID 1 active)
    X       -      -           -      -       -     No connection (Agg
ID 2 active)

I'm also wondering why link down event causes change of aggregator
when the active aggregator has the same number of active links than
the new aggregator.

The situation here clears once a port comes up. Once I hit also
problems without disabling all ports on active switch:

  Switch A (Agg ID 2)       Switch B (Agg ID 1)
enp5s0f0 ens5f0 ens6f0    enp5s0f1 ens5f1 ens6f1
    X       X      -           X      -       -     Connection works
(Agg ID 2 active)
    X       -      -           X      -       -     No connection (Agg
ID 1 active)

The active switch does not seem to matter either when disabling ports:

  Switch A (Agg ID 2)       Switch B (Agg ID 1)
enp5s0f0 ens5f0 ens6f0    enp5s0f1 ens5f1 ens6f1
    X       -      -           X      X       X     Connection works
(Agg ID 1 active)
    X       -      -           X      -       -     Connection works
(Agg ID 2 active)
    -       -      -           X      -       -     No connection (Agg
ID 1 active)

All testing was done with upstream version 4.6.2 with the patch
applied. When there is no connection, /proc/net/bonding/bond0 still
shows that there is an active aggregator that has a link up, but for
some reason no traffic passes through. I added some debugging
information in bond_procfs.c and the number of active ports seems to
match the enabled ports on switches.

I'll continue doing tests with different scenarios and I can also test
specific cases if needed.


Veli-Matti