netdev - Re: Fwd: [PATCH] bonding/802.3ad: fix slave initialization states race

Message-ID: <CAJYOGF-84BfK8DvAnam9+tgfo4=oBs04zF-ETWRfhz7CE_9oBA@mail.gmail.com>
Date:   Thu, 26 Sep 2019 17:25:36 +0300
From:   Aleksei Zakharov <zaharov@...ectel.ru>
To:     Jay Vosburgh <jay.vosburgh@...onical.com>
Cc:     netdev@...r.kernel.org, "zhangsha (A)" <zhangsha.zhang@...wei.com>
Subject: Re: Fwd: [PATCH] bonding/802.3ad: fix slave initialization states race

On Thu, 26 Sep 2019 at 07:38, Jay Vosburgh <jay.vosburgh@...onical.com> wrote:
>
> Aleksei Zakharov <zaharov@...ectel.ru> wrote:
>
> >On Wed, 25 Sep 2019 at 03:31, Jay Vosburgh <jay.vosburgh@...onical.com> wrote:
> >>
> >> Алексей Захаров wrote:
> >> [...]
> >> >Right after reboot one of the slaves hangs with actor port state 71
> >> >and partner port state 1.
> >> >It doesn't send lacpdu and seems to be broken.
> >> >Setting link down and up again fixes slave state.
> >> [...]
> >>
> >>         I think I see what failed in the first patch, could you test the
> >> following patch?  This one is for net-next, so you'd need to again swap
> >> slave_err / netdev_err for the Ubuntu 4.15 kernel.
> >>
> >I've tested new patch. It seems to work. I can't reproduce the bug
> >with this patch.
> >There are two types of messages when link becomes up:
> >First:
> >bond-san: EVENT 1 llu 4294895911 slave eth2
> >8021q: adding VLAN 0 to HW filter on device eth2
> >bond-san: link status definitely down for interface eth2, disabling it
> >mlx4_en: eth2: Link Up
> >bond-san: EVENT 4 llu 4294895911 slave eth2
> >bond-san: link status up for interface eth2, enabling it in 500 ms
> >bond-san: invalid new link 3 on slave eth2
> >bond-san: link status definitely up for interface eth2, 10000 Mbps full duplex
> >Second:
> >bond-san: EVENT 1 llu 4295147594 slave eth2
> >8021q: adding VLAN 0 to HW filter on device eth2
> >mlx4_en: eth2: Link Up
> >bond-san: EVENT 4 llu 4295147594 slave eth2
> >bond-san: link status up again after 0 ms for interface eth2
> >bond-san: link status definitely up for interface eth2, 10000 Mbps full duplex
> > [...]
>
>         The "invalid new link" is appearing because bond_miimon_commit
> is being asked to commit a new state that isn't UP or DOWN (3 is
> BOND_LINK_BACK).  I looked through the patched code today, and I don't
> see a way to get to that message with the new link set to 3, so I'll add
> some instrumentation and send out another patch to figure out what's
> going on, as that shouldn't happen.
>
>         I don't see the "invalid" message testing locally, I think
> because my network device doesn't transition to carrier up as quickly as
> yours.  I thought you were getting BOND_LINK_BACK passed through from
> bond_enslave (which calls bond_set_slave_link_state, which will set
> link_new_link to BOND_LINK_BACK and leave it there), but the
> link_new_link is reset first thing in bond_miimon_inspect, so I'm not
> sure how it gets into bond_miimon_commit (I'm thinking perhaps a
> concurrent commit triggered by another slave, which then picks up this
> proposed link state change by happenstance).
I assume the "invalid new link" message arises like this:
Interface goes up
NETDEV_CHANGE event occurs
bond_update_speed_duplex fails and slave->last_link_up evaluates as true
slave->link becomes BOND_LINK_FAIL
bond_check_dev_link returns 0
miimon proposes slave->link_new_state BOND_LINK_DOWN
NETDEV_UP event occurs
miimon sets commit++
miimon proposes slave->link_new_state BOND_LINK_BACK
miimon sets slave->link to BOND_LINK_BACK
we have updelay configured, so it doesn't set BOND_LINK_UP in the next
  case section
miimon says "invalid new link" and sets link state UP during the next
  inspection (after updelay, I suppose)
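
The last step above relies on the commit stage accepting only UP or DOWN as a new state, as Jay describes for bond_miimon_commit. Below is a minimal sketch of that check. It is a toy model, not the kernel code: the enum, the function toy_commit_message, and the numeric values other than 3 for BACK are assumptions made for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Toy link states. Only BACK == 3 is confirmed in this thread; the other
 * numeric values are assumed for illustration. */
enum toy_link {
    TOY_LINK_UP = 0,
    TOY_LINK_FAIL = 1,
    TOY_LINK_DOWN = 2,
    TOY_LINK_BACK = 3
};

/* Hypothetical stand-in for the commit stage: UP and DOWN are accepted,
 * anything else yields the "invalid new link" message.
 * Writes the message into buf; returns 1 if accepted, 0 otherwise. */
int toy_commit_message(enum toy_link new_link, const char *slave,
                       char *buf, size_t len)
{
    assert(slave != NULL && buf != NULL && len > 0);

    switch (new_link) {
    case TOY_LINK_UP:
        snprintf(buf, len, "link status definitely up for interface %s", slave);
        return 1;
    case TOY_LINK_DOWN:
        snprintf(buf, len, "link status definitely down for interface %s", slave);
        return 1;
    default:
        /* FAIL or BACK reaching the commit stage is unexpected. */
        snprintf(buf, len, "invalid new link %d on slave %s",
                 (int)new_link, slave);
        return 0;
    }
}
```

Calling toy_commit_message(TOY_LINK_BACK, "eth2", buf, sizeof buf) fills buf with the same text as the log line seen in the first trace.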

For the second type of message it looks like this:
Interface goes up
NETDEV_CHANGE event occurs
bond_update_speed_duplex fails and slave->last_link_up evaluates as true
slave->link becomes BOND_LINK_FAIL
NETDEV_UP event occurs
bond_check_dev_link returns 1
miimon proposes slave->link_new_state BOND_LINK_UP and says "link
  status up again"
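
The two traces differ only in what the carrier check returns at the moment miimon inspects the slave. Here is a rough sketch of that dependence, covering just the FAIL and DOWN states mentioned in this message. The function toy_propose and the enum are hypothetical, and the delay handling (updelay/downdelay) is deliberately left out.

```c
#include <assert.h>

/* Toy link states. Only BACK == 3 is confirmed in this thread; the other
 * numeric values are assumed for illustration. */
enum toy_link {
    TOY_LINK_UP = 0,
    TOY_LINK_FAIL = 1,
    TOY_LINK_DOWN = 2,
    TOY_LINK_BACK = 3
};

/* Hypothetical model of one inspection step, without any delay counters.
 * carrier_ok plays the role of the bond_check_dev_link result (0 or 1). */
enum toy_link toy_propose(enum toy_link current, int carrier_ok)
{
    assert(carrier_ok == 0 || carrier_ok == 1);

    switch (current) {
    case TOY_LINK_FAIL:
        /* Carrier present: "link status up again". Otherwise go down. */
        return carrier_ok ? TOY_LINK_UP : TOY_LINK_DOWN;
    case TOY_LINK_DOWN:
        /* Carrier present: start coming back; updelay applies from here. */
        return carrier_ok ? TOY_LINK_BACK : TOY_LINK_DOWN;
    default:
        /* UP and BACK are not modeled in this sketch. */
        return current;
    }
}
```

In this model the first trace corresponds to toy_propose(TOY_LINK_FAIL, 0) followed by toy_propose(TOY_LINK_DOWN, 1), and the second trace to a single toy_propose(TOY_LINK_FAIL, 1).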

My first patch changed the slave->last_link_up check to
(slave->link == BOND_LINK_UP).
This check looks more consistent to me, but I might be wrong here.
As a result, if the link was in BOND_LINK_FAIL or BOND_LINK_BACK when a
CHANGE or UP event arrived, it became BOND_LINK_DOWN.
But if it was initially UP and bond_update_speed_duplex was unable to
get speed/duplex, the link became BOND_LINK_FAIL.
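
To make the difference concrete, here is a small sketch comparing the two checks. It is a simplified model, not the actual event handler: struct toy_slave and both functions are hypothetical. The DOWN branch of the original check and treating last_link_up as a timestamp that is nonzero once set are both assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy link states. Only BACK == 3 is confirmed in this thread; the other
 * numeric values are assumed for illustration. */
enum toy_link {
    TOY_LINK_UP = 0,
    TOY_LINK_FAIL = 1,
    TOY_LINK_DOWN = 2,
    TOY_LINK_BACK = 3
};

struct toy_slave {
    enum toy_link link;
    unsigned long last_link_up; /* assumed: timestamp, nonzero once set */
};

/* Original check as described above: on a failed speed/duplex update the
 * slave goes to FAIL if last_link_up is set, otherwise DOWN (assumed). */
void toy_event_original(struct toy_slave *s, bool speed_duplex_failed)
{
    assert(s != NULL);
    if (!speed_duplex_failed)
        return;
    s->link = s->last_link_up ? TOY_LINK_FAIL : TOY_LINK_DOWN;
}

/* First patch: test the current link state instead of last_link_up. */
void toy_event_first_patch(struct toy_slave *s, bool speed_duplex_failed)
{
    assert(s != NULL);
    if (!speed_duplex_failed)
        return;
    s->link = (s->link == TOY_LINK_UP) ? TOY_LINK_FAIL : TOY_LINK_DOWN;
}
```

With a slave in BACK and last_link_up already set, the original check yields FAIL while the first patch yields DOWN. A slave that is UP goes to FAIL in both versions.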

I don't understand a few things here:
How can the link be in a different state from one time to the next
during the first NETDEV_* event?
And why is slave->last_link_up set when the first NETDEV event occurs?

I hope I didn't mess things up too much here.

-- 
Best Regards,
Aleksei Zakharov