Message-ID: <CAKfmpSfnD6-7UmaMCN10xvhZq1cbqosRQd9S6H_xjT=Oqi41JA@mail.gmail.com>
Date: Fri, 18 Dec 2020 18:48:32 -0500
From: Jarod Wilson <jarod@...hat.com>
To: Jay Vosburgh <jay.vosburgh@...onical.com>
Cc: LKML <linux-kernel@...r.kernel.org>,
Mahesh Bandewar <maheshb@...gle.com>,
Veaceslav Falico <vfalico@...il.com>,
Andy Gospodarek <andy@...yhouse.net>,
"David S. Miller" <davem@...emloft.net>,
Jakub Kicinski <kuba@...nel.org>,
Thomas Davis <tadavis@....gov>, Netdev <netdev@...r.kernel.org>
Subject: Re: [PATCH net] bonding: reduce rtnl lock contention in mii monitor thread
On Tue, Dec 8, 2020 at 3:35 PM Jay Vosburgh <jay.vosburgh@...onical.com> wrote:
>
> Jarod Wilson <jarod@...hat.com> wrote:
...
> >The addition of a case BOND_LINK_BACK in bond_miimon_commit() is somewhat
> >separate from the fix for the actual hang, but it eliminates a constant
> >"invalid new link 3 on slave" message seen related to this issue, and it's
> >not actually an invalid state here, so we shouldn't be reporting it as an
> >error.
...
> In principle, bond_miimon_commit should not see _BACK or _FAIL
> state as a new link state, because those states should be managed at the
> bond_miimon_inspect level (as they are the result of updelay and
> downdelay). These states should not be "committed" in the sense of
> causing notifications or doing actions that require RTNL.
>
> My recollection is that the "invalid new link" messages were the
> result of a bug in de77ecd4ef02, which was fixed in 1899bb325149
> ("bonding: fix state transition issue in link monitoring"), but maybe
> the RTNL problem here induces that in some other fashion.
>
> Either way, I believe this message is correct as-is.
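As I read it, the commit-side switch only has cases for the settled states
(NOCHANGE/UP/DOWN), so a slave still flagged _FAIL or _BACK falls through to
the default "invalid new link" branch -- rough paraphrase of the shape, not
verbatim bond_main.c:

	switch (slave->link_new_state) {
	case BOND_LINK_NOCHANGE:
		/* nothing to commit for this slave */
		continue;
	case BOND_LINK_UP:
		/* mark the slave up, pick an active slave, send notifications */
		break;
	case BOND_LINK_DOWN:
		/* mark the slave down, disable it, send notifications */
		break;
	default:
		/* BOND_LINK_FAIL (1) and BOND_LINK_BACK (3) end up here */
		slave_err(bond, slave, "invalid new link %d on slave\n",
			  slave->link_new_state);
		continue;
	}
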
For reference, with 5.10.1 and this script:
#!/bin/sh
# Flap one slave of a two-port bond once per second to exercise the
# mii monitor's link state handling.
slave1=ens4f0
slave2=ens4f1
modprobe -rv bonding
modprobe -v bonding mode=2 miimon=100 updelay=200
ip link set bond0 up
ifenslave bond0 $slave1 $slave2
sleep 5
while :
do
	ip link set $slave1 down
	sleep 1
	ip link set $slave1 up
	sleep 1
done
I get this repeating log output:
[ 9488.262291] sfc 0000:05:00.0 ens4f0: link up at 10000Mbps full-duplex (MTU 1500)
[ 9488.339508] bond0: (slave ens4f0): link status up, enabling it in 200 ms
[ 9488.339511] bond0: (slave ens4f0): invalid new link 3 on slave
[ 9488.547643] bond0: (slave ens4f0): link status definitely up, 10000 Mbps full duplex
[ 9489.276614] bond0: (slave ens4f0): link status definitely down, disabling slave
[ 9490.273830] sfc 0000:05:00.0 ens4f0: link up at 10000Mbps full-duplex (MTU 1500)
[ 9490.315540] bond0: (slave ens4f0): link status up, enabling it in 200 ms
[ 9490.315543] bond0: (slave ens4f0): invalid new link 3 on slave
[ 9490.523641] bond0: (slave ens4f0): link status definitely up, 10000 Mbps full duplex
[ 9491.356526] bond0: (slave ens4f0): link status definitely down, disabling slave
[ 9492.285249] sfc 0000:05:00.0 ens4f0: link up at 10000Mbps full-duplex (MTU 1500)
[ 9492.291522] bond0: (slave ens4f0): link status up, enabling it in 200 ms
[ 9492.291523] bond0: (slave ens4f0): invalid new link 3 on slave
[ 9492.499604] bond0: (slave ens4f0): link status definitely up, 10000 Mbps full duplex
[ 9493.331594] bond0: (slave ens4f0): link status definitely down, disabling slave
"invalid new link 3 on slave" is there every single time.
Side note: I'm not actually able to reproduce the repeating "link
status up, enabling it in 200 ms" messages and the failure to ever recover
from a downed link on this host; no clue why it's so readily reproducible
on another system.
--
Jarod Wilson
jarod@...hat.com