netdev - Re: [PATCH net] bonding: reduce rtnl lock contention in mii monitor thread

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAKfmpSe1o9_eFu70PPHT9MF5tMYjZqgajCfpDHCnsHQBUdcW0Q@mail.gmail.com>
Date:   Wed, 9 Dec 2020 10:08:05 -0500
From:   Jarod Wilson <jarod@...hat.com>
To:     Jakub Kicinski <kuba@...nel.org>
Cc:     LKML <linux-kernel@...r.kernel.org>,
        Mahesh Bandewar <maheshb@...gle.com>,
        Jay Vosburgh <j.vosburgh@...il.com>,
        Veaceslav Falico <vfalico@...il.com>,
        Andy Gospodarek <andy@...yhouse.net>,
        "David S. Miller" <davem@...emloft.net>,
        Thomas Davis <tadavis@....gov>, Netdev <netdev@...r.kernel.org>
Subject: Re: [PATCH net] bonding: reduce rtnl lock contention in mii monitor thread

On Tue, Dec 8, 2020 at 2:38 PM Jakub Kicinski <kuba@...nel.org> wrote:
>
> On Sat,  5 Dec 2020 18:43:54 -0500 Jarod Wilson wrote:
> > I'm seeing a system get stuck unable to bring a downed interface back up
> > when it's got an updelay value set, behavior which ceased when logging
> > spew was removed from bond_miimon_inspect(). I'm monitoring logs on this
> > system over another network connection, and it seems that the act of
> > spewing logs at all there increases rtnl lock contention, because
> > instrumented code showed bond_mii_monitor() never able to succeed in it's
> > attempts to call rtnl_trylock() to actually commit link state changes,
> > leaving the downed link stuck in BOND_LINK_DOWN. The system in question
> > appears to be fine with the log spew being moved to
> > bond_commit_link_state(), which is called after the successful
> > rtnl_trylock().
>
> But it's not called under rtnl_lock AFAICT. So something else is also
> spewing messages?
>
> While bond_commit_link_state() _is_ called under the lock. So you're
> increasing the retry rate, by putting the slow operation under the
> lock, is that right?

Partially, yes. I probably should have tagged this with RFC instead of
PATCH, tbh. My theory was that the log spew, being sent out *other*
network interfaces when monitoring the system or remote syslog or ssh
was potentially causing some rtnl_lock() calls, so not spewing until
after actually being able to grab the lock would lessen the problem
w/actually acquiring the lock, but I ... don't know offhand how to
verify that theory.


> Also isn't bond_commit_link_state() called from many more places?
> So we're adding new prints, effectively?

Ah. Crap. Yes. bond_set_slave_link_state() is called quite a few
places, and that in turn calls bond_commit_link_state().


> > I'm actually wondering if perhaps we ultimately need/want
> > some bond-specific lock here to prevent racing with bond_close() instead
> > of using rtnl, but this shift of the output appears to work. I believe
> > this started happening when de77ecd4ef02 ("bonding: improve link-status
> > update in mii-monitoring") went in, but I'm not 100% on that.
> >
> > The addition of a case BOND_LINK_BACK in bond_miimon_inspect() is somewhat
> > separate from the fix for the actual hang, but it eliminates a constant
> > "invalid new link 3 on slave" message seen related to this issue, and it's
> > not actually an invalid state here, so we shouldn't be reporting it as an
> > error.
>
> Let's make it a separate patch, then.

Sounds like Jay is confident that bit is valid, and I shouldn't be
ending up in that state, unless something else is going wrong.

-- 
Jarod Wilson
jarod@...hat.com