netdev - Re: [RFC PATCH net-next 05/13] net: bonding: hold the netdev lists lock when retrieving device statistics

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20201209011404.lkglsvtq3k45fxp2@skbuf>
Date:   Wed, 9 Dec 2020 01:14:05 +0000
From:   Vladimir Oltean <vladimir.oltean@....com>
To:     Jakub Kicinski <kuba@...nel.org>
CC:     "David S . Miller" <davem@...emloft.net>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        Andrew Lunn <andrew@...n.ch>,
        Florian Fainelli <f.fainelli@...il.com>,
        Paul Gortmaker <paul.gortmaker@...driver.com>,
        Pablo Neira Ayuso <pablo@...filter.org>,
        Jiri Benc <jbenc@...hat.com>,
        Cong Wang <xiyou.wangcong@...il.com>,
        Jamal Hadi Salim <jhs@...atatu.com>,
        Stephen Hemminger <stephen@...workplumber.org>,
        Eric Dumazet <edumazet@...gle.com>,
        George McCollister <george.mccollister@...il.com>,
        Oleksij Rempel <o.rempel@...gutronix.de>,
        Jay Vosburgh <j.vosburgh@...il.com>,
        Veaceslav Falico <vfalico@...il.com>,
        Andy Gospodarek <andy@...yhouse.net>
Subject: Re: [RFC PATCH net-next 05/13] net: bonding: hold the netdev lists
 lock when retrieving device statistics

On Tue, Dec 08, 2020 at 04:17:37PM -0800, Jakub Kicinski wrote:
> On Wed, 9 Dec 2020 00:03:56 +0000 Vladimir Oltean wrote:
> > On Tue, Dec 08, 2020 at 03:57:44PM -0800, Jakub Kicinski wrote:
> > > On Mon, 7 Dec 2020 01:00:40 +0000 Vladimir Oltean wrote:
> > > > - ensuring through convention that user space always takes
> > > >   net->netdev_lists_lock when calling dev_get_stats, and documenting
> > > >   that, and therefore making it unnecessary to lock in bonding.
> > >
> > > This seems like the better option to me. Makes the locking rules pretty
> > > clear.
> >
> > It is non-obvious to me that top-level callers of dev_get_stats should
> > hold a lock as specific as the one protecting the lists of network
> > interfaces. In the vast majority of implementations of dev_get_stats,
> > that lock would not actually protect anything, which would lead into
> > just one more lock that is used for more than it should be. In my tree I
> > had actually already switched over to mutex_lock_nested. Nonetheless, I
> > am still open if you want to make the case that simplicity should prevail
> > over specificity.
>
> What are the locking rules you have in mind then? Caller may hold RTNL
> or ifc mutex?

Well, it's clear that calling mutex_lock_nested would only silence lockdep,
there would still be this non-reentrant mutex that will be held from a
potentially recursive code path. So it a non-solution, and even worse
than using plain mutex_lock, because at least that is detectable when it
locks up the system.

net_failover and bonding are the only drivers that are creating this
recursivity requirement in dev_get_stats. Other one-over-many stackable
interfaces, like the bridge, just use dev_get_tstats64. I'm almost
thinking that it would be cleaner to convert these two to dev_get_tstats64
too, that would simplify things enormously. Even team uses something
that is based on software counters, something reminiscent of
dev_get_tstats64, definitely not counters retrieved from the underlying
device. Of course, the disadvantage with doing that is going to be that
virtual interfaces cannot retrieve statistics recursively from their
lower interface. I'm trying to think how much of a real disadvantage
that will be. For offloaded interfaces they will be completely off,
that's for sure. And this is one of the reasons that mandated the DSA
rework to expose MAC-based counters in dev_get_stats in the first place.
But thinking in the larger sense. An offloading interface that supports
IP forwarding, with 50 VLAN uppers. How could the statistics counters of
those VLAN uppers ever be correct. It's not realistic to expect of the
underlying hardware to keep separate statistics for each upper, that the
upper could then go ahead and just query. Sure, in the cases where a
lower can have only one upper at a time (bridge, bonding) what I said is
not applicable, but the general goal of having accurate counters for
offloading interfaces everywhere seems to not be really achievable.

In case this doesn't work out, I guess I'll have to document that
dev_get_stats is recursive, and all mutual exclusion based locks need to
be taken upfront, which is the only strategy that works with recursion
that I know of.

> > But in that case, maybe we should just keep on using the RTNL mutex.
>
> That's a wasted opportunity, RTNL lock is pretty busy.

It is certainly easy to use and easy to enforce though. It also seems to
protect everything, which is generally the reason why you tend to not
think a lot when using it.

To be clear, I am not removing the RTNL mutex from any place where it is
held today. Letting me do that would be like letting a bull in a china
shop. I am only creating a sub-lock of it, which is protecting a subset
of what the RTNL mutex is protecting. By sub-lock I mean that code paths
that currently hold the RTNL mutex, like list_netdevice(), will also
take net->netdev_lists_lock - never the other way around, that would
create lock inversion. The plan is to then require new code that
iterates through network interface lists to use the new locking scheme
and not RTNL mutex, and in parallel to migrate, on a best-effort basis,
code that runs under ASSERT_RTNL + net->netdev_lists_lock to only take
the net->netdev_lists_lock and be RTNL-free.

But is that any better at runtime than just taking the RTNL mutex, will
it make it less contended? Nope, due to the indirect serialization
effect that is created by the fact that net->netdev_lists_lock is a
sub-lock of the RTNL mutex. Say net/core/net-procfs.c holds
net->netdev_lists_lock while dumping the statistics from a firmware and
sleeping while doing so. All is rosy in its RTNL mutex free code path.
Then there comes along another thread which calls for_each_netdev,
something else which today requires the RTNL mutex but could be reworked
to use the net->netdev_lists_lock. Let's assume we are at a stage where
the gazillion places that call for_each_netdev() have been converted to
also take the net->netdev_lists_lock. But they couldn't be converted to
also _not_ take the RTNL mutex, or at least all of them. This means that
there will be code paths that try to take the RTNL mutex and end up
waiting for our net/core/net-procfs.c to complete dumping the firmware.