[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20210929232734.GE880162@paulmck-ThinkPad-P17-Gen-1>
Date: Wed, 29 Sep 2021 16:27:34 -0700
From: "Paul E. McKenney" <paulmck@...nel.org>
To: Vladimir Oltean <vladimir.oltean@....com>
Cc: "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
rcu <rcu@...r.kernel.org>, Roopa Prabhu <roopa@...dia.com>,
Nikolay Aleksandrov <nikolay@...dia.com>,
"David S. Miller" <davem@...emloft.net>,
Jakub Kicinski <kuba@...nel.org>,
Tobias Waldekranz <tobias@...dekranz.com>,
Florian Fainelli <f.fainelli@...il.com>,
Andrew Lunn <andrew@...n.ch>,
Vivien Didelot <vivien.didelot@...il.com>,
Vladimir Oltean <olteanv@...il.com>
Subject: Re: [RFC PATCH net] net: dsa: tag_dsa: fix suspicious
rcu_dereference_check() with br_vlan_get_pvid_rcu
On Wed, Sep 29, 2021 at 09:28:22PM +0000, Vladimir Oltean wrote:
> On Wed, Sep 29, 2021 at 10:53:33AM -0700, Paul E. McKenney wrote:
> > On Wed, Sep 29, 2021 at 02:37:08AM +0300, Vladimir Oltean wrote:
> > > __dev_queue_xmit(), which is our caller, does run under rcu_read_lock_bh(),
> > > but in my foolishness I had thought this would be enough to make the
> > > access, lockdep complains that rcu_read_lock() is not held.
> >
> > Depending on exactly which primitive is complaining, you can inform
> > lockdep of your intentions. For example, you can change
> > rcu_dereference(p) to rcu_dereference_bh(p). Or you can change:
> >
> > list_for_each_entry_rcu(p, &lh, field) {
> > ...
> >
> > To:
> >
> > list_for_each_entry_rcu(p, &lh, field, rcu_read_lock_bh_held()) {
> > ...
> >
> > And hlist_for_each_entry_rcu() can also take that same optional
> > lockdep parameter.
>
> This is covered by my first option below, basically I don't want to
> bloat the Ethernet bridge driver too much with a single user, and that
> user being outside the bridge code itself, at that. I'm sure Nikolay and
> Roopa would agree :)
No need to duplicate code. You can pass in both like this:
rcu_dereference_check(p, rcu_read_lock_bh_held() || rcu_read_lock_held());
But if just switching rcu_read_lock() really works for you, that is
easier.
> The bridge xmit function - br_dev_xmit - takes the rcu_preempt
> rcu_read_lock() too. Although I don't think it is very high-overhead in
> any incarnation, I think it would be seen as a positive improvement if
> it could be removed from that path too (or would it?)
Both rcu_read_lock() and preempt_disable() cost very little, but it
is not zero cost. The real question is of course "can you see it in a
system-level benchmark?" Or even in a microbenchmark.
> > > Which it isn't - as it turns out, RCU preempt and RCU-bh are two
> > > different flavors, and although Paul McKenney has consolidated
> > > synchronize_rcu() to wait for both preempt as well as bh read-side
> > > critical sections [1], the reader-side API is different, the lockdep
> > > maps and keys are different.
> > >
> > > The bridge calls synchronize_rcu() in br_vlan_flush(), and this does
> > > wait for our TX fastpath reader of the br_vlan_group_rcu to complete
> > > even though it is in an rcu-bh read side section. So even though this is
> > > in premise safe, to lockdep this is a case of "who are you? I don't know
> > > you, you're suspicious".
> > >
> > > Side note, I still don't really understand the different RCU flavors.
> >
> > RCU BH was there to handle denial-of-service networking loads.
> > Changes over the years to RCU and to softirq have rendered it obsolete.
> > But rcu_read_lock_bh() still disables softirq for you.
>
> Thank you, I guess? :)
You are welcome, I guess? ;-)
> > RCU Sched provides the original semantics.
> >
> > RCU Preempt, as the name suggests, allows RCU readers to be preempted.
> > Of course, if you are using rcu_read_lock_sched() or rcu_read_lock_bh(),
> > you are disabling preemption across the critical section.
> >
> > > For example, as far as I can see, the core network stack has never
> > > directly called synchronize_rcu_bh, not even once. Just the initial
> > > synchronize_kernel(), replaced later with the RCU preempt variant -
> > > synchronize_rcu(). Very very long story short, dev_queue_xmit has
> > > started calling this exact variant - rcu_read_lock_bh() - since [2], to
> > > make dev_deactivate properly wait for network interfaces with
> > > NETIF_F_LLTX to finish their dev_queue_xmit(). But that relied on an
> > > existing synchronize_rcu(), not synchronize_rcu_bh(). So does this mean
> > > that synchronize_net() never really waited for the rcu-bh critical
> > > section in dev_queue_xmit to finish? I've no idea.
> >
> > The pre-consolidation Linux kernel v4.16 has these calls to
> > synchronize_rcu_bh():
> >
> > drivers/net/team/team.c team_port_disable_netpoll 1094 synchronize_rcu_bh();
> > drivers/vhost/net.c vhost_net_release 1027 synchronize_rcu_bh();
> > net/netfilter/ipset/ip_set_hash_gen.h mtype_resize 667 synchronize_rcu_bh();
> >
> > But to your point, nothing in net/core.
> >
> > And for v4.16 kernels build with CONFIG_PREEMPT=y, there is no guarantee
> > that synchronize_rcu() will wait for a rcu_read_lock_bh() critical
> > section. A CPU in such a critical section could take a scheduling-clock
> > interrupt, notice that it was not in an rcu_read_lock() critical section,
> > and report a quiescent state, which could well end that grace period.
>
> I find it really hard to believe that commit d4828d85d188
> ("[NET]: Prevent transmission after dev_deactivate") did not provide the
> guarantee it promised. It seems much more likely that I'm missing something,
> although I don't see what :)
You guys are the ones who get bitten if there is a problem, so I will
let you guys check it out. Or not, at your option. ;-)
> But the team driver, which I did notice, and which you've linked to
> above as well, did have a comment which suggested that yes, if you don't
> call synchronize_rcu_bh(), you don't really wait for __dev_queue_xmit()
> to finish.
And that did change, but then again I changed that to synchronize_rcu()
late in the consolidation effort.
> > But as you say, in more recent kernels, synchronize_rcu() will indeed
> > wait for rcu_read_lock_bh() critical sections.
> >
> > But please be very careful when backporting.
>
> No concerns with backporting, the code in question was added last month
> or so.
Very good!
> > > So basically there are multiple options.
> > >
> > > First would be to duplicate br_vlan_get_pvid_rcu() into a new
> > > br_vlan_get_pvid_rcu_bh() to appease lockdep for the TX path case. But
> > > this function already has another brother, br_vlan_get_pvid(), which is
> > > protected by the update-side rtnl_mutex. We don't want to grow the
> > > family too big too, especially since br_vlan_get_pvid_rcu_bh() would not
> > > be a function used by the bridge at all, just exported by it and used by
> > > the DSA layer.
> > >
> > > The option of getting to the bottom of why does __dev_queue_xmit use
> > > rcu-bh, and splitting that into local_bh_disable + rcu_read_lock, as it
> > > was before [3], might be impractical. There have been 15 years of
> > > development since then, and there are lots of code paths that use
> > > rcu_dereference_bh() in the TX path. Plus, with the consolidation work
> > > done in [1], I'm not even sure what are the practical benefits of rcu-bh
> > > any longer, if the whole point was for synchronize_rcu() to wait for
> > > everything in sight - how can spammy softirqs like networking paint
> > > themselves red any longer, and how can certain RCU updaters not wait for
> > > them now, in order to avoid denial of service? It doesn't appear
> > > possible from the distance from which I'm looking at the problem.
> > > So the effort of converting __dev_queue_xmit from rcu-bh to rcu-preempt
> > > would only appear justified if it went together with the complete
> > > elimination of rcu-bh. Also, it would appear to be quite a strange and
> > > roundabout way to fix a "suspicious RCU usage" lockdep message.
> >
> > The thing to be very careful of is code that might be implicitly assuming
> > that it cannot be interrupted by a softirq handler. This assumption will
> > of course be violated by changing rcu_read_lock_bh() to rcu_read_lock().
> > The resulting low-probability subtle breakage might be hard to find.
>
> Of course, the networking code would not change functionally with the
> removal of rcu_read_lock_bh(). So rcu_read_lock_bh() would be
> transformed into local_bh_disable() + rcu_read_lock(). I am still a bit
> unclear on the details, but there are reasons why we need softirqs
> disabled - we don't call hard_start_xmit just from the NET_TX softirq,
> that would be just too nice :)
>
> > > Last, it appears possible to just give lockdep what it wants, and hold
> > > an rcu-preempt read-side critical section when calling br_vlan_get_pvid_rcu
> > > from the TX path. In terms of lines of code and amount of thought needed
> > > it is certainly the easiest path forward, even though it incurs a small
> > > (negligible) performance overhead (and avoidable, at that). This is what
> > > this patch does, in lack of a deeper understanding of lockdep, RCU or
> > > the network transmission process.
> > >
> > > [1] https://lwn.net/Articles/777036/
> > > [2] commit d4828d85d188 ("[NET]: Prevent transmission after dev_deactivate")
> > > [3] commit 43da55cbd54e ("[NET]: Do less atomic count changes in dev_queue_xmit.")
> > >
> > > Fixes: d82f8ab0d874 ("net: dsa: tag_dsa: offload the bridge forwarding process")
> > > Signed-off-by: Vladimir Oltean <vladimir.oltean@....com>
> >
> > Of course, if no one really needs rcu_read_lock_bh() anymore, I would be
> > quite happy to simplify my life by getting rid of it. ;-)
> >
> > Thanx, Paul
>
> The basic idea that updaters could be resilient against softirq storms
> sounds great in principle. Although if I understand correctly, that went
> away with the consolidation. So again, it isn't that some resiliency
> wouldn't be nice, but I'm looking at the current code and I just don't
> see what the benefits of rcu_bh are. If the answer is as self-evident as
> a naive person like me thinks it is - i.e. rcu_read_lock_bh is just a
> glorified version of rcu_read_lock which also disables softirqs, just
> with different lockdep rules - then is the consolidation really complete?
> Couldn't the bh-disable readers be modified to just open-code the
> disabling of softirqs, and resolve the lockdep issues that ensue from
> having a separate flavor?
As part of the consolidation effort, I made synchronize_rcu()
also be resilient against softirq storms in the same way that
synchronize_rcu_bh() used to be. For example, that is why there is now
a call to rcu_softirq_qs() in __do_softirq().
Why didn't we do it that way to start with? Because this approach relies
on a bunch of subsequent improvements in softirq in general and ksoftirqd
in particular.
If you change all instance of "rcu_read_lock_bh()" to instead be
"local_bh_disable(); rcu_read_lock()", and of "rcu_read_unlock_bh()"
to "rcu_read_unlock(); local_bh_enable()" that would work. But
there are probably many that can now be just "rcu_read_lock()" and
"rcu_read_unlock()". Figuring out which is which is a task for someone
who understands that code way better than I do.
And for all I know, there might be some situation that really wants
and rcu_read_lock_bh() for some reason. But that does seem to me
to be unlikely, and it would be good to get rid of the _bh() variants.
If the uses go away, I am more than happy to get rid of the definitions!
Thanx, Paul
> > > ---
> > > net/dsa/tag_dsa.c | 5 ++---
> > > 1 file changed, 2 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/net/dsa/tag_dsa.c b/net/dsa/tag_dsa.c
> > > index 77d0ce89ab77..178464cd2bdb 100644
> > > --- a/net/dsa/tag_dsa.c
> > > +++ b/net/dsa/tag_dsa.c
> > > @@ -150,10 +150,9 @@ static struct sk_buff *dsa_xmit_ll(struct sk_buff *skb, struct net_device *dev,
> > > * that's where the packets ingressed from.
> > > */
> > > if (!br_vlan_enabled(br)) {
> > > - /* Safe because __dev_queue_xmit() runs under
> > > - * rcu_read_lock_bh()
> > > - */
> > > + rcu_read_lock();
> > > err = br_vlan_get_pvid_rcu(br, &pvid);
> > > + rcu_read_unlock();
> > > if (err)
> > > return NULL;
> > > }
> > > --
> > > 2.25.1
> > >
Powered by blists - more mailing lists