Message-ID: <20210929175333.GW880162@paulmck-ThinkPad-P17-Gen-1>
Date: Wed, 29 Sep 2021 10:53:33 -0700
From: "Paul E. McKenney" <paulmck@...nel.org>
To: Vladimir Oltean <vladimir.oltean@....com>
Cc: netdev@...r.kernel.org, rcu <rcu@...r.kernel.org>,
Roopa Prabhu <roopa@...dia.com>,
Nikolay Aleksandrov <nikolay@...dia.com>,
"David S. Miller" <davem@...emloft.net>,
Jakub Kicinski <kuba@...nel.org>,
Tobias Waldekranz <tobias@...dekranz.com>,
Florian Fainelli <f.fainelli@...il.com>,
Andrew Lunn <andrew@...n.ch>,
Vivien Didelot <vivien.didelot@...il.com>,
Vladimir Oltean <olteanv@...il.com>
Subject: Re: [RFC PATCH net] net: dsa: tag_dsa: fix suspicious rcu_dereference_check() with br_vlan_get_pvid_rcu

On Wed, Sep 29, 2021 at 02:37:08AM +0300, Vladimir Oltean wrote:
> __dev_queue_xmit(), which is our caller, does run under rcu_read_lock_bh(),
> but in my foolishness I had thought this would be enough to make the access
> safe; lockdep complains that rcu_read_lock() is not held.

Depending on exactly which primitive is complaining, you can inform
lockdep of your intentions. For example, you can change
rcu_dereference(p) to rcu_dereference_bh(p). Or you can change:

        list_for_each_entry_rcu(p, &lh, field) {
        ...

To:

        list_for_each_entry_rcu(p, &lh, field, rcu_read_lock_bh_held()) {
        ...

And hlist_for_each_entry_rcu() can also take that same optional
lockdep parameter.
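
To make that concrete, here is a minimal sketch (invented struct, list,
and function names, nothing from the bridge or DSA code) of a traversal
that only ever runs under rcu_read_lock_bh() and says so to lockdep:

        #include <linux/rculist.h>
        #include <linux/rcupdate.h>

        /* Invented struct, list, and function, only to show the annotation. */
        struct foo {
                int val;
                struct list_head list;
        };

        static LIST_HEAD(foo_list);

        /* Called only from contexts that hold rcu_read_lock_bh(), for
         * example the networking TX path; the fourth argument tells
         * lockdep that this is the protection being relied upon.
         */
        static void foo_walk(void)
        {
                struct foo *p;

                list_for_each_entry_rcu(p, &foo_list, list,
                                        rcu_read_lock_bh_held()) {
                        /* use p->val here */
                }
        }

The rcu_dereference_bh() form covers the single-pointer case in the
same way.
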
> Which it isn't - as it turns out, RCU preempt and RCU-bh are two
> different flavors, and although Paul McKenney has consolidated
> synchronize_rcu() to wait for both preempt as well as bh read-side
> critical sections [1], the reader-side API is different, and so are
> the lockdep maps and keys.
>
> The bridge calls synchronize_rcu() in br_vlan_flush(), and this does
> wait for our TX fastpath reader of the br_vlan_group_rcu to complete
> even though it is in an rcu-bh read-side section. So even though this is
> in principle safe, to lockdep this is a case of "who are you? I don't know
> you, you're suspicious".
>
> Side note, I still don't really understand the different RCU flavors.
RCU BH was there to handle denial-of-service networking loads. Changes
over the years to RCU and to softirq have rendered it obsolete. But
rcu_read_lock_bh() still disables softirq for you.

RCU Sched provides the original semantics.

RCU Preempt, as the name suggests, allows RCU readers to be preempted.
Of course, if you are using rcu_read_lock_sched() or rcu_read_lock_bh(),
you are disabling preemption across the critical section.
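
If a concrete picture helps, toy readers for the preempt and bh flavors
might look like this (the struct and the global pointer are invented
for illustration):

        #include <linux/rcupdate.h>

        /* Invented RCU-protected global, for illustration only. */
        struct bar {
                int val;
        };

        static struct bar __rcu *bar_gp;

        /* RCU-preempt reader: may be preempted within the critical section. */
        static int bar_read(void)
        {
                struct bar *p;
                int val = -1;

                rcu_read_lock();
                p = rcu_dereference(bar_gp);
                if (p)
                        val = p->val;
                rcu_read_unlock();
                return val;
        }

        /* RCU-bh reader: also keeps softirqs (and thus preemption) off. */
        static int bar_read_bh(void)
        {
                struct bar *p;
                int val = -1;

                rcu_read_lock_bh();
                p = rcu_dereference_bh(bar_gp);
                if (p)
                        val = p->val;
                rcu_read_unlock_bh();
                return val;
        }

In a post-consolidation kernel, a single synchronize_rcu() on the
update side waits for both of these.
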
> For example, as far as I can see, the core network stack has never
> directly called synchronize_rcu_bh, not even once. Just the initial
> synchronize_kernel(), replaced later with the RCU preempt variant -
> synchronize_rcu(). Very very long story short, dev_queue_xmit has
> started calling this exact variant - rcu_read_lock_bh() - since [2], to
> make dev_deactivate properly wait for network interfaces with
> NETIF_F_LLTX to finish their dev_queue_xmit(). But that relied on an
> existing synchronize_rcu(), not synchronize_rcu_bh(). So does this mean
> that synchronize_net() never really waited for the rcu-bh critical
> section in dev_queue_xmit to finish? I've no idea.
The pre-consolidation Linux kernel v4.16 has these calls to
synchronize_rcu_bh():

        drivers/net/team/team.c                team_port_disable_netpoll  1094  synchronize_rcu_bh();
        drivers/vhost/net.c                    vhost_net_release          1027  synchronize_rcu_bh();
        net/netfilter/ipset/ip_set_hash_gen.h  mtype_resize                667  synchronize_rcu_bh();
But to your point, nothing in net/core.
And for v4.16 kernels built with CONFIG_PREEMPT=y, there is no guarantee
that synchronize_rcu() will wait for a rcu_read_lock_bh() critical
section. A CPU in such a critical section could take a scheduling-clock
interrupt, notice that it was not in an rcu_read_lock() critical section,
and report a quiescent state, which could well end that grace period.
But as you say, in more recent kernels, synchronize_rcu() will indeed
wait for rcu_read_lock_bh() critical sections.
But please be very careful when backporting.
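
If you do backport, one belt-and-suspenders sketch for a
pre-consolidation CONFIG_PREEMPT=y kernel is simply to wait for both
flavors explicitly (on consolidated kernels the second call would be
redundant):

        #include <linux/rcupdate.h>

        /* Backport sketch only: on a pre-consolidation CONFIG_PREEMPT=y
         * kernel, wait for both RCU-preempt and RCU-bh readers before
         * freeing anything a bh-context reader might still be touching.
         */
        static void wait_for_all_rcu_readers(void)
        {
                synchronize_rcu();       /* rcu_read_lock() readers */
                synchronize_rcu_bh();    /* rcu_read_lock_bh() readers */
        }
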
> So basically there are multiple options.
>
> First would be to duplicate br_vlan_get_pvid_rcu() into a new
> br_vlan_get_pvid_rcu_bh() to appease lockdep for the TX path case. But
> this function already has another brother, br_vlan_get_pvid(), which is
> protected by the update-side rtnl_mutex. We don't want to grow the
> family too much, especially since br_vlan_get_pvid_rcu_bh() would not
> be a function used by the bridge at all, just exported by it and used by
> the DSA layer.
>
> The option of getting to the bottom of why __dev_queue_xmit uses
> rcu-bh, and splitting that into local_bh_disable + rcu_read_lock, as it
> was before [3], might be impractical. There have been 15 years of
> development since then, and there are lots of code paths that use
> rcu_dereference_bh() in the TX path. Plus, with the consolidation work
> done in [1], I'm not even sure what the practical benefits of rcu-bh are
> any longer, if the whole point was for synchronize_rcu() to wait for
> everything in sight - how can spammy softirqs like networking paint
> themselves red any longer, and how can certain RCU updaters avoid waiting
> for them now, in order to avoid denial of service? It doesn't appear
> possible from the distance at which I'm looking at the problem.
> So the effort of converting __dev_queue_xmit from rcu-bh to rcu-preempt
> would only appear justified if it went together with the complete
> elimination of rcu-bh. Also, it would appear to be quite a strange and
> roundabout way to fix a "suspicious RCU usage" lockdep message.
The thing to be very careful of is code that might be implicitly assuming
that it cannot be interrupted by a softirq handler. This assumption will
of course be violated by changing rcu_read_lock_bh() to rcu_read_lock().
The resulting low-probability subtle breakage might be hard to find.
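
If someone did attempt that conversion, one way to keep such assumptions
intact would be to make the softirq exclusion explicit rather than
getting it as a side effect of rcu_read_lock_bh(). A sketch, not a
recommendation for any particular call site:

        #include <linux/bottom_half.h>
        #include <linux/rcupdate.h>

        /* Hypothetical shape of a converted section: softirqs stay disabled
         * explicitly, so any "a softirq cannot interrupt me here" assumption
         * still holds, while lockdep sees the rcu_read_lock() it expects.
         */
        static void tx_fast_path_section(void)
        {
                local_bh_disable();
                rcu_read_lock();

                /* ... read RCU-protected state, transmit, etc. ... */

                rcu_read_unlock();
                local_bh_enable();
        }
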
> Last, it appears possible to just give lockdep what it wants, and hold
> an rcu-preempt read-side critical section when calling br_vlan_get_pvid_rcu
> from the TX path. In terms of lines of code and amount of thought needed,
> it is certainly the easiest path forward, even though it incurs a small
> (negligible, and avoidable at that) performance overhead. This is what
> this patch does, for lack of a deeper understanding of lockdep, RCU or
> the network transmission process.
>
> [1] https://lwn.net/Articles/777036/
> [2] commit d4828d85d188 ("[NET]: Prevent transmission after dev_deactivate")
> [3] commit 43da55cbd54e ("[NET]: Do less atomic count changes in dev_queue_xmit.")
>
> Fixes: d82f8ab0d874 ("net: dsa: tag_dsa: offload the bridge forwarding process")
> Signed-off-by: Vladimir Oltean <vladimir.oltean@....com>
Of course, if no one really needs rcu_read_lock_bh() anymore, I would be
quite happy to simplify my life by getting rid of it. ;-)
Thanx, Paul
> ---
> net/dsa/tag_dsa.c | 5 ++---
> 1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/net/dsa/tag_dsa.c b/net/dsa/tag_dsa.c
> index 77d0ce89ab77..178464cd2bdb 100644
> --- a/net/dsa/tag_dsa.c
> +++ b/net/dsa/tag_dsa.c
> @@ -150,10 +150,9 @@ static struct sk_buff *dsa_xmit_ll(struct sk_buff *skb, struct net_device *dev,
>           * that's where the packets ingressed from.
>           */
>          if (!br_vlan_enabled(br)) {
> -                /* Safe because __dev_queue_xmit() runs under
> -                 * rcu_read_lock_bh()
> -                 */
> +                rcu_read_lock();
>                  err = br_vlan_get_pvid_rcu(br, &pvid);
> +                rcu_read_unlock();
>                  if (err)
>                          return NULL;
>          }
> --
> 2.25.1
>