Message-ID: <20210917211148.GU4156@paulmck-ThinkPad-P17-Gen-1>
Date: Fri, 17 Sep 2021 14:11:48 -0700
From: "Paul E. McKenney" <paulmck@...nel.org>
To: Guillaume Morin <guillaume@...infr.org>
Cc: linux-kernel@...r.kernel.org
Subject: Re: call_rcu data race patch
On Fri, Sep 17, 2021 at 09:15:57PM +0200, Guillaume Morin wrote:
> Hello Paul,
>
> I've been researching some RCU warnings we see that lead to full lockups
> with longterm 5.x kernels.
>
> Basically, the rcu_advance_cbs() == true warning in
> rcu_advance_cbs_nowake() is firing, and then everything eventually gets
> stuck on RCU synchronization because the GP kthread stays asleep while
> rcu_state.gp_flags & 1 == 1. (These are a bunch of nohz_full CPUs.)
>
> During that search I found your patch from July 12th
> https://www.spinics.net/lists/rcu/msg05731.html that seems related (all
> warnings we've seen happened in the __fput call path). Is there a reason
> this patch was not pushed? Is there an issue with this patch, or did it
> just fall through the cracks?
It is still in -rcu:
2431774f04d1 ("rcu: Mark accesses to rcu_state.n_force_qs")
It is slated for the v5.16 merge window. But does it really fix the
problem that you are seeing?
> Thanks in advance for your help,
>
> Guillaume.
>
> PS: FYI during my research, I've found another similar report in bugzilla https://bugzilla.kernel.org/show_bug.cgi?id=208685
Huh. First I have heard of it. It looks like they hit this after about
nine days of uptime. I have run way more than nine days of testing of
nohz_full RCU operation with rcutorture, and have never seen it myself.
Can you reproduce this? If so, can you reproduce it on mainline kernels
(as opposed to -stable kernels as in that bugzilla)?
The theory behind that WARN_ON_ONCE() is as follows:
o The check of rcu_seq_state(rcu_seq_current(&rnp->gp_seq))
says that there is a grace period either in effect or just
now ending.
o In the latter case, the grace-period cleanup has not yet
reached the current rcu_node structure, which means that
it has not yet checked to see if another grace period
is needed.
o	Either way, RCU_GP_FLAG_INIT will cause the next grace
	period to start.  (This flag is protected by the root
	rcu_node structure's ->lock.)
Again, can you reproduce this, especially in mainline?
Thanx, Paul