linux-kernel - Re: call_rcu data race patch

Message-ID: <20210917211148.GU4156@paulmck-ThinkPad-P17-Gen-1>
Date:   Fri, 17 Sep 2021 14:11:48 -0700
From:   "Paul E. McKenney" <paulmck@...nel.org>
To:     Guillaume Morin <guillaume@...infr.org>
Cc:     linux-kernel@...r.kernel.org
Subject: Re: call_rcu data race patch

On Fri, Sep 17, 2021 at 09:15:57PM +0200, Guillaume Morin wrote:
> Hello Paul,
> 
> I've been researching some RCU warnings we see that lead to full lockups
> with longterm 5.x kernels.
> 
> Basically, the rcu_advance_cbs() == true warning in
> rcu_advance_cbs_nowake() fires, and then everything eventually gets
> stuck on RCU synchronization because the GP thread stays asleep while
> (rcu_state.gp_flags & 1) == 1 (this is on a bunch of nohz_full CPUs).
> 
> During that search I found your patch from July 12th
> https://www.spinics.net/lists/rcu/msg05731.html that seems related (all
> warnings we've seen happened in the __fput call path). Is there a reason
> this patch was not pushed? Is there an issue with this patch, or did it
> just fall through the cracks?

It is still in -rcu:

2431774f04d1 ("rcu: Mark accesses to rcu_state.n_force_qs")

It is slated for the v5.16 merge window.  But does it really fix the
problem that you are seeing?

> Thanks in advance for your help,
> 
> Guillaume.
> 
> PS: FYI during my research, I've found another similar report in bugzilla https://bugzilla.kernel.org/show_bug.cgi?id=208685

Huh.  First I have heard of it.  It looks like they hit this after about
nine days of uptime.  I have run way more than nine days of testing of
nohz_full RCU operation with rcutorture, and have never seen it myself.

Can you reproduce this?  If so, can you reproduce it on mainline kernels
(as opposed to -stable kernels as in that bugzilla)?

The theory behind that WARN_ON_ONCE() is as follows:

o	The check of rcu_seq_state(rcu_seq_current(&rnp->gp_seq))
	says that there is a grace period either in effect or just
	now ending.

o	In the latter case, the grace-period cleanup has not yet
	reached the current rcu_node structure, which means that
	it has not yet checked to see if another grace period
	is needed.

o	Either way, the RCU_GP_FLAG_INIT will cause the next grace
	period to start.  (This flag is protected by the root
	rcu_node structure's ->lock.)
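
As a rough userspace illustration of the first and third points above, the
sketch below models how the low-order bits of ->gp_seq encode grace-period
state and how RCU_GP_FLAG_INIT is tested. It is not kernel code: the
model_* structures and helpers are hypothetical names made up for this
example, the locking and READ_ONCE()/WRITE_ONCE() accesses are omitted, and
the constants and rcu_seq_*() arithmetic are written to follow
kernel/rcu/rcu.h and kernel/rcu/tree.h on the assumption that those
definitions have not changed in the kernel being debugged.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed to match kernel/rcu/rcu.h and kernel/rcu/tree.h. */
#define RCU_SEQ_CTR_SHIFT	2
#define RCU_SEQ_STATE_MASK	((1UL << RCU_SEQ_CTR_SHIFT) - 1)
#define RCU_GP_FLAG_INIT	0x1

/* Hypothetical stand-ins for struct rcu_node and rcu_state. */
struct model_rcu_node { unsigned long gp_seq; };
struct model_rcu_state { unsigned long gp_flags; };

/* Low-order bits of a sequence number: nonzero means "grace period active". */
static unsigned long rcu_seq_state(unsigned long s)
{
	return s & RCU_SEQ_STATE_MASK;
}

/* Mark the start of a grace period (state bits become 1). */
static void model_seq_start(unsigned long *sp)
{
	*sp += 1;
}

/* Mark the end of a grace period (next counter value, state bits 0). */
static void model_seq_end(unsigned long *sp)
{
	*sp = (*sp | RCU_SEQ_STATE_MASK) + 1;
}

/*
 * First bullet: this rcu_node still sees a grace period that is either
 * in effect or whose cleanup has not yet reached it.
 */
static bool model_node_sees_gp(const struct model_rcu_node *rnp)
{
	return rcu_seq_state(rnp->gp_seq) != 0;
}

/*
 * Third bullet: a new grace period has been requested.  Note the
 * parentheses; the unparenthesized form "gp_flags & 1 == 1" parses as
 * "gp_flags & (1 == 1)", which happens to give the same answer here.
 */
static bool model_next_gp_requested(const struct model_rcu_state *rsp)
{
	return (rsp->gp_flags & RCU_GP_FLAG_INIT) != 0;
}

/* Walk one grace period through the model and check each stage. */
static void model_self_check(void)
{
	struct model_rcu_node rnp = { .gp_seq = 5UL << RCU_SEQ_CTR_SHIFT };
	struct model_rcu_state rs = { .gp_flags = 0 };

	assert(!model_node_sees_gp(&rnp));	/* idle */
	model_seq_start(&rnp.gp_seq);
	assert(model_node_sees_gp(&rnp));	/* grace period in effect */
	rs.gp_flags |= RCU_GP_FLAG_INIT;	/* another one requested */
	assert(model_next_gp_requested(&rs));
	model_seq_end(&rnp.gp_seq);
	assert(!model_node_sees_gp(&rnp));	/* cleanup reached this node */
	assert(rnp.gp_seq == 6UL << RCU_SEQ_CTR_SHIFT);
}
```

In terms of this model, the hang reported above corresponds to
model_next_gp_requested() staying true while the grace-period kthread
remains asleep, which is the combination that the reasoning behind the
WARN_ON_ONCE() says should not arise.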

Again, can you reproduce this, especially in mainline?

							Thanx, Paul