linux-kernel - Re: [PATCH v4 3/3] rcu: Use _full() API to debug synchronize


Message-ID: <cdab57a4-8d58-41d9-a9b5-71d425a7375e@paulmck-laptop>
Date: Fri, 28 Feb 2025 11:59:55 -0800
From: "Paul E. McKenney" <paulmck@...nel.org>
To: Uladzislau Rezki <urezki@...il.com>
Cc: Boqun Feng <boqun.feng@...il.com>, RCU <rcu@...r.kernel.org>,
	LKML <linux-kernel@...r.kernel.org>,
	Frederic Weisbecker <frederic@...nel.org>,
	Cheung Wall <zzqq0103.hey@...il.com>,
	Neeraj upadhyay <Neeraj.Upadhyay@....com>,
	Joel Fernandes <joel@...lfernandes.org>,
	Oleksiy Avramchenko <oleksiy.avramchenko@...y.com>
Subject: Re: [PATCH v4 3/3] rcu: Use _full() API to debug synchronize_rcu()

On Fri, Feb 28, 2025 at 08:12:51PM +0100, Uladzislau Rezki wrote:
> Hello, Paul!
> 
> > > > > > 
> > > > > > Except that I got this from overnight testing of rcu/dev on the shared
> > > > > > RCU tree:
> > > > > > 
> > > > > > WARNING: CPU: 5 PID: 14 at kernel/rcu/tree.c:1636 rcu_sr_normal_complete+0x5c/0x80
> > > > > > 
> > > > > > I see this only on TREE05.  Which should not be too surprising, given
> > > > > > that this is the scenario that tests it.  It happened within five minutes
> > > > > > on all 14 of the TREE05 runs.
> > > > > > 
> > > > > > Hm.. This is not fun. I tested this on my system and I did not manage to
> > > > > > trigger this, whereas you do. Something is wrong.
> > > > 
> > > > If you have a debug patch, I would be happy to give it a go.
> > > > 
> > > I can trigger it. But.
> > > 
> > > Some background. I tested those patches for many hours on the stable
> > > kernel, which is 6.13. On that kernel I was not able to trigger it. Running
> > > rcutorture on our shared "dev" tree, which I did now, triggers this
> > > right away.
> > 
> > Bisection?  (Hey, you knew that was coming!)
> > 
> Looks like this: rcu: Fix get_state_synchronize_rcu_full() GP-start detection
> 
> After revert in the dev, rcutorture passes TREE05, 16 instances.

Huh.  We sure don't get to revert that one...

Do we have a problem with the ordering in rcu_gp_init() between the calls
to rcu_seq_start() and portions of rcu_sr_normal_gp_init()?  For example,
do we need to capture the relevant portion of the list before the call
to rcu_seq_start(), and do the grace-period-start work afterwards?

My kneejerk (and possibly completely wrong) guess is that rcu_gp_init()
calls rcu_gp_start(), then there is a call to synchronize_rcu() whose
cookie says wait for the end of the next grace period, then we capture
the lists including this one that needs to wait longer.  Then when we
look at the end of the grace period, boom!  This would be a real bug due
to some CPU coming online between the time of the call to rcu_gp_start()
and synchronize_rcu().
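
To make the guess above concrete, here is a rough user-space model of the
sequence-counter arithmetic involved.  This is a sketch, not kernel code:
every model_ name is made up for illustration, the helpers only mimic the
general shape of the rcu_seq_*() functions, and the real cookies come from
get_state_synchronize_rcu_full(), which tracks more than this one counter.
It shows only that a cookie taken after the grace-period start names the
end of the *next* grace period, so a request carrying that cookie is still
incomplete when the current grace period ends.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Low-order bits hold grace-period state; upper bits count grace periods. */
#define MODEL_SEQ_CTR_SHIFT  2
#define MODEL_SEQ_STATE_MASK ((1UL << MODEL_SEQ_CTR_SHIFT) - 1)

/* Mark the start of a grace period: state bits go from 0 to 1. */
static void model_seq_start(unsigned long *sp)
{
	*sp += 1;
}

/* Mark the end of a grace period: round up to the next idle value. */
static void model_seq_end(unsigned long *sp)
{
	*sp = (*sp | MODEL_SEQ_STATE_MASK) + 1;
}

/* Cookie: the counter value at which a full grace period will have elapsed. */
static unsigned long model_seq_snap(const unsigned long *sp)
{
	return (*sp + 2 * MODEL_SEQ_STATE_MASK + 1) & ~MODEL_SEQ_STATE_MASK;
}

/* Has the grace period named by cookie s completed?  (Wrap-safe compare.) */
static bool model_seq_done(const unsigned long *sp, unsigned long s)
{
	return ULONG_MAX / 2 >= *sp - s;
}

/*
 * Replay the suspected interleaving.  Returns true if the late request,
 * assumed here to be captured along with the early one for the current
 * grace period, is still incomplete when that grace period ends.
 */
static bool model_late_cookie_is_incomplete(void)
{
	unsigned long gp_seq = 0;		/* idle */
	unsigned long early, late;

	early = model_seq_snap(&gp_seq);	/* request before GP start */
	model_seq_start(&gp_seq);		/* grace period begins */
	late = model_seq_snap(&gp_seq);		/* request after GP start */
	/* ...both requests are captured for this grace period here... */
	model_seq_end(&gp_seq);			/* grace period ends */

	assert(model_seq_done(&gp_seq, early));	/* early cookie is satisfied */
	return !model_seq_done(&gp_seq, late);	/* late cookie is not */
}
```

Calling model_late_cookie_is_incomplete() from a small main() returns true
in this model: the early cookie is 4, the late cookie is 8, and the counter
reads 4 once the grace period ends.  If the real code can capture such a
late request for the current grace period, the completion-time check would
see an unfinished cookie.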

Or is there some other way that this can happen?

							Thanx, Paul