linux-kernel - Re: [PATCH tip/core/rcu 13/22] rcu: Fix grace-period hangs due to race with CPU offline

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <20180626234004.GQ3593@linux.vnet.ibm.com>
Date:   Tue, 26 Jun 2018 16:40:04 -0700
From:   "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     linux-kernel@...r.kernel.org, mingo@...nel.org,
        jiangshanlai@...il.com, dipankar@...ibm.com,
        akpm@...ux-foundation.org, mathieu.desnoyers@...icios.com,
        josh@...htriplett.org, tglx@...utronix.de, rostedt@...dmis.org,
        dhowells@...hat.com, edumazet@...gle.com, fweisbec@...il.com,
        oleg@...hat.com, joel@...lfernandes.org
Subject: Re: [PATCH tip/core/rcu 13/22] rcu: Fix grace-period hangs due to
 race with CPU offline

On Tue, Jun 26, 2018 at 10:32:25PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 26, 2018 at 01:26:15PM -0700, Paul E. McKenney wrote:
> > commit 2e5b2ff4047b138d6b56e4e3ba91bc47503cdebe
> > Author: Paul E. McKenney <paulmck@...ux.vnet.ibm.com>
> > Date:   Fri May 25 19:23:09 2018 -0700
> > 
> >     rcu: Fix grace-period hangs due to race with CPU offline
> >     
> >     Without special fail-safe quiescent-state-propagation checks, grace-period
> >     hangs can result from the following scenario:
> >     
> >     1.      CPU 1 goes offline.
> >     
> >     2.      Because CPU 1 is the only CPU in the system blocking the current
> >             grace period, the grace period ends as soon as
> >             rcu_cleanup_dying_idle_cpu()'s call to rcu_report_qs_rnp()
> >             returns.
> 
> My current code doesn't have that call... So this is a new problem
> earlier in this series.
> 
> >     3.      At this point, the leaf rcu_node structure's ->lock is no longer
> >             held: rcu_report_qs_rnp() has released it, as it must in order
> >             to awaken the RCU grace-period kthread.
> >     
> >     4.      At this point, that same leaf rcu_node structure's ->qsmaskinitnext
> >             field still records CPU 1 as being online.  This is absolutely
> >             necessary because the scheduler uses RCU (in this case on the
> >             wake-up path while awakening RCU's grace-period kthread), and
> >             ->qsmaskinitnext contains RCU's idea as to which CPUs are online.
> >             Therefore, invoking rcu_report_qs_rnp() after clearing CPU 1's
> >             bit from ->qsmaskinitnext would result in a lockdep-RCU splat
> >             due to RCU being used from an offline CPU.
> 
> Argh.. so it's your own wakeup!

That it is.  The outgoing CPU might need to wake up RCU's grace-period
kthread, for example, if that CPU was the last thing holding up the
current grace period.

> This all still smells really bad. But let me try and figure out where
> you introduced the problem.

Well, I can't say that it is the code that I am the most proud of...

On the other hand, there are a number of constraints on this code:

o	The ->qsmask bits indicate which CPUs the current grace period
	is still waiting on.

o	The ->qsmaskinit bits indicate which CPUs the current grace
	period was waiting on at grace-period initialization time.
	In fact, the ->qsmask bits are initialized by copying from
	the corresponding ->qsmaskinit bits.

o	The ->qsmaskinitnext bits indicate which CPUs are online at
	a given time, from an RCU perspective.	RCU will consider
	an incoming CPU to be online before cpu_online_mask does,
	and RCU will also consider an outgoing CPU to be online after
	cpu_online_mask says that it is gone.

o	Only the leaf rcu_node structures' ->qsmaskinitnext bits have
	meaning.  Thus, although the ->qsmaskinit bits are initialized
	for a given grace period based on changes in the values of
	the ->qsmaskinitnext bits since the beginning of the last
	grace period, for the non-leaf ->qsmaskinit bits, this cannot
	be a simple copy operation.  (Instead, changes are pushed up
	the rcu_node tree.)

	Why not instead make the CPU-hotplug operations push the
	changes up the tree?  Because this could race with grace-period
	initialization, in which case it would result in inconsistent
	values for the ->qsmaskinit bits, and thus of the ->qsmask bits.
	These inconsistent values can result in too-short grace periods
	on the one hand and grace-period hangs on the other.

o	The purpose of these three bitmasks is to decouple RCU's
	grace-period initialization from CPU-hotplug operations.
	The reason for this decoupling is that various pieces of the
	CPU-hotplug code wait on RCU grace periods, which means that
	RCU grace-period initialization cannot block CPU hotplug,
	at least not without risk of deadlock.

o	The initial state of all the ->qsmask bits must be consistent (at
	least in the case where no CPU tries to report a quiescent state
	until all these bits are fully initialized).  That is to say,
	if the initial value of a given ->qsmask is all zero (and there
	are no preempted tasks referenced by the ->gp_tasks pointer,
	see below), then the initial value of the corresponding bit of
	the parent rcu_node structure must also be zero.  Similarly, if
	the initial value of a given ->qsmask has at least one non-zero
	bit (or there is at least one task referenced by the ->gp_tasks
	pointer), then the initial value of the corresponding bit of
	the parent rcu_node structure must also be non-zero.  Of course,

	Violation of this rule was the impetus for several of the
	patches under discussion.

o	Only the CPU-hotplug code is allowed to change the values of
	the ->qsmaskinitnext bits.  The grace-period initialization code
	reads these bits, but so does code checking for offline CPUs.
	The rcu_node structure's ->lock must be held when changing
	these bits.

o	Only the grace-period initialization code is allowed to access the
	->qsmaskinit bits (give or take debug prints).	In theory, no lock
	is required when updating these bits, but in practice the rcu_node
	structure's ->lock is held due to the need to hold either the
	->qsmask or the ->qsmaskinitnext bits still while carrying out
	the access.

o	Only the grace-period initialization code is allowed to change
	the values of the ->qsmask bits from zero to one.  Only something
	reporting a quiescent state is allowed to change the values of
	the corresponding ->qsmask bits from one to zero.  If the
	->qsmask field of the root rcu_node structure becomes zero,
	the RCU grace-period kthread should be awakened (one execption
	being when that kthread cleared the last bit, and another where
	the root is also a leaf and there are tasks referenced by the
	->gp_tasks pointer, again, see below).

o	Grace-period initialization first scans the leaf rcu_node
	structures for CPU-hotplug transitions (locking each such
	structure along the way and updating their ->qsmaskinit field),
	then does a breadth-first copy of ->qsmaskinit to ->qsmask for
	each rcu_node structure.  CPUs can come and go at any point
	in this process.  In fact, one CPU can go and another
	(or even the same CPU) can come back during this part of
	grace-period initialization.

o	Leaf rcu_node structures must concern themselves with tasks
	preempted within an RCU-preempt read-side critical section as well
	as CPUs passing through quiescent states.  These preempted tasks
	are placed on the corresponding structure's ->blkd_tasks list.
	It is possible that a grace period might start such that a given
	leaf must wait on a preempted task (non-NULL ->gp_tasks pointer)
	but no online CPUs.  In addition, a CPU could come online
	while this state was in effect.  The boolean ->wait_blkd_tasks
	field helps keep all this straight, allowing that leaf rcu_node
	structure to be taken out of circulation once all its CPUs are
	offline -and- its ->blkd_tasks list has become empty.

o	As soon as a ->qsmask bit is set in a leaf rcu_node structure,
	the CPU corresponding to that bit might immediately start trying
	to report a quiescent state.  This means that the ->qsmask bits
	must be initialized from the root down, in a breadth-first
	fashion.  Otherwise, a quiescent-state report could cause the
	->qsmask bits to become inconsistent (if the report outran
	the bottom-to-top initialization).

o	In the past, the RCU grace-period kthread's force-quiescent-state
	scan would catch CPU-offline races, but the goal is to get rid
	of that fail-safe in order to reduce lock contention.  Therefore,
	if a given leaf rcu_node structure's ->qsmask field has a bit set,
	then that CPU needs to be online at least at the very beginning
	of the grace period.

The options I have considered are as follows:

1.	Stick with the no-failsafe approach, adding the lock as shown
	in this patch.  (I obviously prefer this approach.)

2.	Stick with the no-failsafe approach, but rely on RCU's grace-period
	kthread to wake up later due to its timed wait during the
	force-quiescent-state process.  This would be a bit obnoxious,
	as it requires passing a don't-wake flag (or some such) up the
	quiescent-state reporting mechanism.  It would also needlessly
	delay grace-period ends, especially on large systems (RCU scales
	up the FQS delay on larger systems to maintain limited CPU
	consumption per unit time).

3.	Stick with the no-failsafe approach, but have the quiescent-state
	reporting code hand back a value indicating that a wakeup is needed.
	Also a bit obnoxious, as this value would need to be threaded up
	the reporting code's return path.  Simple in theory, but a bit
	of an ugly change, especially for the many places in the code that
	currently expect quiescent-state reporting to be an unconditional
	fire-and-forget operation.

4.	Re-introduce the old fail-safe code, and don't report the
	quiescent state at CPU-offline time, relying on the fail-safe
	code to do so.	This also potentially introduces delays and can
	add extra FQS scans, which in turn increases lock contention.

So what did you have in mind?

								Thanx, Paul