Date:	Mon, 9 Sep 2013 06:23:43 -0700
From:	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
To:	Peter Zijlstra <peterz@...radead.org>
Cc:	Frederic Weisbecker <fweisbec@...il.com>,
	Eric Dumazet <eric.dumazet@...il.com>,
	linux-kernel@...r.kernel.org, mingo@...e.hu, laijs@...fujitsu.com,
	dipankar@...ibm.com, akpm@...ux-foundation.org,
	mathieu.desnoyers@...icios.com, josh@...htriplett.org,
	niv@...ibm.com, tglx@...utronix.de, rostedt@...dmis.org,
	dhowells@...hat.com, edumazet@...gle.com, darren@...art.com,
	sbw@....edu, cl@...ux.com
Subject: Re: [PATCH] rcu: Is it safe to enter an RCU read-side critical
 section?

On Mon, Sep 09, 2013 at 12:53:47PM +0200, Peter Zijlstra wrote:
> On Fri, Sep 06, 2013 at 08:59:29PM +0200, Frederic Weisbecker wrote:
> > Imagine that you're running in an RCU read-side critical section on CPU 0, which
> > is not in an extended quiescent state. Now you get preempted in the middle of your
> > RCU read-side critical section (you called rcu_read_lock() but not yet rcu_read_unlock()).
> > 
> > Later on, the task is woken up to be scheduled on CPU 1. If CPU 1 is in an extended
> > quiescent state because it is running in userspace, it receives a scheduler IPI,
> > then schedule_user() is called at the end of the interrupt and in turn calls rcu_user_exit()
> > before the task resumes the code it was running on CPU 0, in the middle of
> > its RCU read-side critical section.
> > 
> > See, the key here is the rcu_user_exit() that restores the CPU's state in RCU's state machine.
> > There are other possible scheduler entry points when a CPU runs in a user extended quiescent
> > state: exception and syscall entries, or even preempt_schedule_irq() in case we receive an irq
> > in the kernel while we haven't yet reached the call to rcu_user_exit()... All of these should
> > be covered, otherwise you can bet RCU would be quick to warn.
> > 
> > That's why calling rcu_is_cpu_idle() from an RCU read-side critical section is legit even
> > though we can be preempted at any time around it.
> > And preempt_disable() is probably not even necessary, except perhaps if __get_cpu_var() itself
> > relies on non-preemptibility for the correctness of its address calculation.
> 
> I've tried reading that thrice now, still not making much sense.

Sorry, Frederic really is describing what is going on here.

And it really does work.
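
For reference, the scheduler entry point Frederic mentions looks roughly
like this (I am quoting kernel/sched/core.c from memory, so please treat
it as a sketch rather than the exact source):

	/* kernel/sched/core.c, under CONFIG_CONTEXT_TRACKING */
	asmlinkage void __sched schedule_user(void)
	{
		/*
		 * We can get here straight from userspace, where RCU
		 * believes this CPU to be in an extended quiescent
		 * state.  Tell RCU otherwise before doing anything
		 * else: user_exit() eventually calls rcu_user_exit().
		 */
		user_exit();
		schedule();
	}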

> In any case rcu_is_cpu_idle() is complete bollocks; either use
> __raw_get_cpu_var() and add a _coherent_ explanation for why it's right,
> or it's broken.

Hmmmm...  Adding Christoph Lameter on CC, since he was the one pushing
for the current formulation of that line of rcu_is_cpu_idle().
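
For everyone's reference, the line in question is in rcu_is_cpu_idle(),
which currently looks something like this (again quoting from memory,
so a sketch):

	int rcu_is_cpu_idle(void)
	{
		int ret;

		preempt_disable();
		ret = (atomic_read(&__get_cpu_var(rcu_dynticks).dynticks) & 0x1) == 0;
		preempt_enable();
		return ret;
	}

The ->dynticks counter is even exactly when RCU is ignoring the CPU,
which is why the low bit is tested.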

And guys, I have to say that the advice on which per-CPU primitive to use
varies wildly and randomly.  For all I know, each of you individually
might well be sticking to the same story, but taken together, your
collective advice strongly resembles white noise.

It is not that the primitives themselves are changing that quickly:
__raw_get_cpu_var() has been around for three years.

> In any case the preempt_disable/enable pair there is just plain wrong as
> Eric pointed out.

Peter, in the general case, you are quite correct.  But this is a special
case where it really does work.

The key point here is that preemption and migration cannot move a task
from a CPU to which RCU is paying attention to a CPU that RCU is ignoring.
So yes, by the time the task sees the return value from rcu_is_cpu_idle(),
that task might be running on some other CPU.  But that is OK, because
if RCU was paying attention to the old CPU, then RCU must also be paying
attention to the new CPU.

Frederic's description gives the details of how this is enforced.

Here is an example of how this works:

1.	Some task running on CPU 0 (which RCU is paying attention to)
	calls rcu_is_cpu_idle(), which disables preemption, checks the
	per-CPU variable, sets ret to zero, then enables preemption.

	At this point, the task is preempted by some high-priority task.

2.	CPU 1 is currently idle, so RCU is -not- paying attention to it.
	However, it is decided that our low-priority task should migrate
	to CPU 1.

3.	CPU 1 is sent an IPI, which forces this CPU out of idle.  This
	causes rcu_idle_exit() to be called, which causes RCU to start
	paying attention to CPU 1.

4.	CPU 1 switches to the low-priority task, which now sees the
	return value of rcu_is_cpu_idle().  Now, this return value did
	in fact reflect the old state of CPU 0, and the state of CPU 0
	might have changed.  (For example, the high-priority task might
	have blocked, so that CPU 0 is now idle, which in turn would
	mean that RCU is no longer paying attention to it, so that
	if rcu_is_cpu_idle() was called right now, it would return
	true rather than the false return computed in step 1 above.)

5.	But that is OK.  Because of the way RCU and idle interact,
	if a call from a given task to rcu_is_cpu_idle() returned false
	some time in the past, a call from that same task will also
	return false right now.
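
This one-way guarantee is all that the callers need.  For example,
rcu_read_lock_held() (again sketching from memory) only has to avoid
wrongly claiming that the lock is held:

	int rcu_read_lock_held(void)
	{
		if (!debug_lockdep_rcu_enabled())
			return 1;
		if (rcu_is_cpu_idle())
			return 0;
		/* A "not idle" answer above stays valid across migration. */
		if (!rcu_lockdep_current_cpu_online())
			return 0;
		return lock_is_held(&rcu_lock_map);
	}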

So yes, in general it is wrong to disable preemption, grab the value
of a per-CPU variable, re-enable preemption, and then return the result.
But there are a number of special cases where it is OK, and this is
one of them.
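
If a userspace analogy helps, the general problem looks like the
following (illustrative only, nothing to do with the kernel sources):

	#define _GNU_SOURCE
	#include <sched.h>
	#include <stdio.h>

	/*
	 * Snapshot a per-CPU fact.  By the time the caller looks at
	 * the result, the thread may have migrated, so in general the
	 * answer is stale.  The RCU case above is safe only because
	 * migration cannot change the answer.
	 */
	static int snapshot_cpu(void)
	{
		return sched_getcpu();
	}

	int main(void)
	{
		for (int i = 0; i < 4; i++)
			printf("snapshot said CPU %d, now on CPU %d\n",
			       snapshot_cpu(), sched_getcpu());
		return 0;
	}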

							Thanx, Paul

