linux-kernel - Re: [PATCH tip/sched/core] Add comments to aid in safer usage of swake

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <20170614155531.GO3721@linux.vnet.ibm.com>
Date:   Wed, 14 Jun 2017 08:55:31 -0700
From:   "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
To:     Steven Rostedt <rostedt@...dmis.org>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        Krister Johansen <kjlx@...pleofstupid.com>,
        Ingo Molnar <mingo@...nel.org>, linux-kernel@...r.kernel.org,
        Paul Gortmaker <paul.gortmaker@...driver.com>,
        Thomas Gleixner <tglx@...utronix.de>
Subject: Re: [PATCH tip/sched/core] Add comments to aid in safer usage of
 swake_up.

On Wed, Jun 14, 2017 at 09:10:15AM -0400, Steven Rostedt wrote:
> On Tue, 13 Jun 2017 20:58:43 -0700
> "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com> wrote:
> 
> > And here is the part you also need to look at:
> 
> Why? We are talking about two different, unrelated variables modified
> on two different CPUs. I don't see where the overlap is.

It does sound like we are talking past each other.

Please see below for how I was interpreting your sequence of events.

> > ====
> > 
> >  (*) Overlapping loads and stores within a particular CPU will appear to be
> >      ordered within that CPU.  This means that for:
> > 
> > 	a = READ_ONCE(*X); WRITE_ONCE(*X, b);
> > 
> >      the CPU will only issue the following sequence of memory operations:
> > 
> > 	a = LOAD *X, STORE *X = b
> > 
> >      And for:
> > 
> > 	WRITE_ONCE(*X, c); d = READ_ONCE(*X);
> > 
> >      the CPU will only issue:
> > 
> > 	STORE *X = c, d = LOAD *X
> > 
> >      (Loads and stores overlap if they are targeted at overlapping pieces of
> >      memory).
> > 
> > ====
> > 
> > This section needs some help -- the actual guarantee is stronger, that
> > all CPUs will agree on the order of volatile same-sized aligned accesses
> > to a given single location.  So if a previous READ_ONCE() sees the new
> > value, any subsequent READ_ONCE() from that same variable is guaranteed
> > to also see the new value (or some later value).
> > 
> > Does that help, or am I missing something here?
> 
> Maybe I'm missing something. Let me rewrite what I first wrote, and
> then abstract it into a simpler version:
> 
> Here's what I first wrote:
> 
> (looking at __call_rcu_core() and rcu_gp_kthread()
> 
> 	CPU0				CPU1
> 	----				----
> 				__call_rcu_core() {
> 
> 				 spin_lock(rnp_root)
> 				 need_wake = __rcu_start_gp() {
> 				  rcu_start_gp_advanced() {
> 				   gp_flags = FLAG_INIT
> 				  }
> 				 }
> 
>  rcu_gp_kthread() {
>    swait_event_interruptible(wq,
> 	gp_flags & FLAG_INIT) {

This is the first access to ->gp_flags from rcu_gp_kthread().

>    spin_lock(q->lock)
> 
> 				*fetch wq->task_list here! *
> 
>    list_add(wq->task_list, q->task_list)
>    spin_unlock(q->lock);
> 
>    *fetch old value of gp_flags here *

This is the second access to ->gp_flags.

Since you are saying that ->gp_flags is only accessed once, perhaps
this code from spin_lock() down is intended to be an expansion of
swait_event_interruptible()?

#define swait_event_interruptible(wq, condition)			\
({									\
	int __ret = 0;							\
	if (!(condition))						\
		__ret = __swait_event_interruptible(wq, condition);	\
	__ret;								\
})

But no, in this case, we have the macro argument named "condition"
accessing ->gp_flags, and a control dependency forcing that access to
precede the spin_lock() in __prepare_to_swait().  We cannot acquire the
spinlock unless the condition is false, that is, the old value is fetched.
So there is a first fetch of ->gp_flags that is constrained to happen
before the spin_lock().  Any fetch of ->gp_flags after the spin_unlock()
must therefore be a second fetch.  Which of course might still get the
old value because the update to ->gp_flags might not have propagated yet.

But it appears that you are worried about something else.

> 				 spin_unlock(rnp_root)
> 
> 				 rcu_gp_kthread_wake() {
> 				  swake_up(wq) {
> 				   swait_active(wq) {
> 				    list_empty(wq->task_list)

We don't hold q->lock here, so I am guessing that your concern is that
we aren't guaranteed to see the above list_add().

Is that the case?

If so, your suggested fix is to place an smp_mb() between
swait_event_interruptible()'s access to "condition" and
__prepare_to_swait()'s list_add(), correct?  And also an
smp_mb() before swake_up()'s call to swait_active(), correct?

The second smp_mb() could be placed by the user, but the first
one cannot, at least not reasonably.

So did I get the point eventually?  ;-)

							Thanx, Paul

> 				   } * return false *
> 
>   if (condition) * false *
>     schedule();
> 
> 
> Now let's make it simpler. I'll even add the READ_ONCE and WRITE_ONCE
> where applicable.
> 
> 
> 	CPU0				CPU1
> 	----				----
> 				LOCK(A)
> 
>  LOCK(B)
> 				 WRITE_ONCE(X, INIT)
> 
> 				 (the cpu may postpone writing X)
> 
> 				 (the cpu can fetch wq list here)
>   list_add(wq, q)
> 
>  UNLOCK(B)
> 
>  (the cpu may fetch old value of X)
> 
> 				 (write of X happens here)
> 
>  if (READ_ONCE(X) != init)
>    schedule();
> 
> 				UNLOCK(A)
> 
> 				 if (list_empty(wq))
> 				   return;
> 
> Tell me again how the READ_ONCE() and WRITE_ONCE() helps in this
> scenario?
> 
> Because we are using spinlocks, this wont be an issue for most
> architectures. The bug happens if the fetching of the list_empty()
> leaks into before the UNLOCK(A).
> 
> If the reading/writing of the list and the reading/writing of gp_flags
> gets reversed in either direction by the CPU, then we have a problem.
> 
> -- Steve
>