Message-ID: <Pine.LNX.4.44L0.1609021526420.2027-100000@iolanthe.rowland.org>
Date: Fri, 2 Sep 2016 16:16:54 -0400 (EDT)
From: Alan Stern <stern@...land.harvard.edu>
To: Peter Zijlstra <peterz@...radead.org>
cc: "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
Ingo Molnar <mingo@...hat.com>,
Felipe Balbi <felipe.balbi@...ux.intel.com>,
USB list <linux-usb@...r.kernel.org>,
Kernel development list <linux-kernel@...r.kernel.org>
Subject: Re: Memory barrier needed with wake_up_process()?
On Fri, 2 Sep 2016, Peter Zijlstra wrote:
> On Fri, Sep 02, 2016 at 02:10:13PM -0400, Alan Stern wrote:
> > Paul, Peter, and Ingo:
> >
> > This must have come up before, but I don't know what was decided.
> >
> > Isn't it often true that a memory barrier is needed before a call to
> > wake_up_process()? A typical scenario might look like this:
> >
> > CPU 0
> > -----
> > 	for (;;) {
> > 		set_current_state(TASK_INTERRUPTIBLE);
> > 		if (signal_pending(current))
> > 			break;
> > 		if (wakeup_flag)
> > 			break;
> > 		schedule();
> > 	}
> > 	__set_current_state(TASK_RUNNING);
> > 	wakeup_flag = 0;
> >
> >
> > CPU 1
> > -----
> > 	wakeup_flag = 1;
> > 	wake_up_process(my_task);
> >
> > The underlying pattern is:
> >
> > 	CPU 0				CPU 1
> > 	-----				-----
> > 	write current->state		write wakeup_flag
> > 	smp_mb();
> > 	read wakeup_flag		read my_task->state
> >
> > where set_current_state() does the write to current->state and
> > automatically adds the smp_mb(), and wake_up_process() reads
> > my_task->state to see whether the task needs to be woken up.
> >
> > The kerneldoc for wake_up_process() says that it has no implied memory
> > barrier if it doesn't actually wake anything up. And even when it
> > does, the implied barrier is only smp_wmb, not smp_mb.
> >
> > This is the so-called SB (Store Buffer) pattern, which is well known to
> > require a full smp_mb on both sides. Since wake_up_process() doesn't
> > include smp_mb(), isn't it correct that the caller must add it
> > explicitly?
> >
> > In other words, shouldn't the code for CPU 1 really be:
> >
> > 	wakeup_flag = 1;
> > 	smp_mb();
> > 	wake_up_process(task);
> >
>
> No, it doesn't need to do that. try_to_wake_up() does the right thing.
>
> It does:
>
> 	smp_mb__before_spinlock();
> 	raw_spin_lock_irqsave(&p->pi_lock, flags);
>
> Now, smp_mb__before_spinlock() is a bit of an odd duck, if you look at
> its comment it says:
>
> /*
>  * Despite its name it doesn't necessarily have to be a full barrier.
>  * It should only guarantee that a STORE before the critical section
>  * can not be reordered with LOADs and STOREs inside this section.
>  * spin_lock() is the one-way barrier, this LOAD can not escape out
>  * of the region. So the default implementation simply ensures that
>  * a STORE can not move into the critical section, smp_wmb() should
>  * serialize it with another STORE done by spin_lock().
>  */
> #ifndef smp_mb__before_spinlock
> #define smp_mb__before_spinlock()	smp_wmb()
> #endif
I see. So the kerneldoc for wake_up_process() is misleading at best.
> So per default it ends up being:
>
> 	WMB
> 	LOCK
>
> Which is sufficient to order the prior store vs the later load as is
> required. Note that a spinlock acquire _must_ imply a store (we need to
> mark the lock as taken), therefore the prior store is ordered against
> the lock store per the wmb, and since the lock must imply an ACQUIRE
> that limits the load.
Actually, that's not entirely true (although presumably it works okay
for most architectures). In theory, the first write and the WMB can
move down after the lock's ACQUIRE. Then the later read could move
earlier, before the lock's store, the WMB, and the first write, but
still after the ACQUIRE. Thus the later read could be reordered with
the first write.
In other words, these actions:

	store wakeup_flag
	WMB
	load-ACQUIRE lock
	store lock
	read task->state

can be reordered to:

	load-ACQUIRE lock
	store wakeup_flag
	WMB
	store lock
	read task->state

and then to:

	load-ACQUIRE lock
	read task->state
	store wakeup_flag
	WMB
	store lock

without violating any ordering rules.
> Now, PowerPC defines smp_mb__before_spinlock as smp_mb(), and this is
> because PowerPC ACQUIRE is a bit of an exception, if you want more
> details I'm sure I or Paul can dredge them up :-)
PPC can't do the reordering described above, but it does have other
weaknesses. Without smp_mb(), the write to wakeup_flag doesn't have to
propagate from one CPU to the other before the second CPU tries to read
it.
Felipe, your tests will show whether my guess was totally off-base.
For the new people, Felipe is tracking down a problem that involves
exactly the code sequence listed at the top of the email, where we know
that the wakeup routine runs but nevertheless the task sleeps. At
least, that's what it looks like at the moment.
Alan Stern