linux-kernel - Re: Memory barrier needed with wake_up

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Date:   Tue, 6 Sep 2016 13:36:05 +0200
From:   Peter Zijlstra <peterz@...radead.org>
To:     Alan Stern <stern@...land.harvard.edu>
Cc:     "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
        Ingo Molnar <mingo@...hat.com>,
        Felipe Balbi <felipe.balbi@...ux.intel.com>,
        USB list <linux-usb@...r.kernel.org>,
        Kernel development list <linux-kernel@...r.kernel.org>,
        Will Deacon <will.deacon@....com>
Subject: Re: Memory barrier needed with wake_up_process()?

On Mon, Sep 05, 2016 at 11:29:26AM -0400, Alan Stern wrote:
> On Mon, 5 Sep 2016, Peter Zijlstra wrote:
> 
> > > Actually, seeing it written out like this, one realizes that it really 
> > > ought to be:
> > > 
> > > 	CPU 0				CPU 1
> > >         -----				-----
> > >         Start DMA			Handle DMA-complete irq
> > >         Sleep until bh->state		smp_mb()
> > > 					set bh->state
> > > 					Wake up CPU 0
> > > 	smp_mb()
> > > 	Compute rc based on contents of the DMA buffer
> > > 
> > > (Bear in mind also that on some platforms, the I/O operation is carried 
> > > out by PIO rather than DMA.)

> > Also, smp_mb() doesn't necessarily interact with MMIO/DMA at all IIRC.
> > Its only defined to do CPU/CPU interactions.
> 
> Suppose the DMA master finishes filling up an input buffer and issues a
> completion irq to CPU1.  Presumably the data would then be visible to
> CPU1 if the interrupt handler looked at it.  So if CPU1 executes smp_mb
> before setting bh->state and waking up CPU0, and if CPU0 executes
> smp_mb after testing bh->state and before reading the data buffer,
> wouldn't CPU0 then see the correct data in the buffer?  Even if CPU0 
> never did go to sleep?

Couple of notes here; I would expect the DMA master to make its stores
_globally_ visible on 'completion'. Because I'm not sure our smp_mb()
would help make it globally visible, since its only defined on CPU/CPU
interactions, not external.

Note that for example ARM has the distinction where smp_mb() uses
DMB-ISH barrier, dma_[rw]mb() uses DMB-OSH{LD,ST} and mb() uses DSB-SY.

The ISH domain is the Inner-SHarable (IIRC) and only includes CPUs. The
OSH is Outer-SHarable and adds external agents like DMA (also includes
CPUs). The DSB-SY thing is even heavier and syncs world or something; I
always forget these details.

> Or would something more be needed?

The thing is, this is x86 (TSO). Most everything is globally
ordered/visible and full barriers.

The only reorder allowed by TSO is a later read can happen before a
prior store. That is, only:

	X = 1;
	smp_mb();
	r = Y;

actually needs a full barrier. All the other variants are no-ops.

Note that x86's dma_[rw]mb() are no-ops (like the regular smp_[rw]mb()).
Which seems to imply DMA is coherent and globally visible.

> > I would very much expect the device IO stuff to order things for us in
> > this case. "Start DMA" should very much include sufficient fences to
> > ensure the data under DMA is visible to the DMA engine, this would very
> > much include things like flushing store buffers and maybe even writeback
> > caches, depending on platform needs.
> > 
> > At the same time, I would expect "Handle DMA-complete irq", even if it
> > were done with a PIO polling loop, to guarantee ordering against later
> > operations such that 'complete' really means that.
> 
> That's what I would expect too.
> 
> Back in the original email thread where the problem was first reported, 
> Felipe says that the problem appears to be something else.  Here's what 
> it looks like now, in schematic form:
> 
> 	CPU0
> 	----
> 	get_next_command():
> 	  while (bh->state != BUF_STATE_EMPTY)
> 		sleep_thread();
> 	  start an input request for bh
> 	  while (bh->state != BUF_STATE_FULL)
> 		sleep_thread();
> 
> As mentioned above, the input involves DMA and is terminated by an irq.
> The request's completion handler is bulk_out_complete():
> 
> 	CPU1
> 	----
> 	bulk_out_complete():
> 	  bh->state = BUF_STATE_FULL;
> 	  wakeup_thread();
> 
> According to Felipe, when CPU0 wakes up and checks bh->state, it sees a 
> value different from BUF_STATE_FULL.  So it goes back to sleep again 
> and doesn't make any forward progress.
> 
> It's possible that something else is changing bh->state when it 
> shouldn't.  But if this were the explanation, why would Felipe see that 
> the problem goes away when he changes the memory barriers in 
> sleep_thread() and wakeup_thread() to smp_mb()?

Good question, let me stare at the code more.

Could you confirm that bulk_{in,out}_complete() work on different
usb_request structures, and they can not, at any time, get called on the
_same_ request?