linux-kernel - Re: [RFC][PATCH 3/3] locking/qspinlock: Optimize for x86

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20180927074748.GA7939@andrea>
Date:   Thu, 27 Sep 2018 09:47:48 +0200
From:   Andrea Parri <andrea.parri@...rulasolutions.com>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     will.deacon@....com, mingo@...nel.org,
        linux-kernel@...r.kernel.org, longman@...hat.com,
        tglx@...utronix.de
Subject: Re: [RFC][PATCH 3/3] locking/qspinlock: Optimize for x86

On Thu, Sep 27, 2018 at 09:17:47AM +0200, Peter Zijlstra wrote:
> On Wed, Sep 26, 2018 at 10:52:08PM +0200, Andrea Parri wrote:
> > On Wed, Sep 26, 2018 at 01:01:20PM +0200, Peter Zijlstra wrote:
> > > On x86 we cannot do fetch_or with a single instruction and end up
> > > using a cmpxchg loop, this reduces determinism. Replace the fetch_or
> > > with a very tricky composite xchg8 + load.
> > > 
> > > The basic idea is that we use xchg8 to test-and-set the pending bit
> > > (when it is a byte) and then a load to fetch the whole word. Using
> > > two instructions of course opens a window we previously did not have.
> > > In particular the ordering between pending and tail is of interrest,
> > > because that is where the split happens.
> > > 
> > > The claim is that if we order them, it all works out just fine. There
> > > are two specific cases where the pending,tail state changes:
> > > 
> > >  - when the 3rd lock(er) comes in and finds pending set, it'll queue
> > >    and set tail; since we set tail while pending is set, the ordering
> > >    is split is not important (and not fundamentally different form
> > >    fetch_or). [*]
> > > 
> > >  - when the last queued lock holder acquires the lock (uncontended),
> > >    we clear the tail and set the lock byte. By first setting the
> > >    pending bit this cmpxchg will fail and the later load must then
> > >    see the remaining tail.
> > > 
> > > Another interesting scenario is where there are only 2 threads:
> > > 
> > > 	lock := (0,0,0)
> > > 
> > > 	CPU 0			CPU 1
> > > 
> > > 	lock()			lock()
> > > 	  trylock(-> 0,0,1)       trylock() /* fail */
> > > 	    return;               xchg_relaxed(pending, 1) (-> 0,1,1)
> > > 				  mb()
> > > 				  val = smp_load_acquire(*lock);
> > > 
> > > Where, without the mb() the load would've been allowed to return 0 for
> > > the locked byte.
> > 
> > If this were true, we would have a violation of "coherence":
> 
> The thing is, this is mixed size, see:

The accesses to ->val are not, and those certainly have to meet the
"coherence" constraint (no matter the store to ->pending).


> 
>   https://www.cl.cam.ac.uk/~pes20/popl17/mixed-size.pdf
> 
> If I remember things correctly (I've not reread that paper recently) it
> is allowed for:
> 
> 	old = xchg(pending,1);
> 	val = smp_load_acquire(*lock);
> 
> to be re-ordered like:
> 
> 	val = smp_load_acquire(*lock);
> 	old = xchg(pending, 1);
> 
> with the exception that it will fwd the pending byte into the later
> load, so we get:
> 
> 	val = (val & _Q_PENDING_MASK) | (old << _Q_PENDING_OFFSET);
> 
> for 'free'.
> 
> LKMM in particular does _NOT_ deal with mixed sized atomics _at_all_.

True, but it is nothing conceptually new to deal with: there're Cat
models that handle mixed-size accesses, just give it time.

  Andrea


> 
> With the addition of smp_mb__after_atomic(), we disallow the load to be
> done prior to the xchg(). It might still fwd the more recent pending
> byte from its store buffer, but at least the other bytes must not be
> earlier.