[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20151112070915.GC6314@fixme-laptop.cn.ibm.com>
Date: Thu, 12 Nov 2015 15:14:51 +0800
From: Boqun Feng <boqun.feng@...il.com>
To: Oleg Nesterov <oleg@...hat.com>
Cc: Peter Zijlstra <peterz@...radead.org>, mingo@...nel.org,
linux-kernel@...r.kernel.org, paulmck@...ux.vnet.ibm.com,
corbet@....net, mhocko@...nel.org, dhowells@...hat.com,
torvalds@...ux-foundation.org, will.deacon@....com,
Michael Ellerman <mpe@...erman.id.au>,
Benjamin Herrenschmidt <benh@...nel.crashing.org>,
Paul Mackerras <paulus@...ba.org>
Subject: Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
On Wed, Nov 11, 2015 at 08:39:53PM +0100, Oleg Nesterov wrote:
> On 11/11, Peter Zijlstra wrote:
> >
> > On Wed, Nov 11, 2015 at 05:39:40PM +0800, Boqun Feng wrote:
> >
> > > Just be curious, should spin_unlock_wait() semantically be an ACQUIRE?
> >
> > I did wonder the same thing, it would simplify a number of things if
> > this were so.
>
> Yes, me too.
>
> Sometimes I even think it should have both ACQUIRE + RELEASE semantics.
> IOW, it should be "equivalent" to spin_lock() + spin_unlock().
>
> Consider this code:
>
> object_t *object;
> spinlock_t lock;
>
> void update(void)
> {
> object_t *o;
>
> spin_lock(&lock);
> o = READ_ONCE(object);
> if (o) {
> BUG_ON(o->dead);
> do_something(o);
> }
> spin_unlock(&lock);
> }
>
> void destroy(void) // can be called only once, can't race with itself
> {
> object_t *o;
>
> o = object;
> object = NULL;
>
> /*
> * pairs with lock/ACQUIRE. The next update() must see
> * object == NULL after spin_lock();
> */
> smp_mb();
>
> spin_unlock_wait(&lock);
>
> /*
> * pairs with unlock/RELEASE. The previous update() has
> * already passed BUG_ON(o->dead).
> *
> * (Yes, yes, in this particular case it is not needed,
> * we can rely on the control dependency).
> */
> smp_mb();
>
> o->dead = true;
> }
>
> I believe the code above is correct and it needs the barriers on both sides.
>
Hmm.. probably incorrect.. because the ACQUIRE semantics of spin_lock()
only guarantees that the memory operations following spin_lock() can't
be reorder before the *LOAD* part of spin_lock() not the *STORE* part,
i.e. the case below can happen(assuming the spin_lock() is implemented
as ll/sc loop)
spin_lock(&lock):
r1 = *lock; // LL, r1 == 0
o = READ_ONCE(object); // could be reordered here.
*lock = 1; // SC
This could happen because of the ACQUIRE semantics of spin_lock(), and
the current implementation of spin_lock() on PPC allows this happen.
(Cc PPC maintainers for their opinions on this one)
Therefore the case below can happen:
CPU 1 CPU 2 CPU 3
================== ==================== ==============
spin_unlock(&lock);
spin_lock(&lock):
r1 = *lock; // r1 == 0;
o = READ_ONCE(object); // reordered here
object = NULL;
smp_mb();
spin_unlock_wait(&lock);
*lock = 1;
smp_mb();
o->dead = true;
if (o) // true
BUG_ON(o->dead); // true!!
To show this, I also translate this situation into a PPC litmus for
herd[1]:
PPC spin-lock-wait
"
r1: local variable of lock
r2: constant 1
r3: constant 0 or NULL
r4: local variable of object, i.e. o
r5: local variable of *o (simulate ->dead as I didn't know
how to write fields of structure in herd ;-()
r13: the address of lock, i.e. &lock
r14: the address of object, i.e. &object
"
{
0:r1=0;0:r2=1;0:r3=0;0:r13=lock;0:r14=object;
1:r1=0;1:r2=1;1:r3=0;1:r4=0;1:r5=0;1:r13=lock;1:r14=object;
2:r1=0;2:r13=lock;
lock=1; object=old; old=0;
}
P0 | P1 | P2 ;
ld r4,0(r14) | Lock: | stw r1,0(r13);
std r3,0(r14) | lwarx r1,r3,r13 | ;
| cmpwi r1,0 | ;
sync | bne Lock | ;
| stwcx. r2,r3,r13 | ;
Wait: | bne Lock | ;
lwz r1,0(r13) | lwsync | ;
cmpwi r1,0 | ld r4,0(r14) | ;
bne Wait | cmpwi r4,0 | ;
| beq Fail | ;
sync | lwz r5, 0(r4) | ;
stw r2,0(r4) | Fail: | ;
| lwsync | ;
| stw r3, 0(r13) | ;
exists
(1:r4=old /\ 1:r5=1)
,whose result says that (1:r4=old /\ 1:r5=1) can happen:
Test spin-lock-wait Allowed
States 3
1:r4=0; 1:r5=0;
1:r4=old; 1:r5=0;
1:r4=old; 1:r5=1;
Loop Ok
Witnesses
Positive: 18 Negative: 108
Condition exists (1:r4=old /\ 1:r5=1)
Observation spin-lock-wait Sometimes 18 108
Hash=244f8c0f91df5a5ed985500ed7230272
Please note that I use backwards jump in this litmus, which is only
supported by herd(not by ppcmem[2]), and it will take a while to get the
result. And I'm not that confident that I'm familiar with this tool,
maybe Paul and Will can help check my translate and usage here ;-)
IIUC, the problem here is that spin_lock_wait() can be implemented with
only LOAD operations, and to have a RELEASE semantics, one primivite
must have a *STORE* part, therefore spin_lock_wait() can not be a
RELEASE.
So.. we probably need to take the lock here.
> If it is wrong, then task_work_run() is buggy: it relies on mb() implied by
> cmpxchg() before spin_unlock_wait() the same way: the next task_work_cancel()
> should see the result of our cmpxchg(), it must not try to read work->next or
> work->func.
>
> Hmm. Not sure I really understand what I am trying to say... Perhaps in fact
> I mean that unlock_wait() should be removed because it is too subtle for me ;)
>
;-)
I think it's OK for it as an ACQUIRE(with a proper barrier) or even just
a control dependency to pair with spin_unlock(), for example, the
following snippet in do_exit() is OK, except the smp_mb() is redundant,
unless I'm missing something subtle:
/*
* The setting of TASK_RUNNING by try_to_wake_up() may be delayed
* when the following two conditions become true.
* - There is race condition of mmap_sem (It is acquired by
* exit_mm()), and
* - SMI occurs before setting TASK_RUNINNG.
* (or hypervisor of virtual machine switches to other guest)
* As a result, we may become TASK_RUNNING after becoming TASK_DEAD
*
* To avoid it, we have to wait for releasing tsk->pi_lock which
* is held by try_to_wake_up()
*/
smp_mb();
raw_spin_unlock_wait(&tsk->pi_lock);
/* causes final put_task_struct in finish_task_switch(). */
tsk->state = TASK_DEAD;
Ref:
[1]: https://github.com/herd/herdtools
[2]: http://www.cl.cam.ac.uk/~pes20/ppcmem/index.html
Regards,
Boqun
Download attachment "signature.asc" of type "application/pgp-signature" (474 bytes)
Powered by blists - more mailing lists