Message-Id: <336901377100289@web16f.yandex.ru>
Date: Wed, 21 Aug 2013 19:51:29 +0400
From: Alexander Fyodorov <halcy@...dex.ru>
To: Waiman Long <waiman.long@...com>
Cc: linux-kernel <linux-kernel@...r.kernel.org>,
"Chandramouleeswaran, Aswin" <aswin@...com>,
"Norton, Scott J" <scott.norton@...com>,
Peter Zijlstra <peterz@...radead.org>,
Steven Rostedt <rostedt@...dmis.org>,
Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>
Subject: Re: [PATCH RFC v2 1/2] qspinlock: Introducing a 4-byte queue spinlock implementation
21.08.2013, 07:01, "Waiman Long" <waiman.long@...com>:
> On 08/20/2013 11:31 AM, Alexander Fyodorov wrote:
>> Isn't a race possible if another thread acquires the spinlock in the window
>> between setting lock->locked to 0 and issuing smp_wmb()? Writes from
>> the critical section from our thread might be delayed behind the write to
>> lock->locked if the corresponding cache bank is busy.
>
> The purpose of smp_wmb() is to make sure that content in the cache will
> be flushed out to memory in case the cache coherency protocol cannot
> guarantee a single view of memory content on all processors.
The Linux kernel does not support architectures without cache coherency. Using memory barriers merely to flush write buffers ASAP on cache-coherent processors might benefit performance on some architectures, but it will hurt it on others, so it must not be done in architecture-independent code.
> In other
> words, smp_wmb() is used to make other processors see that the lock has
> been freed ASAP. If another processor sees that before smp_wmb(), it will
> be even better as the latency is reduced. As the lock holder is the only
> one that can release the lock, there is no race condition here.
No, I was talking about the window between freeing the lock and issuing smp_wmb(). What I meant is:
1) A = 0
2) CPU0 locks the spinlock protecting A.
3) CPU0 writes 1 to A, but the write gets stalled because the corresponding cache bank is busy.
4) CPU0 calls spin_unlock() and sets lock->locked to 0.
If CPU1 does a spin_lock() right now, it will succeed (since lock->locked == 0). But the write to A hasn't reached the cache yet, so CPU1 will see A == 0.
More examples on this are in Documentation/memory-barriers.txt
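To make the ordering requirement concrete, here is a minimal sketch (not your actual qspinlock code; the qsl_sketch structure and function names are made up for illustration) of an unlock that is broken in exactly this way, next to a variant that puts the barrier before the store freeing the lock:

struct qsl_sketch {
        u8 locked;      /* 1 = held, 0 = free */
};

/* Broken: the store freeing the lock may become visible to CPU1 before
 * the critical-section stores (A = 1), so CPU1 can take the lock and
 * still see A == 0.
 */
static inline void qsl_unlock_broken(struct qsl_sketch *lock)
{
        ACCESS_ONCE(lock->locked) = 0;
        smp_wmb();              /* too late: the window is already open */
}

/* Fixed: all prior stores (and, with smp_mb(), prior loads as well)
 * are ordered before the store that releases the lock.
 */
static inline void qsl_unlock(struct qsl_sketch *lock)
{
        smp_mb();
        ACCESS_ONCE(lock->locked) = 0;
}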
> That is a legitimate question. I don't think it is a problem on x86 as
> the x86 spinlock code doesn't do a full mb() in the lock and unlock
> paths.
It does, because the "lock" prefix implies a full memory barrier.
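As an illustration (a sketch, not the exact kernel source): on 32-bit x86 without SSE2, mb() is implemented as a lock-prefixed no-op read-modify-write on the stack, precisely because the "lock" prefix already gives full-barrier semantics:

#define sketch_mb()     asm volatile("lock; addl $0,0(%%esp)" ::: "memory")

And since the ticket-lock fast path takes the lock with a lock-prefixed xadd, the acquire side gets the same full barrier for free.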
> The smp_mb() will be conditionalized depending on the ARCH_QSPINLOCK
> setting. The smp_wmb() may not be needed, but a compiler barrier should
> still be there.
Do you mean because of inlining? That shouldn't be a problem, because smp_mb() prohibits the compiler from moving or optimizing memory accesses across the barrier (thanks to the volatile asm and its "memory" clobber).
More on this in Documentation/memory-barriers.txt
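For reference, a sketch of the usual definitions (the sketch_* names are mine, not copied from any particular tree): the compiler barrier is an empty volatile asm with a "memory" clobber, which forbids GCC from caching memory values in registers or moving loads/stores across it, and smp_mb() carries the same clobber in addition to the real fence instruction:

#define sketch_barrier()        asm volatile("" ::: "memory")
#define sketch_smp_mb()         asm volatile("mfence" ::: "memory")     /* x86-64 flavour */

So even when the lock/unlock functions are inlined, the compiler cannot move anything from the critical section past the barrier.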
>> Also I don't understand why there are so many uses of ACCESS_ONCE()
>> macro. It does not guarantee memory ordering with regard to other CPUs,
>> so probably most of the uses can be removed (with exception of
>> lock_is_contended(), where it prohibits gcc from optimizing the loop away).
>
> All the lock/unlock code can be inlined and we don't know what the
> compiler will do to optimize code. The ACCESS_ONCE() macro is used to
> make sure that the compiler won't optimize away the actual fetch or
> write of the memory. Even if the compiler won't optimize away the memory
> access, adding the ACCESS_ONCE() macro won't have any extra overhead. So
> a more liberal use of it won't hurt performance.
If the compiler optimized the memory access away, it would be a bug. And I'm not so sure about the overhead... For example, on some VLIW architectures ACCESS_ONCE() might prohibit the compiler from mixing other instructions into the unlock sequence.
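For reference, ACCESS_ONCE() is just a volatile cast (this is its include/linux/compiler.h definition at the time), so it forces the compiler to actually emit the load or store but provides no ordering against other CPUs:

#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

The place where it is genuinely needed is a spin-wait loop such as the one below (lock->locked here just stands for whatever field the queue code polls):

        /* Without ACCESS_ONCE() the compiler may hoist the load out of
         * the loop and spin forever on a stale register copy.
         */
        while (ACCESS_ONCE(lock->locked))
                cpu_relax();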