Message-ID: <20070708043228.GB22397@wotan.suse.de>
Date:	Sun, 8 Jul 2007 06:32:28 +0200
From:	Nick Piggin <npiggin@...e.de>
To:	Linus Torvalds <torvalds@...ux-foundation.org>
Cc:	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: queued spinlock code and results

I ran some tests of the queued spinlock code, using userspace test code on
64-bit processors. I believe the xadd-based code no longer has any theoretical
memory-ordering problems.

The tests were done on 3 different architectures of different speeds and
vintages (so the columns are not really comparable with one another). I have
attached the two programs I used to produce the results. Times are total
elapsed time divided by the number of locks (so effectively you get the time
required for one lock+unlock pair).
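
To give an idea of the methodology, here's a minimal sketch of the kind of
single-threaded timing loop involved (the real harness is in the attached
lock.c; the iteration count and clock source here are illustrative):

#include <stdio.h>
#include <time.h>

#define NR_ITERS 10000000UL

/* lock(), unlock() and LOCK_INIT as defined at the end of this mail */

int main(void)
{
	short l = LOCK_INIT;
	struct timespec start, end;
	unsigned long i;
	double ns;

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (i = 0; i < NR_ITERS; i++) {
		lock(&l);
		unlock(&l);
	}
	clock_gettime(CLOCK_MONOTONIC, &end);

	/* total elapsed time divided by the number of locks */
	ns = (end.tv_sec - start.tv_sec) * 1e9
			+ (end.tv_nsec - start.tv_nsec);
	printf("%.1fns per lock+unlock\n", ns / NR_ITERS);
	return 0;
}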

The threaded results also include an unfairness count: the maximum number of
times in a row that the lock was acquired by one thread while all the other
threads were also executing in the loop. The reason xadd, for example, is not
always 0 there is that the other threads may not have reached the lock before
the current thread was able to take it several times (eg. this can happen if
an interrupt comes in).
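
For reference, the accounting is conceptually something like the following
(a hypothetical sketch -- the real code is in the attached lock-threads.c,
and additionally only counts once all threads have entered the loop; the
names here are made up):

/* Shared state, only ever written while holding the lock under test. */
static int last_owner = -1;
static int streak, max_streak;

/* Called by thread 'tid' immediately after it acquires the lock. */
static void account_unfairness(int tid)
{
	if (tid == last_owner)
		streak++;		/* same thread won the lock again */
	else
		streak = 1;
	last_owner = tid;
	if (streak > max_streak)
		max_streak = streak;	/* reported in parentheses below */
}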

The threaded results are actually not too interesting, for two reasons.
Firstly, they don't really simulate the right mixture and timing of locks, or
of critical-section time vs parallel time. Secondly, the times are pretty
heavily skewed toward the unfair locks, which tend to be _really_ unfair (eg.
on the last test, some threads took a full 2x longer to finish than others),
so each thread gets to run for a long time without cache misses and the
overall throughput goes up. The dual-core Core2, for another example,
regularly starved each thread for up to about 7*ms* (13 million cycles) while
the other was taking and releasing the lock.

However, the non-threaded numbers (first 4 rows) are reasonably good. I found
on the Core2 that if the inner loop was simply a lock+unlock and nothing else,
the inc-lock gained a bit more over the xadd-lock, but I don't think this is
representative of real code; it may just have been some alignment or
scheduling easter egg.

The results show that the xadd lock is definitely slower in the single-thread
case; however, the Core2 does really well, and the cache-cold penalty is
pretty insignificant.

It suggests to me that it *may* be worth trying a wholesale replacement of
spinlock with queued spinlock on x86 (starting as a config option, which we
could perhaps default to ON in -mm / early -rcs). I think the main risk is
that some code is really relying on the unfair batching nature of the current
locks for good performance. I would argue that such code is usually poor code
(eg. taking and dropping the spinlock too often, too much contention on the
lock, or using the wrong type of lock in the first place) -- I think cacheline
batching is great for regular lines, but not quite so nice for locks. It would
be interesting to see how something like the dcache lock goes in dcache-heavy
workloads...


(times are per lock+unlock; numbers in parentheses are unfairness counts)

                         | Core2           Pentium 4 (Nocona)  Opteron
inc in cache             |  21.4ns           38.7ns              11.8ns
xadd in cache            |  22.5ns           44.4ns              14.5ns
                         |
inc no cache             | 122.2ns          153.5ns             172.7ns
xadd no cache            | 123.9ns          154.7ns             174.0ns
                         |
inc 2 thread (unfair)    | 230.9ns (69211)  376.5ns (3104)      174.0ns (2068)
xadd 2 thread (unfair)   | 273.2ns (0)      307.1ns (59)        743.8ns (220)
                         |
inc 2 thread, other skt  |                  144.8ns (20444)     145.7ns (39739)
xadd 2 thread, other skt |                  306.8ns (147)       748.0ns (488)
                         |
inc 4 thread             |                 1248.9ns (172)      1158.5ns (139410)
xadd 4 thread            |                 1014.0ns (0)        2843.5ns (0)
                         |
inc 4 thread, diff skts  |                                     1598.0ns (65727)
xadd 4 thread, diff skts |                                     2848.1ns (0)
                         |
inc 16 thread            |                                     19017ns (55698)
xadd 16 thread           |                                     42723ns (0)



#define LOCK_INIT 1

/*
 * Simple unfair spinlock: the byte is 1 when the lock is free.
 * Atomically decrement it; if the result is non-negative we got the
 * lock, otherwise spin (without bus-locked operations) until the byte
 * looks positive again and retry the decrement.
 */
static void lock(short *lock)
{
	__asm__ __volatile__ ("1:\n\t"
			      "lock ; decb %0\n\t"
			      "jns 2f\n\t"		/* was free -> acquired */
			      "3:\n\t"
			      "rep ; nop\n\t"		/* pause in the spin loop */
			      "cmpb $0,%0\n\t"
			      "jle 3b\n\t"		/* still held, keep spinning */
			      "jmp 1b\n\t"		/* looks free, retry the decb */
			      "2:\n\t"
			    : "+m" (*lock)
			    : : "memory");
}

/* Release: a plain byte store of 1 is sufficient on x86. */
static void unlock(short *lock)
{
	__asm__ __volatile__ ("movb $1,%0\n\t"
			    : "+m" (*lock)
			    : : "memory");
}

#define XLOCK_INIT 0

/*
 * Queued (ticket) lock: the low byte of the word is the "now serving"
 * number, the high byte is the next ticket.  The xadd atomically takes
 * a ticket (the old high byte ends up in %ah) while bumping the
 * next-ticket counter; we then spin reloading the serving byte into
 * %al until it matches our ticket.
 */
static void xlock(short *lock)
{
	short i = 0x0100;	/* adds 1 to the ticket (high) byte */
	__asm__ __volatile__ ("lock ; xaddw %%ax, %1\n\t"
			      "1:\n\t"
			      "cmpb %%ah, %%al\n\t"	/* our turn yet? */
			      "je 2f\n\t"
			      "rep ; nop\n\t"
			      "movb %1, %%al\n\t"	/* reload "now serving" */
			      "lfence\n\t"
			      "jmp 1b\n\t"
			      "2:\n\t"
			    : "+a" (i), "+m" (*lock)
			    : : "memory");
}

/* Release: advance "now serving" to admit the next ticket holder. */
static void xunlock(short *lock)
{
	__asm__ __volatile__ ("incb %0\n\t"
			    : "+m" (*lock)
			    : : "memory");
}
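
To illustrate the ticket scheme: the 16-bit word holds the "now serving"
number in its low byte and the next ticket in its high byte, so with two
CPUs the word evolves like this (values worked through from the code above):

/*
 * state = (next_ticket << 8) | now_serving, starting from XLOCK_INIT
 *
 * initial                 : 0x0000  (next=0, serving=0)
 * CPU A: lock xadd 0x0100 -> 0x0100, A gets ah=0, al=0 -> A holds the lock
 * CPU B: lock xadd 0x0100 -> 0x0200, B gets ah=1, al=0 -> B spins
 * CPU A: incb   (xunlock) -> 0x0201  (serving=1)
 * CPU B: reloads al=1 == ah         -> B holds the lock: strict FIFO order
 */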

---


Attachment: lock.c (text/x-c++src, 3610 bytes)
Attachment: lock-threads.c (text/x-c++src, 3910 bytes)
