linux-kernel - Re: [PATCH] SLUB use cmpxchg

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20070821233938.GD29691@Krystal>
Date:	Tue, 21 Aug 2007 19:39:38 -0400
From:	Mathieu Desnoyers <mathieu.desnoyers@...ymtl.ca>
To:	Christoph Lameter <clameter@....com>
Cc:	akpm@...ux-foundation.org, linux-kernel@...r.kernel.org,
	mingo@...hat.com
Subject: Re: [PATCH] SLUB use cmpxchg_local

* Christoph Lameter (clameter@....com) wrote:
> On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:
> 
> > SLUB Use cmpxchg() everywhere.
> > 
> > It applies to "SLUB: Single atomic instruction alloc/free using
> > cmpxchg".
> 
> > +++ slab/mm/slub.c	2007-08-20 18:42:28.000000000 -0400
> > @@ -1682,7 +1682,7 @@ redo:
> >  
> >  	object[c->offset] = freelist;
> >  
> > -	if (unlikely(cmpxchg_local(&c->freelist, freelist, object) != freelist))
> > +	if (unlikely(cmpxchg(&c->freelist, freelist, object) != freelist))
> >  		goto redo;
> >  	return;
> >  slow:
> 
> Ok so regular cmpxchg, no cmpxchg_local. cmpxchg_local does not bring 
> anything more? My measurements did not show any difference. I measured on 
> Athlon64. What processor is being used?
> 

This patch only cleans up the tree before proposing my cmpxchg_local
changes. There was an inconsistent use of cmpxchg/cmpxchg_local there.

Using cmpxchg_local vs cmpxchg has a clear impact on the fast paths, as
shown below: it saves about 60 to 70 cycles for kmalloc and 200 cycles
for the kmalloc/kfree pair (test 2).

Pros :
- we can use barrier() instead of rmb()
- cmpxchg_local is faster

Con :
- we must disable preemption

I use a 3GHz Pentium 4 for my tests.

Results (compared to cmpxchg_local numbers) :

SLUB Performance testing
========================
1. Kmalloc: Repeatedly allocate then free test
(kfree here is slow path)

* cmpxchg
kmalloc(8) = 271 cycles	    kfree = 645 cycles
kmalloc(16) = 158 cycles	  kfree = 428 cycles
kmalloc(32) = 153 cycles	  kfree = 446 cycles
kmalloc(64) = 178 cycles	  kfree = 459 cycles
kmalloc(128) = 247 cycles	  kfree = 481 cycles
kmalloc(256) = 363 cycles	  kfree = 605 cycles
kmalloc(512) = 449 cycles	  kfree = 677 cycles
kmalloc(1024) = 626 cycles	kfree = 810 cycles
kmalloc(2048) = 681 cycles	kfree = 869 cycles
kmalloc(4096) = 471 cycles	kfree = 575 cycles
kmalloc(8192) = 666 cycles	kfree = 747 cycles
kmalloc(16384) = 736 cycles	kfree = 853 cycles

* cmpxchg_local
kmalloc(8) = 83 cycles      kfree = 363 cycles
kmalloc(16) = 85 cycles     kfree = 372 cycles
kmalloc(32) = 92 cycles     kfree = 377 cycles
kmalloc(64) = 115 cycles    kfree = 397 cycles
kmalloc(128) = 179 cycles   kfree = 438 cycles
kmalloc(256) = 314 cycles   kfree = 564 cycles
kmalloc(512) = 398 cycles   kfree = 615 cycles
kmalloc(1024) = 573 cycles  kfree = 745 cycles
kmalloc(2048) = 629 cycles  kfree = 816 cycles
kmalloc(4096) = 473 cycles  kfree = 548 cycles
kmalloc(8192) = 659 cycles  kfree = 745 cycles
kmalloc(16384) = 724 cycles kfree = 843 cycles


2. Kmalloc: alloc/free test

*cmpxchg
kmalloc(8)/kfree = 321 cycles
kmalloc(16)/kfree = 308 cycles
kmalloc(32)/kfree = 311 cycles
kmalloc(64)/kfree = 310 cycles
kmalloc(128)/kfree = 306 cycles
kmalloc(256)/kfree = 325 cycles
kmalloc(512)/kfree = 324 cycles
kmalloc(1024)/kfree = 322 cycles
kmalloc(2048)/kfree = 309 cycles
kmalloc(4096)/kfree = 678 cycles
kmalloc(8192)/kfree = 1027 cycles
kmalloc(16384)/kfree = 1204 cycles

* cmpxchg_local
kmalloc(8)/kfree = 112 cycles
kmalloc(16)/kfree = 103 cycles
kmalloc(32)/kfree = 103 cycles
kmalloc(64)/kfree = 103 cycles
kmalloc(128)/kfree = 112 cycles
kmalloc(256)/kfree = 111 cycles
kmalloc(512)/kfree = 111 cycles
kmalloc(1024)/kfree = 111 cycles
kmalloc(2048)/kfree = 121 cycles
kmalloc(4096)/kfree = 650 cycles
kmalloc(8192)/kfree = 1042 cycles
kmalloc(16384)/kfree = 1149 cycles

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/