Date:	Thu, 01 Nov 2007 08:17:58 +0100
From:	Eric Dumazet <dada1@...mosbay.com>
To:	Christoph Lameter <clameter@....com>
CC:	akpm@...ux-foundation.org, linux-arch@...r.kernel.org,
	linux-kernel@...r.kernel.org,
	Mathieu Desnoyers <mathieu.desnoyers@...ymtl.ca>,
	Pekka Enberg <penberg@...helsinki.fi>
Subject: Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu
 access overhead

Christoph Lameter wrote:
> This patch increases the speed of the SLUB fastpath by
> improving the per cpu allocator and makes it usable for SLUB.
> 
> Currently allocpercpu manages arrays of pointer to per cpu objects.
> This means that it has to allocate the arrays and then populate them
> as needed with objects. Although these objects are called per cpu
> objects they cannot be handled in the same way as per cpu objects
> by adding the per cpu offset of the respective cpu.
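If I read this right, the current allocpercpu scheme looks roughly like the
following (a simplified sketch, not the exact kernel code; names are mine):

	/* one pointer per possible cpu, allocated and populated on demand */
	struct percpu_data_sketch {
		void *ptrs[NR_CPUS];
	};

	static void *old_style_per_cpu_ptr(struct percpu_data_sketch *pdata, int cpu)
	{
		/* every access pays an extra dereference through the array,
		 * and the objects themselves are scattered around memory */
		return pdata->ptrs[cpu];
	}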
> 
> The patch here changes that. We create a small memory pool in the
> percpu area and allocate from there if alloc per cpu is called.
> As a result we do not need the per cpu pointer arrays for each
> object. This reduces memory usage and also the cache footprint
> of allocpercpu users. Also the per cpu objects for a single processor
> are tightly packed next to each other decreasing cache footprint
> even further and making it possible to access multiple objects
> in the same cacheline.
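So with the new scheme, if I understand it, the value handed back by allocpercpu
is really just an offset into the per cpu area, something like this (illustrative
sketch only, assuming the generic __per_cpu_offset[] array):

	static void *new_style_per_cpu_ptr(void *pcpu_offset, int cpu)
	{
		/* no pointer array any more: reaching a given cpu's copy is
		 * a plain add of that cpu's per cpu offset */
		return (char *)pcpu_offset + __per_cpu_offset[cpu];
	}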
> 
> SLUB has the same mechanism implemented. After fixing up the
> allocpercpu stuff we throw the SLUB method out and use the new
> allocpercpu handling. Then we optimize allocpercpu addressing
> by adding a new function
> 
> 	this_cpu_ptr()
> 
> that allows the determination of the per cpu pointer for the
> current processor in a more efficient way on many platforms.
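Just to check I understand the intended use, something like this (hypothetical
example, my own structure and field names):

	struct my_stat *stats = alloc_percpu(struct my_stat);

	/* copy belonging to the cpu we are currently running on */
	this_cpu_ptr(stats)->count++;

	/* copy of an arbitrary cpu, as today */
	per_cpu_ptr(stats, cpu)->count = 0;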
> 
> This increases the speed of SLUB (and likely other kernel subsystems
> that benefit from the allocpercpu enhancements):
> 
> 
>  size   SLAB    SLUB    SLUB+   SLUB-o  SLUB-a
>    8    96      86      45      44      38	3 *
>   16    84      92      49      48      43	2 *
>   32    84      106     61      59      53	+++
>   64    102     129     82      88      75	++
>  128    147     226     188     181     176	-
>  256    200     248     207     285     204	=
>  512    300     301     260     209     250	+
> 1024    416     440     398     264     391	++
> 2048    720     542     530     390     511	+++
> 4096    1254    342     342     336     376	3 *
> 
> alloc/free test
>       SLAB    SLUB    SLUB+   SLUB-o	SLUB-a
>       137-146 151     68-72   68-74	56-58	3 *
> 
> Note: The per cpu optimizations are only halfway there because of the screwed-up
> way that x86_64 handles its cpu area, which causes additional cycles to be
> spent retrieving a pointer from memory and adding it to the address.
> The i386 code is much less cycle intensive, being able to get at per cpu
> data using a segment prefix, and if we can get that to work on x86_64
> then we may be able to get the cycle count for the fastpath down to 20-30
> cycles.
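For reference, my rough understanding of the two addressing styles (pseudo-code,
from memory, details approximate):

	/* x86_64 today: the per cpu base is itself fetched from memory,
	 * then added to the pointer -> an extra load plus an add per access */
	off = read_pda(data_offset);
	p   = (char *)ptr + off;
	v   = *p;

	/* i386: a single memory access with a segment prefix; the offset
	 * addition is done implicitly by the segment base */
	v = *ptr;		/* compiled as a %fs-relative access */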
> 

This really sounds good, Christoph, and not only for SLUB. I guess the 32k limit
will not be enough, though, because many things would use per_cpu if only per_cpu
were reasonably fast (i.e. not so many dereferences).

I think this question already came up in the past and Linus already answered it,
but let me ask it again: what about VM games on modern cpus (64-bit arches)?

Say we reserve on x86_64 a really huge (2^32 bytes) virtual area, and change the
VM layout so that each cpu maps its own per_cpu area into it, with the local
per_cpu data sitting at the same virtual address on every cpu. Then we don't need
a segment prefix, nor to add a 'per_cpu offset'. There would be no need to write
special asm functions to read/write/increment per_cpu data, and gcc could apply
its normal optimization rules.

We would only need to add the "per_cpu offset" to get at the data of a given cpu.
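Something like this is what I have in mind (purely illustrative, the base address
and helper names are made up):

	/* same virtual address on every cpu; the page tables map it to that
	 * cpu's private pages, so a local access is a plain load/store and
	 * gcc can optimize it like any other memory reference */
	#define PCPU_LOCAL_BASE		0xffffe20000000000UL	/* example value */

	#define this_cpu_var(off)	(*(unsigned long *)(PCPU_LOCAL_BASE + (off)))

	/* only when we want another cpu's copy do we add its per_cpu offset */
	#define other_cpu_var(off, cpu)	\
		(*(unsigned long *)(per_cpu_base(cpu) + (off)))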

