Message-Id: <1234958676-27618-1-git-send-email-tj@kernel.org>
Date: Wed, 18 Feb 2009 21:04:26 +0900
From: Tejun Heo <tj@...nel.org>
To: rusty@...tcorp.com.au, tglx@...utronix.de, x86@...nel.org,
linux-kernel@...r.kernel.org, hpa@...or.com, jeremy@...p.org,
cpw@....com, mingo@...e.hu
Subject: [PATCHSET x86/core/percpu] implement dynamic percpu allocator
Hello, all.
This patchset implements a dynamic percpu allocator. As I wrote
before, the percpu areas are organized in chunks, which in turn are
composed of num_possible_cpus() units. As the offset of each unit
from the first unit stays the same regardless of where the chunk is,
arch code can directly access each percpu area by setting up percpu
accesses such that each cpu translates the same percpu address one
unit size apart.
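The unit-offset scheme above can be sketched as follows. This is a
simplified, hypothetical model (names and sizes are illustrative,
not the actual mm/percpu.c implementation): a chunk holds
num_possible_cpus() units back to back, so cpu N's copy of any
percpu object lives exactly N * unit_size bytes past cpu 0's copy.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical constants for a 4-way box; the real allocator
 * computes the unit size at boot. */
#define NR_CPUS   4
#define UNIT_SIZE ((size_t)128 * 1024)  /* e.g. 128k per unit */

/* Given the address of cpu 0's copy of a percpu object, return the
 * address of the given cpu's copy: same offset, one unit apart. */
static void *pcpu_ptr(void *pcpu_addr, unsigned int cpu)
{
	return (char *)pcpu_addr + (size_t)cpu * UNIT_SIZE;
}
```

This is why arch code only needs to set up a single per-cpu offset:
every percpu address translates uniformly across cpus.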
The statically declared percpu area for the kernel, which is set up
early during boot, is also served by the same allocator, but it
needs a special init path as it must be up and running way before
regular memory management is initialized.
Percpu areas are allocated from the vmalloc space and managed directly
by the percpu code. Chunks start empty and are populated with pages
as they're allocated. As there are many small allocations which
often need much smaller alignment (no need for cacheline alignment),
the allocator tries to maximize chunk utilization and places new
allocations in fuller chunks.
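The "place allocations in fuller chunks" policy can be illustrated
with a toy chunk selector (hypothetical names and a flat free-space
model, not the real mm/percpu.c data structures): among chunks with
enough free space, pick the one with the least free space, so that
emptier chunks stay available for larger requests.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model: each chunk tracks only its largest free block. */
struct chunk {
	size_t free;	/* largest contiguous free block, in bytes */
};

/* Return the index of the fullest chunk that can still satisfy
 * `size`, or -1 if a new chunk must be created. */
static int pcpu_pick_chunk(const struct chunk *chunks, int nr, size_t size)
{
	int best = -1;

	for (int i = 0; i < nr; i++) {
		if (chunks[i].free < size)
			continue;
		if (best < 0 || chunks[i].free < chunks[best].free)
			best = i;
	}
	return best;
}
```

Preferring the tightest fit keeps utilization high, which matters
because every chunk costs num_possible_cpus() units of address space.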
There have been several concerns regarding this approach.
* On 64bit, no need for chunks. We can just allocate contiguous
areas.
For 32bit, with the overcrowded address space, consolidating percpu
allocations into the vmalloc (or another) area is a big win, as no
space needs to be set aside up front for percpu variables, and with
a relatively small number of possible cpus, the chunks can be kept
at a manageable size (e.g. 128k chunks for 4-way smp wouldn't be
too bad) while achieving reasonable scalability.
So, I think the question becomes whether it makes sense to use
different allocation schemes for 32bit and 64bit. The added
overhead of chunk handling itself isn't anything that warrants
separate implementations. If there's a way to solve some other
issues nicely with the larger address space, maybe, but I really
think it would be best to stick with one implementation.
* It adds to TLB pressure.
Yeah, unfortunately, it does. Currently it adds a number of kernel
4k pages into circulation (cold/high pages, so unlikely to affect
other large mappings). There are several different varieties of
this issue.
The unit size, and thus the chunk size, is pretty flexible (it
currently must be a power of 2, but that restriction can be lifted
easily). With vm area allocation at a larger alignment, using large
pages for the chunk (non-NUMA) or the unit (large, large NUMA)
shouldn't be too difficult for high-end machines, but for mid-range
machines there doesn't seem to be much else to do than stick with
4k mappings.
The TLB pressure problem would exist regardless of address layout
as long as we want to grow the percpu area dynamically:
page-granular growth adds 4k TLB pressure, while large-page
granularity is likely to waste lots of space.
One trick we can do is to reserve the initial chunk in a
non-vmalloc area so that at least the static percpu variables and
whatever gets allocated in the first chunk are served by regular
large page mappings. Given that those are the most frequently
visited ones, this could be a nice compromise - no noticeable
penalty for the usual cases yet allowing scalability for the
unusual ones. If this is something which can be agreed on, I'll
pursue it.
The percpu allocator is an optional feature which each arch can
select by setting the HAVE_DYNAMIC_PER_CPU_AREA configuration
variable. Currently only x86_32 and x86_64 use it.
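For illustration, an arch would opt in with something like the
following Kconfig entry (a sketch only; the exact form and placement
are whatever patch 0010 adds to arch/x86/Kconfig):

```kconfig
config HAVE_DYNAMIC_PER_CPU_AREA
	def_bool y
```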
Ah.. I also left out the cpu hotplugging stuff for now. This
largely isn't an issue on most machines, where num_possible_cpus()
doesn't deviate much from num_online_cpus(). Are there cases where
this is critical? Currently, no user of percpu allocation, static
or dynamic, cares about this, and it has been like this for a long
time, so I'm a little bit skeptical about it.
This patchset contains the following ten patches.
0001-vmalloc-call-flush_cache_vunmap-from-unmap_kernel.patch
0002-module-fix-out-of-range-memory-access.patch
0003-module-reorder-module-pcpu-related-functions.patch
0004-alloc_percpu-change-percpu_ptr-to-per_cpu_ptr.patch
0005-alloc_percpu-add-align-argument-to-__alloc_percpu.patch
0006-percpu-kill-percpu_alloc-and-friends.patch
0007-vmalloc-implement-vm_area_register_early.patch
0008-vmalloc-add-un-map_kernel_range_noflush.patch
0009-percpu-implement-new-dynamic-percpu-allocator.patch
0010-x86-convert-to-the-new-dynamic-percpu-allocator.patch
0001-0003 contain fixes and trivial prep. 0004-0006 clean up percpu.
0007-0008 add stuff to vmalloc which will be used by the new
allocator. 0009-0010 implement and use the new allocator.
This patchset is on top of the current x86/core/percpu[1] and can be
fetched from the following git tree.
git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git tj-percpu
diffstat follows.
arch/alpha/mm/init.c | 20
arch/x86/Kconfig | 3
arch/x86/include/asm/percpu.h | 8
arch/x86/include/asm/pgtable.h | 1
arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c | 2
arch/x86/kernel/setup_percpu.c | 62 +-
arch/x86/mm/init_32.c | 10
arch/x86/mm/init_64.c | 19
block/blktrace.c | 2
drivers/acpi/processor_perflib.c | 4
include/linux/percpu.h | 65 +-
include/linux/vmalloc.h | 4
kernel/module.c | 78 +-
kernel/sched.c | 6
kernel/stop_machine.c | 2
mm/Makefile | 4
mm/allocpercpu.c | 32 -
mm/percpu.c | 876 +++++++++++++++++++++++++++++
mm/vmalloc.c | 84 ++
net/ipv4/af_inet.c | 4
20 files changed, 1183 insertions(+), 103 deletions(-)
Thanks.
--
tejun
[1] 58105ef1857112a186696c9b8957020090226a28