linux-kernel - Re: [PATCH] mm: percpu: Add PCPU_FC_FIXED to pcpu_fc for setting fixed pcpu_atom

Message-ID: <1335405672.14538.135.camel@ymzhang.sh.intel.com>
Date:	Thu, 26 Apr 2012 10:01:12 +0800
From:	Yanmin Zhang <yanmin_zhang@...ux.intel.com>
To:	Tejun Heo <tj@...nel.org>
Cc:	ShuoX Liu <shuox.liu@...el.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] mm: percpu: Add PCPU_FC_FIXED to pcpu_fc for setting
 fixed pcpu_atom_size.

On Wed, 2012-04-25 at 15:24 -0700, Tejun Heo wrote:
> Hello, ShouX.
> 
> On Wed, Apr 25, 2012 at 04:49:28PM +0800, ShuoX Liu wrote:
> > From: ShuoX Liu <shuox.liu@...el.com>
> > 
> > We are enabling Android on Medfield. On i386, if the board has more
> > physical memory, it means the vmalloc space is small. If the vmalloc
> > space is big, it means physical memory is small. Dynamic percpu
> > allocation is based on VM space. On i386, by default, the chunk size
> > is 4MB. As vmalloc space is <= 128M, percpu allocation often fails.
> 
> Can you please provide more details - which kernel/user split was
> used, how much memory was there and so on?  Also, can you please
> attach boot log and the output of "cat /proc/vmallocinfo"?
We are enabling Android on Medfield, which is an embedded i386 platform.
The kernel/user split is 1G/3G. Physical memory is 1GB, and HIGHMEM is enabled.

The boot log is rather large, so I copied only the percpu part for your reference.

04-26 09:57:32.987     0     0 W Kernel-Dmesg: <6>[    0.000000] SMP: Allowing 2 CPUs, 0 hotplug CPUs
04-26 09:57:32.987     0     0 W Kernel-Dmesg: <7>[    0.000000] nr_irqs_gsi: 85
04-26 09:57:32.987     0     0 W Kernel-Dmesg: <6>[    0.000000] Allocating PCI resources starting at 40000000 (gap: 40000000:bec00000)
04-26 09:57:32.987     0     0 W Kernel-Dmesg: <6>[    0.000000] setup_percpu: NR_CPUS:2 nr_cpumask_bits:2 nr_cpu_ids:2 nr_node_ids:1
04-26 09:57:32.987     0     0 W Kernel-Dmesg: <6>[    0.000000] PERCPU: Embedded 12 pages/cpu @f6400000 s25280 r0 d23872 u2097152
04-26 09:57:32.987     0     0 W Kernel-Dmesg: <7>[    0.000000] pcpu-alloc: s25280 r0 d23872 u2097152 alloc=1*4194304
04-26 09:57:32.987     0     0 W Kernel-Dmesg: <7>[    0.000000] pcpu-alloc: [0] 0 1
04-26 09:57:32.987     0     0 W Kernel-Dmesg: <4>[    0.000000] Built 1 zonelists in Zone order, mobility grouping on.  Total pages: 252826
04-26 09:57:32.987     0     0 W Kernel-Dmesg: <5>[    0.000000] Kernel command line: init=/init pci=noearly console=tty
04-26 09:57:32.988     0     0 W Kernel-Dmesg: MFD3 console=logk0 earlyprintk=nologger loglevel=7 hsu_dma=7 kmemleak=off androidboot.bootmedia=sdcard androidboot.hardware=mfld_pr2 ip=50.0.0.2:50.0.0.1::255.255.255.0::usb0:on g_android.fastboot=1 droidboot.scratch=100 androidboot.wakesrc=05 androidboot.mode=fastboot
04-26 09:57:32.988     0     0 W Kernel-Dmesg: <6>[    0.000000] PID hash table entries: 4096 (order: 2, 16384 bytes)
04-26 09:57:32.988     0     0 W Kernel-Dmesg: <6>[    0.000000] Dentry cache hash table entries: 131072 (order: 7, 524288 b
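
The numbers in the PERCPU and pcpu-alloc lines above can be cross-checked with
a little arithmetic. The sketch below is only an illustration: the constants
are copied from the log, the 4096-byte page size is the usual i386 value, the
128 MB vmalloc figure is the upper bound quoted from the patch description,
and the helper names are made up for this example. It ignores alignment and
every other vmalloc user, so the chunk count is an optimistic ceiling.

```c
#include <assert.h>
#include <stdio.h>

/* Constants copied from the boot log; the names are hypothetical. */
enum {
    LOG_PAGE_SIZE = 4096,      /* assumed i386 page size */
    LOG_STATIC    = 25280,     /* s25280 */
    LOG_RESERVED  = 0,         /* r0 */
    LOG_DYNAMIC   = 23872,     /* d23872 */
    LOG_UNIT      = 2097152,   /* u2097152 */
    LOG_ALLOC     = 4194304,   /* alloc=1*4194304 */
    LOG_VMALLOC   = 128 << 20  /* "vmalloc space is <= 128M" */
};

/* Pages used by one CPU's static + reserved + dynamic areas. */
static unsigned long first_chunk_pages_per_cpu(void)
{
    unsigned long sum = LOG_STATIC + LOG_RESERVED + LOG_DYNAMIC;

    assert(sum % LOG_PAGE_SIZE == 0); /* the logged sizes are page aligned */
    return sum / LOG_PAGE_SIZE;
}

/* Number of per-CPU units that fit in one allocation. */
static unsigned long units_per_alloc(void)
{
    return LOG_ALLOC / LOG_UNIT;
}

/* Ceiling on 4 MB chunks in a completely free, unfragmented vmalloc area. */
static unsigned long max_chunks_in_vmalloc(void)
{
    return LOG_VMALLOC / LOG_ALLOC;
}

/* Print the three derived values. */
static void print_percpu_summary(void)
{
    printf("pages/cpu=%lu units/alloc=%lu max chunks=%lu\n",
           first_chunk_pages_per_cpu(), units_per_alloc(),
           max_chunks_in_vmalloc());
}
```

Calling print_percpu_summary() from a main() gives 12 pages per CPU (matching
"Embedded 12 pages/cpu"), 2 units per allocation (one per CPU), and at most
32 chunks. With a fragmented vmalloc area, the number of 4 MB holes actually
available can be far smaller.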


We hit the issue below.

03-13 08:22:36.273 11151 11133 I KERNEL : [74395.013328] PERCPU: allocation failed, size=252 align=4, failed to allocate new chunk
03-13 08:22:36.274 11151 11133 I KERNEL : [74395.014062] Pid: 11151, comm: rild Not tainted 3.0.8-137162-g5e405a0 #1
03-13 08:22:36.275 11151 11133 I KERNEL : [74395.014656] Call Trace:
03-13 08:22:36.283 11151 11133 I KERNEL : [74395.022691] [<c1306eaa>] pcpu_alloc+0x12ca/0x1300
03-13 08:22:36.288 11151 11133 I KERNEL : [74395.027998] [<c125c1a9>] ? sysctl_set_parent+0x29/0x40
03-13 08:22:36.293 11151 11133 I KERNEL : [74395.033335] [<c1306f0f>] __alloc_percpu+0xf/0x20
03-13 08:22:36.298 11151 11133 I KERNEL : [74395.038144] [<c180dcad>] snmp_mib_init+0x3d/0x70
03-13 08:22:36.302 11151 11133 I KERNEL : [74395.043221] [<c1846efa>] ipv6_add_dev+0xfa/0x380


We ran many stress tests and hit the percpu allocation failure in other
places as well.

We googled the error and found that other people in the community have also
hit the bug. You suggested that they use percpu_alloc=page. We tried it, and
it hurts power.

vmallocinfo is attached. From the vmallocinfo, we can see that the VM space
is fragmented. We will write another patch to clean that up.

> 
> > If using PERCPU_FC_PAGE, system can't go to deep sleep states.
> 
> Why?
Medfield has 2 CPU threads. The core enters a deep C state, for example C6,
only when both threads have entered it. When the kernel is booted with
percpu_alloc=page, the core often aborts C6 entry. We don't know why. C6 entry
is aborted under many conditions; one is a pending interrupt. I suspect that
page-size allocation might trigger more cache misses. Just before calling
mwait to enter C6, we record some statistics, and that might trigger a cache
miss that aborts C6. It's just a _GUESS_.

We tried atom_size values of 32k, 128k and 256k. None of them showed a power
regression.

> 
> > diff --git a/arch/x86/kernel/setup_percpu.c b/arch/x86/kernel/setup_percpu.c
> > index 71f4727..824bc41 100644
> > --- a/arch/x86/kernel/setup_percpu.c
> > +++ b/arch/x86/kernel/setup_percpu.c
> > @@ -185,9 +185,13 @@ void __init setup_per_cpu_areas(void)
> >  #endif
> >  	rc = -EINVAL;
> >  	if (pcpu_chosen_fc != PCPU_FC_PAGE) {
> > -		const size_t atom_size = cpu_has_pse ? PMD_SIZE : PAGE_SIZE;
> > +		size_t atom_size;
> >  		const size_t dyn_size = PERCPU_MODULE_RESERVE +
> >  			PERCPU_DYNAMIC_RESERVE - PERCPU_FIRST_CHUNK_RESERVE;
> > +		if (pcpu_chosen_fc == PCPU_FC_FIXED && pcpu_atom_size)
> > +			atom_size = pcpu_atom_size;
> > +		else
> > +			atom_size = cpu_has_pse ? PMD_SIZE : PAGE_SIZE;
> >  
> >  		rc = pcpu_embed_first_chunk(PERCPU_FIRST_CHUNK_RESERVE,
> >  					    dyn_size, atom_size,
> 
> Umm... this is way too hacky.  atom_size can't be an arbitrary value
The interface is meant to give the admin more options. It doesn't mean the
admin could choose any value.
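
To make "not any value" concrete, here is a minimal sketch of a sanity check
on a requested atom size. The criteria (a power of two, at least one page, at
most PMD_SIZE) are assumptions chosen for illustration; they are not taken
from the patch or from the existing percpu code, and atom_size_plausible() is
a hypothetical helper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical check for a user-supplied atom size.
 * The acceptance criteria are illustrative assumptions only.
 */
static bool atom_size_plausible(size_t atom, size_t page_size, size_t pmd_size)
{
    /* The page size itself must be a non-zero power of two. */
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

    if (atom < page_size || atom > pmd_size)
        return false;

    /* A power of two >= page_size is also a multiple of page_size. */
    return (atom & (atom - 1)) == 0;
}
```

With a 4096-byte page and a 4 MB PMD_SIZE, the three values we tested (32k,
128k, 256k) pass this check, while something like 3000 bytes or 48k does not.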

> and the param is meaningful only to x86 yet defined globally. 
Right, but we are looking at this from the point of view of real usage
requirements. If you google the failure message dumped by the kernel, you can
see that many other people have hit it too. That's why we sent the patch to
LKML.

>  Also,
> while atom_size has effect on vmalloc area usage, way more important
> factor is distance between units.
Could you elaborate on that?

>   What we probably need to do is
> tighten the rejection criteria of pcpu_embed_first_chunk()
Could you explain it more?

>  and fix
> whatever problem FC_PAGE is causing.
We can't fix the FC_PAGE power regression. To do so, we would need to contact
many hardware architects. The current kernel already supports FC_PAGE and
PMD_SIZE, so why not allow the admin to choose other values?
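
For reference, the choice the patch makes can be modeled in plain userspace C
as below. This is a sketch of the selection logic only: the MODEL_* enum and
model_pick_atom_size() are stand-ins invented for this example, the sizes are
passed in as parameters instead of coming from kernel macros, and the
surrounding "pcpu_chosen_fc != PCPU_FC_PAGE" test from the diff is assumed to
have been done by the caller.

```c
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-ins for the first-chunk allocator choices (hypothetical). */
enum model_fc { MODEL_FC_AUTO, MODEL_FC_EMBED, MODEL_FC_PAGE, MODEL_FC_FIXED };

/*
 * Mirror of the if/else added by the patch: use the requested size only
 * when the FIXED mode is selected and a non-zero size was supplied;
 * otherwise fall back to the existing PSE-based default.
 */
static size_t model_pick_atom_size(enum model_fc chosen, size_t requested,
                                   bool has_pse, size_t pmd_size,
                                   size_t page_size)
{
    if (chosen == MODEL_FC_FIXED && requested != 0)
        return requested;
    return has_pse ? pmd_size : page_size;
}
```

In this model, the default path is untouched: unless FIXED is selected with a
non-zero size, the result is the same PMD_SIZE or PAGE_SIZE value as before.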

Thanks for the comments.

Yanmin


Download attachment "vmallocinfo.tgz" of type "application/x-compressed-tar" (5618 bytes)