linux-kernel - Re: Boot failure with next-20120208

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4F396FA9.90606@linux.intel.com>
Date:	Mon, 13 Feb 2012 12:16:41 -0800
From:	Arjan van de Ven <arjan@...ux.intel.com>
To:	Andrew Morton <akpm@...ux-foundation.org>
CC:	Michael Neuling <mikey@...ling.org>,
	Stephen Rothwell <sfr@...b.auug.org.au>,
	LKML <linux-kernel@...r.kernel.org>, gregkh@...uxfoundation.org,
	linux-next@...r.kernel.org,
	ppc-dev <linuxppc-dev@...ts.ozlabs.org>,
	Milton Miller <miltonm@....com>
Subject: Re: Boot failure with next-20120208

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 2/13/2012 12:05 PM, Andrew Morton wrote:
> On Mon, 13 Feb 2012 06:18:34 -0800 Arjan van de Ven
> <arjan@...ux.intel.com> wrote:
> 
>> On 2/12/2012 7:04 PM, Michael Neuling wrote:
>>>> Just a quick note to say I got a boot OOPs with next-20120208
>>>> and 9 on a Power7 blade (my other PowerPC boot tests are ok.
>>>> I'll investigate this further on Monday.
> 
> Thanks for testing linux-next.  Very useful.
> 
>>>> The line referenced below is:
>>>> 
>>>> BUG_ON(!kobj || !kobj->sd || !attr);
>>>> 
>>>> in sysfs_create_file().
> 
> Yes, this is exactly why we should never use BUG_ON(a || b).  We
> don't know which of those three expressions triggered.
> 
>>>> calling  .topology_init+0x0/0x1ac @ 1 initcall
>>>> 7_.async_cpu_up+0x0/0x40 returned 0 after 9765 usecs 
>>>> async_continuing @ 20 after 9765 usec ------------[ cut here
>>>> ]------------ kernel BUG at fs/sysfs/file.c:573! Oops:
>>>> Exception in kernel mode, sig: 5 [#1] SMP NR_CPUS=32 NUMA
>>>> pSeries Modules linked in: NIP: c00000000024a35c LR:
>>>> c0000000004ee050 CTR: c00000000083ca24 REGS: c0000003fd9e7560
>>>> TRAP: 0700   Not tainted  (3.3.0-rc2-autokern1) MSR:
>>>> 8000000000029032 <SF,EE,ME,IR,DR,RI>  CR: 88002082  XER:
>>>> 0000000f CFAR: c00000000024a370 TASK = c0000003fd9e8000[20]
>>>> 'kworker/u:6' THREAD: c0000003fd9e4000 CPU: 0 GPR00:
>>>> 0000000000000001 c0000003fd9e77e0 c000000000d19bb8
>>>> 0000000000000000 GPR04: c000000000bf37a8 0000000000000008
>>>> 8000000002096400 0000000000000000 GPR08: 0000000000000000
>>>> c000000000f80028 c000000000d52bd8 0000000000000000 GPR12:
>>>> 0000000048002088 c00000000f33b000 0000000001affa78
>>>> 00000000009aa000 GPR16: 0000000000e1f3c8 0000000002d517f0
>>>> 0000000001aff984 0000000000000060 GPR20: 0000000000000000
>>>> ffffffffffffffff 0000000000000000 c000000000c45128 GPR24:
>>>> 0000000000000000 0000000000000008 0000000000000000
>>>> c000000000c44200 GPR28: c000000000f80028 0000000000000008
>>>> c000000000c85038 0000000000000002 NIP [c00000000024a35c]
>>>> .sysfs_create_file+0x1c/0x40 LR [c0000000004ee050]
>>>> .device_create_file+0x20/0x40 Call Trace: [c0000003fd9e77e0]
>>>> [c0000003fd9e78a0] 0xc0000003fd9e78a0 (unreliable) 
>>>> [c0000003fd9e7850] [c00000000083c9a4]
>>>> .register_cpu_online+0x1d0/0x250 [c0000003fd9e7900]
>>>> [c00000000083ca8c] .sysfs_cpu_notify+0x68/0x28c 
>>>> [c0000003fd9e79b0] [c00000000083769c]
>>>> .notifier_call_chain+0x9c/0x100 [c0000003fd9e7a50]
>>>> [c0000000000a5878] .__cpu_notify+0x38/0x80 [c0000003fd9e7ad0]
>>>> [c00000000083e124] ._cpu_up+0x10c/0x178 [c0000003fd9e7b90]
>>>> [c00000000083e2c8] .cpu_up+0x138/0x164 [c0000003fd9e7c20]
>>>> [c000000000ba46d0] .async_cpu_up+0x28/0x40 [c0000003fd9e7ca0]
>>>> [c0000000000d81ec] .async_run_entry_fn+0xbc/0x1f0 
>>>> [c0000003fd9e7d50] [c0000000000c7cbc]
>>>> .process_one_work+0x19c/0x590 [c0000003fd9e7e10]
>>>> [c0000000000c8618] .worker_thread+0x188/0x4b0 
>>>> [c0000003fd9e7ed0] [c0000000000ce57c] .kthread+0xbc/0xd0 
>>>> [c0000003fd9e7f90] [c000000000021448]
>>>> .kernel_thread+0x54/0x70 Instruction dump: 7fa3eb78 ebe1fff8
>>>> eba1ffe8 7c0803a6 4e800020 2c230000 41820024 e8630030 
>>>> 7c800074 7800d182 2fa30000 419e0014 <0b000000> 38a00002
>>>> 4bfffebc e8630030 ---[ end trace 31fd0ba7d8756001 ]--- 
>>>> initcall .topology_init+0x0/0x1ac returned 0 after 0 usecs 
>>>> calling  .pcibios_init+0x0/0xe8 @ 1 PCI: Probing PCI
>>>> hardware PCI: Probing PCI hardware done initcall
>>>> .pcibios_init+0x0/0xe8 returned 0 after 0 usecs calling
>>>> .add_system_ram_resources+0x0/0x140 @ 1 initcall
>>>> .add_system_ram_resources+0x0/0x140 returned 0 after 0 usecs 
>>>> calling
>>>> .__machine_initcall_powermac_pmac_i2c_create_platform_devices+0x0/0xc8
>>>> @ 1 initcall
>>>> .__machine_initcall_powermac_pmac_i2c_create_platform_devices+0x0/0xc8
>>>> returned 0 after 0 usecs calling  .opal_init+0x0/0x1cc @ 1 
>>>> opal: Node not found initcall .opal_init+0x0/0x1cc returned
>>>> -19 after 0 usecs calling
>>>> .__machine_initcall_pseries_ioei_init+0x0/0xa0 @ 1
>>> 
>>> Reverting "smp: start up non-boot CPUs asynchronously"
>>> (8de7a96405 from next-20120208) fixes this problem for me.
>>> 
>> if that fixes it, it means PPC has a race somewhere in the cpu
>> hotplug code, since all the patch does is hotplug the cpus one by
>> one (which the normal kernel also does, just not in parallel with
>> other work)
> 
> The bug looks pretty generic, nothing very PPC-specific there.  It 
> might affect other architectures - we won't know until we find out
> wht caused it.

well one half of the race looks pretty generic...
..... doesn't mean the other half of the race is though....


> 
> Ho hum, I suppose I should pull the patch out of linux-next, to
> avoid disrupting other testing.  This means it's going to be hard
> to get the bug fixed.

it means losing this one big PPC machine indeed.... until they hit
that same race some other way with regular real cpu hotplug ;-(
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.17 (MingW32)

iQEcBAEBAgAGBQJPOW+pAAoJEEHdSxh4DVnEJXEIAKHFQwrGw1R66bEjTMxfgOpH
ZlUwjciyNlo2KpqgaUgulHDHNzWQa29nzjPfuk6qG7mHGKnbgyn/IBCnUD5uJqri
6yu7Md0vekwJnoilZQvEuvF6qHrOYOcaWvW60x2y3W+fBesa5zxpqwDLbKj4Qvu3
YAbxeMaAr0W/d7pKEubKds3YJnr1S06qbK8Jw7DF92YEd7xDTBTSpuSF3vA6BlLQ
jEV1xwxVwnLEa6A/DnkvQ67Kayj0zfC2CSCqlt2T2BiDl81XkCPC/U3yHuHb/ISI
ykwFHy4L2VH54n5dQH8S6qfzNS1vh8IQ0xzVtJH6CHh8XXK7T97T07jZZ/LWMWI=
=GQrj
-----END PGP SIGNATURE-----
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/