Message-ID: <20090821111450.GA32037@elte.hu>
Date: Fri, 21 Aug 2009 13:14:50 +0200
From: Ingo Molnar <mingo@...e.hu>
To: linux-tip-commits@...r.kernel.org,
Arjan van de Ven <arjan@...radead.org>,
Alan Cox <alan@...rguk.ukuu.org.uk>,
Andrew Morton <akpm@...ux-foundation.org>,
Dave Jones <davej@...hat.com>,
Kyle McMartin <kyle@...artin.ca>, Greg KH <gregkh@...e.de>
Cc: linux-kernel@...r.kernel.org, hpa@...or.com, mingo@...hat.com,
torvalds@...ux-foundation.org, catalin.marinas@....com,
a.p.zijlstra@...llo.nl, jens.axboe@...cle.com, fweisbec@...il.com,
stable@...nel.org, srostedt@...hat.com, tglx@...utronix.de
Subject: Re: [tip:tracing/urgent] tracing: Fix too large stack usage in
do_one_initcall()

* tip-bot for Ingo Molnar <mingo@...e.hu> wrote:
> Commit-ID: 4a683bf94b8a10e2bb0da07aec3ac0a55e5de61f
> Gitweb: http://git.kernel.org/tip/4a683bf94b8a10e2bb0da07aec3ac0a55e5de61f
> Author: Ingo Molnar <mingo@...e.hu>
> AuthorDate: Fri, 21 Aug 2009 12:53:36 +0200
> Committer: Ingo Molnar <mingo@...e.hu>
> CommitDate: Fri, 21 Aug 2009 13:03:22 +0200
>
> tracing: Fix too large stack usage in do_one_initcall()
>
> One of my testboxes triggered this nasty stack overflow crash
> during SCSI probing:
>
> [ 5.874004] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
> [ 5.875004] device: 'sda': device_add
> [ 5.878004] BUG: unable to handle kernel NULL pointer dereference at 00000a0c
> [ 5.878004] IP: [<b1008321>] print_context_stack+0x81/0x110
> [ 5.878004] *pde = 00000000
> [ 5.878004] Thread overran stack, or stack corrupted
> [ 5.878004] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> [ 5.878004] last sysfs file:
> [ 5.878004]
> [ 5.878004] Pid: 1, comm: swapper Not tainted (2.6.31-rc6-tip-01272-g9919e28-dirty #5685)
> [ 5.878004] EIP: 0060:[<b1008321>] EFLAGS: 00010083 CPU: 0
> [ 5.878004] EIP is at print_context_stack+0x81/0x110
> [ 5.878004] EAX: cf8a3000 EBX: cf8a3fe4 ECX: 00000049 EDX: 00000000
> [ 5.878004] ESI: b1cfce84 EDI: 00000000 EBP: cf8a3018 ESP: cf8a2ff4
> [ 5.878004] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
> [ 5.878004] Process swapper (pid: 1, ti=cf8a2000 task=cf8a8000 task.ti=cf8a3000)
> [ 5.878004] Stack:
> [ 5.878004] b1004867 fffff000 cf8a3ffc
> [ 5.878004] Call Trace:
> [ 5.878004] [<b1004867>] ? kernel_thread_helper+0x7/0x10
> [ 5.878004] BUG: unable to handle kernel NULL pointer dereference at 00000a0c
> [ 5.878004] IP: [<b1008321>] print_context_stack+0x81/0x110
> [ 5.878004] *pde = 00000000
> [ 5.878004] Thread overran stack, or stack corrupted
> [ 5.878004] Oops: 0000 [#2] PREEMPT SMP DEBUG_PAGEALLOC
>
> The oops did not reveal any more details about the real stack,
> and the system got into an infinite loop of recursive pagefaults.
>
> So I booted with CONFIG_STACK_TRACER=y and the 'stacktrace' boot
> parameter. The box did not crash (the timings/conditions probably
> shifted just enough to no longer trigger the catastrophic crash),
> but the /debug/tracing/stack_trace file was rather revealing:
>
> Depth Size Location (72 entries)
> ----- ---- --------
> 0) 3704 52 __change_page_attr+0xb8/0x290
> 1) 3652 24 __change_page_attr_set_clr+0x43/0x90
> 2) 3628 60 kernel_map_pages+0x108/0x120
> 3) 3568 40 prep_new_page+0x7d/0x130
> 4) 3528 84 get_page_from_freelist+0x106/0x420
> 5) 3444 116 __alloc_pages_nodemask+0xd7/0x550
> 6) 3328 36 allocate_slab+0xb1/0x100
> 7) 3292 36 new_slab+0x1c/0x160
> 8) 3256 36 __slab_alloc+0x133/0x2b0
> 9) 3220 4 kmem_cache_alloc+0x1bb/0x1d0
> 10) 3216 108 create_object+0x28/0x250
> 11) 3108 40 kmemleak_alloc+0x81/0xc0
> 12) 3068 24 kmem_cache_alloc+0x162/0x1d0
> 13) 3044 52 scsi_pool_alloc_command+0x29/0x70
> 14) 2992 20 scsi_host_alloc_command+0x22/0x70
> 15) 2972 24 __scsi_get_command+0x1b/0x90
> 16) 2948 28 scsi_get_command+0x35/0x90
> 17) 2920 24 scsi_setup_blk_pc_cmnd+0xd4/0x100
> 18) 2896 128 sd_prep_fn+0x332/0xa70
> 19) 2768 36 blk_peek_request+0xe7/0x1d0
> 20) 2732 56 scsi_request_fn+0x54/0x520
> 21) 2676 12 __generic_unplug_device+0x2b/0x40
> 22) 2664 24 blk_execute_rq_nowait+0x59/0x80
> 23) 2640 172 blk_execute_rq+0x6b/0xb0
> 24) 2468 32 scsi_execute+0xe0/0x140
> 25) 2436 64 scsi_execute_req+0x152/0x160
> 26) 2372 60 scsi_vpd_inquiry+0x6c/0x90
> 27) 2312 44 scsi_get_vpd_page+0x112/0x160
> 28) 2268 52 sd_revalidate_disk+0x1df/0x320
> 29) 2216 92 rescan_partitions+0x98/0x330
> 30) 2124 52 __blkdev_get+0x309/0x350
> 31) 2072 8 blkdev_get+0xf/0x20
> 32) 2064 44 register_disk+0xff/0x120
> 33) 2020 36 add_disk+0x6e/0xb0
> 34) 1984 44 sd_probe_async+0xfb/0x1d0
> 35) 1940 44 __async_schedule+0xf4/0x1b0
> 36) 1896 8 async_schedule+0x12/0x20
> 37) 1888 60 sd_probe+0x305/0x360
> 38) 1828 44 really_probe+0x63/0x170
> 39) 1784 36 driver_probe_device+0x5d/0x60
> 40) 1748 16 __device_attach+0x49/0x50
> 41) 1732 32 bus_for_each_drv+0x5b/0x80
> 42) 1700 24 device_attach+0x6b/0x70
> 43) 1676 16 bus_attach_device+0x47/0x60
> 44) 1660 76 device_add+0x33d/0x400
> 45) 1584 52 scsi_sysfs_add_sdev+0x6a/0x2c0
> 46) 1532 108 scsi_add_lun+0x44b/0x460
> 47) 1424 116 scsi_probe_and_add_lun+0x182/0x4e0
> 48) 1308 36 __scsi_add_device+0xd9/0xe0
> 49) 1272 44 ata_scsi_scan_host+0x10b/0x190
> 50) 1228 24 async_port_probe+0x96/0xd0
> 51) 1204 44 __async_schedule+0xf4/0x1b0
> 52) 1160 8 async_schedule+0x12/0x20
> 53) 1152 48 ata_host_register+0x171/0x1d0
> 54) 1104 60 ata_pci_sff_activate_host+0xf3/0x230
> 55) 1044 44 ata_pci_sff_init_one+0xea/0x100
> 56) 1000 48 amd_init_one+0xb2/0x190
> 57) 952 8 local_pci_probe+0x13/0x20
> 58) 944 32 pci_device_probe+0x68/0x90
> 59) 912 44 really_probe+0x63/0x170
> 60) 868 36 driver_probe_device+0x5d/0x60
> 61) 832 20 __driver_attach+0x89/0xa0
> 62) 812 32 bus_for_each_dev+0x5b/0x80
> 63) 780 12 driver_attach+0x1e/0x20
> 64) 768 72 bus_add_driver+0x14b/0x2d0
> 65) 696 36 driver_register+0x6e/0x150
> 66) 660 20 __pci_register_driver+0x53/0xc0
> 67) 640 8 amd_init+0x14/0x16
> 68) 632 572 do_one_initcall+0x2b/0x1d0
> 69) 60 12 do_basic_setup+0x56/0x6a
> 70) 48 20 kernel_init+0x84/0xce
> 71) 28 28 kernel_thread_helper+0x7/0x10
>
> There are a lot of fat functions in that stack trace, but the
> largest of all is do_one_initcall(). This is due to the boot trace
> entry variables living on the stack.
>
> Fixing this is relatively easy: initcalls are fundamentally
> serialized, so we can move the local variables to file scope.
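
Concretely, the change boils down to something like the sketch below
(illustrative only, not the literal diff - the tracing variable names
and types are from memory and may not match the patch exactly):

  /* before: several hundred bytes of boot-tracing state lived in the
   * do_one_initcall() stack frame */
  int do_one_initcall(initcall_t fn)
  {
          struct boot_trace_call call;    /* holds a symbol name - large */
          struct boot_trace_ret ret;      /* holds a symbol name - large */
          char msgbuf[64];
          /* ... */
  }

  /* after: the same state at file scope - initcalls are fundamentally
   * serialized, so there is no concurrency problem, and the frame
   * shrinks from 572 to 28 bytes */
  static struct boot_trace_call call;
  static struct boot_trace_ret ret;
  static char msgbuf[64];

  int do_one_initcall(initcall_t fn)
  {
          /* ... */
  }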
>
> Note that this large stack footprint had been present for a couple
> of months already - what pushed my system over the edge was the
> addition of kmemleak to the call-chain:
>
> 6) 3328 36 allocate_slab+0xb1/0x100
> 7) 3292 36 new_slab+0x1c/0x160
> 8) 3256 36 __slab_alloc+0x133/0x2b0
> 9) 3220 4 kmem_cache_alloc+0x1bb/0x1d0
> 10) 3216 108 create_object+0x28/0x250
> 11) 3108 40 kmemleak_alloc+0x81/0xc0
> 12) 3068 24 kmem_cache_alloc+0x162/0x1d0
> 13) 3044 52 scsi_pool_alloc_command+0x29/0x70
>
> This pushed the total to ~3800 bytes; only a tiny bit more was
> needed to corrupt the on-kernel-stack thread_info.
>
> The fix reduces the stack footprint from 572 bytes
> to 28 bytes.
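
For context: on 32-bit with 4K stacks the thread_info sits at the
bottom of the very same block the kernel stack grows down into -
roughly the layout below (simplified from include/linux/sched.h). So
with THREAD_SIZE at 4096, ~3800 bytes of stack use leaves well under
300 bytes of headroom before the thread_info itself gets scribbled
over:

  /* Simplified: a task's kernel stack and its thread_info share one
   * THREAD_SIZE-sized block; the stack grows down towards thread_info,
   * so an overrun corrupts thread_info first. */
  union thread_union {
          struct thread_info thread_info;
          unsigned long stack[THREAD_SIZE/sizeof(long)];
  };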

Btw., it will take just two more features like kmemleak to trigger
hard-to-debug stack overflows again on 32-bit. We are right at the
edge, and this situation is not really fixable in a reliable way
anymore.

So I think we should be more drastic and solve the real problem: we
should drop 4K stacks and 8K combo-stacks on 32-bit, and go
exclusively to 8K split stacks.

I.e., the stack size will be 'unified' to a certain degree between
64-bit and 32-bit: process stacks will be 8K on both 64-bit and
32-bit x86, and IRQ stacks will be separate. (On 64-bit we also have
the IST stacks for certain exceptions, which isolate things further.)

This will simplify the 32-bit situation quite a bit, remove a
contentious config option and make the kernel more robust in
general. 8K combo stacks are not safe due to IRQ nesting, and 4K
isolated stacks are not enough. 8K isolated stacks are the way to go.
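
For reference, this is roughly the knob in question (simplified, not
the literal arch/x86/include/asm/thread_info.h source):

  #ifdef CONFIG_4KSTACKS
  # define THREAD_SIZE    4096    /* 4K task stacks + separate per-CPU IRQ stacks */
  #else
  # define THREAD_SIZE    8192    /* 8K 'combo' stacks: hardirqs run on the task stack */
  #endif

  /* proposal: 8K task stacks plus separate IRQ stacks, unconditionally,
   * so the config option goes away */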

Opinions?

	Ingo