linux-kernel - [BUG] Unable to handle kernel NULL pointer dereference...as_move_to

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20070122183510.GA19905@andrew-vasquezs-computer.local>
Date:	Mon, 22 Jan 2007 10:35:10 -0800
From:	Andrew Vasquez <andrew.vasquez@...gic.com>
To:	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Cc:	Linux-SCSI Mailing List <linux-scsi@...r.kernel.org>,
	Nigel Kirkland <nigel.kirkland@...gic.com>
Subject: [BUG] Unable to handle kernel NULL pointer dereference...as_move_to_dispatch+0x11/0x135

All,

We've been trying to track down a nagging regression  seen during some
port-disable/enable testing.  The problem occurs anywhere from
20 minutes to 120 minutes of testing under very minimal I/O load:

	[ 1143.890598] Unable to handle kernel NULL pointer dereference at 0000000000000028 RIP: 
	[ 1143.896087]  [<ffffffff802ed04f>] as_move_to_dispatch+0x11/0x135
	[ 1143.904574] PGD 120ba067 PUD 31e8c067 PMD 0 
	[ 1143.908895] Oops: 0000 [1] SMP 
	[ 1143.912065] CPU 1 
	[ 1143.914092] Modules linked in: qla2xxx scsi_transport_fc
	[ 1143.919453] Pid: 8362, comm: dt Not tainted 2.6.19 #39
	[ 1143.924588] RIP: 0010:[<ffffffff802ed04f>]  [<ffffffff802ed04f>] as_move_to_dispatch+0x11/0x135
	[ 1143.933300] RSP: 0018:ffff8100021f7cf0  EFLAGS: 00010046
	[ 1143.938610] RAX: 0000000000000000 RBX: ffff8100261392f0 RCX: 0000000000000000
	[ 1143.945736] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8100261392f0
	[ 1143.952865] RBP: 0000000000000000 R08: ffffffff804d0928 R09: ffff8100351a96e0
	[ 1143.959992] R10: ffff8100351a96e0 R11: 0000000000000000 R12: 0000000000000000
	[ 1143.967118] R13: 0000000000000000 R14: 0000000000000282 R15: 0000000000000001
	[ 1143.974248] FS:  0000000000000000(0000) GS:ffff8100021c5930(0000) knlGS:0000000000000000
	[ 1143.982327] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
	[ 1143.988070] CR2: 0000000000000028 CR3: 000000003493b000 CR4: 00000000000006e0
	[ 1143.995198] Process dt (pid: 8362, threadinfo ffff8100207a2000, task ffff810023d86840)
	[ 1144.003103] Stack:  0000000000000000 ffff8100261392f0 0000000000000000 0000000000000001
	[ 1144.011183]  0000000000000000 ffffffff802ed591 0000000000000000 ffff810035795990
	[ 1144.018640]  ffff81000f11b3f8 ffff810035795990 ffff810035795990 ffffffff802e53d6
	[ 1144.025906] Call Trace:
	[ 1144.028548]  <IRQ>  [<ffffffff802ed591>] as_dispatch_request+0x39d/0x3aa
	[ 1144.035274]  [<ffffffff802e53d6>] elv_next_request+0x45/0x151
	[ 1144.041015]  [<ffffffff8037b3cf>] scsi_request_fn+0x7a/0x375
	[ 1144.046667]  [<ffffffff802e8fa0>] blk_run_queue+0x42/0x74
	[ 1144.052062]  [<ffffffff8037b2bc>] scsi_next_command+0x2d/0x39
	[ 1144.057805]  [<ffffffff8037b85b>] scsi_end_request+0xc0/0xce
	[ 1144.063460]  [<ffffffff8037ba03>] scsi_io_completion+0x149/0x351
	[ 1144.069464]  [<ffffffff8038aad6>] sd_rw_intr+0x247/0x256
	[ 1144.074770]  [<ffffffff80376b92>] scsi_finish_command+0x88/0x8d
	[ 1144.080686]  [<ffffffff8037c25e>] scsi_softirq_done+0xe7/0xef
	[ 1144.086429]  [<ffffffff802e614c>] blk_done_softirq+0x63/0x71
	[ 1144.092082]  [<ffffffff80226487>] __do_softirq+0x51/0xbd
	[ 1144.097394]  [<ffffffff8020a88c>] call_softirq+0x1c/0x28
	[ 1144.102701]  [<ffffffff8020bffa>] do_softirq+0x2e/0x94
	[ 1144.107836]  [<ffffffff8020c10e>] do_IRQ+0xae/0xc8
	[ 1144.112625]  [<ffffffff80209c81>] ret_from_intr+0x0/0xb
	[ 1144.117845]  <EOI>  [<ffffffff8027dfad>] __bio_add_page+0x1b8/0x1bf
	[ 1144.124135]  [<ffffffff8027ecb6>] blkdev_direct_IO+0x2c9/0x420
	[ 1144.129964]  [<ffffffff8021a1f4>] __activate_task+0x2d/0x3f
	[ 1144.135537]  [<ffffffff8040df0d>] _spin_unlock_irq+0x6/0x9
	[ 1144.141024]  [<ffffffff80241d86>] generic_file_direct_IO+0xa6/0xe8
	[ 1144.147196]  [<ffffffff80219efe>] __wake_up_common+0x42/0x6c
	[ 1144.152852]  [<ffffffff80242515>] generic_file_aio_read+0xcc/0x19d
	[ 1144.159029]  [<ffffffff8025d7cd>] do_sync_read+0xc8/0x10b
	[ 1144.164426]  [<ffffffff8023343c>] autoremove_wake_function+0x0/0x2e
	[ 1144.170684]  [<ffffffff8040df0d>] _spin_unlock_irq+0x6/0x9
	[ 1144.176166]  [<ffffffff8025c516>] do_filp_open+0x2d/0x3d
	[ 1144.181480]  [<ffffffff80259876>] poison_obj+0x35/0x44
	[ 1144.186620]  [<ffffffff8025e1c0>] vfs_read+0xac/0x14f
	[ 1144.191668]  [<ffffffff8025e322>] sys_read+0x45/0x6e
	[ 1144.196630]  [<ffffffff8020976e>] system_call+0x7e/0x83
	[ 1144.201852] 
	[ 1144.203348] 
	[ 1144.203349] Code: 8b 46 28 48 8b 56 78 25 01 80 00 00 83 f8 01 48 8d 46 78 41 
	[ 1144.212416] RIP  [<ffffffff802ed04f>] as_move_to_dispatch+0x11/0x135
	[ 1144.218783]  RSP <ffff8100021f7cf0>
	[ 1144.222272] CR2: 0000000000000028
	[ 1144.225866]  <3>BUG: sleeping function called from invalid context at kernel/rwsem.c:20
	[ 1144.233878] in_atomic():1, irqs_disabled():1
	[ 1144.238147] 
	[ 1144.238148] Call Trace:
	[ 1144.242088]  <IRQ>  [<ffffffff8021ae13>] __might_sleep+0xba/0xbc
	[ 1144.248109]  [<ffffffff80235c33>] down_read+0x15/0x1e
	[ 1144.253163]  [<ffffffff8022e430>] blocking_notifier_call_chain+0x13/0x36
	[ 1144.259864]  [<ffffffff802247af>] do_exit+0x22/0x843
	[ 1144.264823]  [<ffffffff80219259>] do_page_fault+0x708/0x775
	[ 1144.270394]  [<ffffffff8021433a>] smp_local_timer_interrupt+0x32/0x53
	[ 1144.276826]  [<ffffffff802143b0>] smp_apic_timer_interrupt+0x55/0x68
	[ 1144.283174]  [<ffffffff8020a336>] apic_timer_interrupt+0x66/0x70
	[ 1144.289178]  [<ffffffff8040e12d>] error_exit+0x0/0x84
	[ 1144.294226]  [<ffffffff802ed04f>] as_move_to_dispatch+0x11/0x135
	[ 1144.300227]  [<ffffffff802ed591>] as_dispatch_request+0x39d/0x3aa
	[ 1144.306315]  [<ffffffff802e53d6>] elv_next_request+0x45/0x151
	[ 1144.312058]  [<ffffffff8037b3cf>] scsi_request_fn+0x7a/0x375
	[ 1144.317713]  [<ffffffff802e8fa0>] blk_run_queue+0x42/0x74
	[ 1144.323108]  [<ffffffff8037b2bc>] scsi_next_command+0x2d/0x39
	[ 1144.328850]  [<ffffffff8037b85b>] scsi_end_request+0xc0/0xce
	[ 1144.334505]  [<ffffffff8037ba03>] scsi_io_completion+0x149/0x351
	[ 1144.340507]  [<ffffffff8038aad6>] sd_rw_intr+0x247/0x256
	[ 1144.345815]  [<ffffffff80376b92>] scsi_finish_command+0x88/0x8d
	[ 1144.351730]  [<ffffffff8037c25e>] scsi_softirq_done+0xe7/0xef
	[ 1144.357474]  [<ffffffff802e614c>] blk_done_softirq+0x63/0x71
	[ 1144.363127]  [<ffffffff80226487>] __do_softirq+0x51/0xbd
	[ 1144.368437]  [<ffffffff8020a88c>] call_softirq+0x1c/0x28
	[ 1144.373745]  [<ffffffff8020bffa>] do_softirq+0x2e/0x94
	[ 1144.378880]  [<ffffffff8020c10e>] do_IRQ+0xae/0xc8
	[ 1144.383669]  [<ffffffff80209c81>] ret_from_intr+0x0/0xb
	[ 1144.388890]  <EOI>  [<ffffffff8027dfad>] __bio_add_page+0x1b8/0x1bf
	[ 1144.395170]  [<ffffffff8027ecb6>] blkdev_direct_IO+0x2c9/0x420
	[ 1144.401001]  [<ffffffff8021a1f4>] __activate_task+0x2d/0x3f
	[ 1144.406569]  [<ffffffff8040df0d>] _spin_unlock_irq+0x6/0x9
	[ 1144.412051]  [<ffffffff80241d86>] generic_file_direct_IO+0xa6/0xe8
	[ 1144.418224]  [<ffffffff80219efe>] __wake_up_common+0x42/0x6c
	[ 1144.423881]  [<ffffffff80242515>] generic_file_aio_read+0xcc/0x19d
	[ 1144.430056]  [<ffffffff8025d7cd>] do_sync_read+0xc8/0x10b
	[ 1144.435451]  [<ffffffff8023343c>] autoremove_wake_function+0x0/0x2e
	[ 1144.441713]  [<ffffffff8040df0d>] _spin_unlock_irq+0x6/0x9
	[ 1144.447194]  [<ffffffff8025c516>] do_filp_open+0x2d/0x3d
	[ 1144.452503]  [<ffffffff80259876>] poison_obj+0x35/0x44
	[ 1144.457639]  [<ffffffff8025e1c0>] vfs_read+0xac/0x14f
	[ 1144.462688]  [<ffffffff8025e322>] sys_read+0x45/0x6e
	[ 1144.467650]  [<ffffffff8020976e>] system_call+0x7e/0x83
	[ 1144.472870] 
	[ 1144.474364] Kernel panic - not syncing: Aiee, killing interrupt handler!

Basically what is happening from the FC side is the initiator executes
a simple dt test:

        dt of=/dev/raw/raw1 procs=8 oncerr=abort bs=16k disable=stats limit=2m passes=1000000 pattern=iot dlimit=2048

against a single lun (a very basic Windows target mode driver).
During the test a port-enable, port-disable script is running agains
the switch's port that is connected to the target (this occurs every
sixty seconds (for a disabled duration of 2 seconds).  Additionally,
the target itself is set to LOGO (logout) or drop off the topology
every 30 seconds.

This test runs fine up to 2.6.19.  Breakage appears to have been introduced in
2.6.20-rc1 (our firmware-engineer noticed the problem in 2.6.20-rc4).
Unfortunately our efforts a basic bisections have been fruitless as according
to git-bisect the bad commit is:

	commit 029530f810dd5147f7e59b939eb22cfbe0beea12
	Author: Adrian Bunk <bunk@...sta.de>
	Date:   Wed Dec 13 00:35:16 2006 -0800

	    [PATCH] one more EXPORT_UNUSED_SYMBOL removal

	    Signed-off-by: Adrian Bunk <bunk@...sta.de>
	    Signed-off-by: Andrew Morton <akpm@...l.org>
	    Signed-off-by: Linus Torvalds <torvalds@...l.org>

which doesn't make much sense...  Here's the log...

	$ git bisect log
	git-bisect start
	# good: [c3fe6924620fd733ffe8bc8a9da1e9cde08402b3] Linux 2.6.19
	git-bisect good c3fe6924620fd733ffe8bc8a9da1e9cde08402b3
	# bad: [cc016448b0bf0764928275d034e367753bde8162] Linux v2.6.20-rc1
	git-bisect bad cc016448b0bf0764928275d034e367753bde8162
	# good: [b2c03941b50944a268ee4d5823872f220809a3ba] IPMI: Allow hot system interface remove
	git-bisect good b2c03941b50944a268ee4d5823872f220809a3ba
	# good: [de1ba09b214056365d9082982905b255caafb7a2] fault injection: documentation and scripts
	git-bisect good de1ba09b214056365d9082982905b255caafb7a2
	# good: [116140b7f5c9182c86a4e419f81684209357aea7] Merge branch 'upstream-linus' of master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/netdev-2.6
	git-bisect good 116140b7f5c9182c86a4e419f81684209357aea7
	# good: [775ba7ad491a154f99871fe603f03366e84ae159] Merge git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial
	git-bisect good 775ba7ad491a154f99871fe603f03366e84ae159
	# good: [69688262fb94e92a32f188b79c02dc32016d4927] update Tigran's email addresses
	git-bisect good 69688262fb94e92a32f188b79c02dc32016d4927
	# bad: [5faad620264290b17e80a8b0996b039ea0d5ac73] Merge branch 'for-linus' of git://brick.kernel.dk/data/git/linux-2.6-block
	git-bisect bad 5faad620264290b17e80a8b0996b039ea0d5ac73
	# bad: [3161986224a3faa8ccca3e665b7404d81e7ee3cf] fbdev: remove references to non-existent fbmon_valid_timings()
	git-bisect bad 3161986224a3faa8ccca3e665b7404d81e7ee3cf
	# bad: [c954e2a5d1c9662991a41282297ddebcadee0578] knfsd: nfsd4: make verify and nverify wrappers
	git-bisect bad c954e2a5d1c9662991a41282297ddebcadee0578
	# bad: [021d3a72459191a76e8e482ee4937ba6bc9fd712] knfsd: nfsd4: handling more nfsd_cross_mnt errors in nfsd4 readdir
	git-bisect bad 021d3a72459191a76e8e482ee4937ba6bc9fd712
	# bad: [b797b5beac966df5c5d96c0d39fe366f57135343] knfsd: svcrpc: fix gss krb5i memory leak
	git-bisect bad b797b5beac966df5c5d96c0d39fe366f57135343
	# bad: [b21a323710e77a27b2f66af901bd3640c30aba6e] remove the broken BLK_DEV_SWIM_IOP driver
	git-bisect bad b21a323710e77a27b2f66af901bd3640c30aba6e
	# bad: [029530f810dd5147f7e59b939eb22cfbe0beea12] one more EXPORT_UNUSED_SYMBOL removal
	git-bisect bad 029530f810dd5147f7e59b939eb22cfbe0beea12


Not sure how much help it is, but while trying to instrument
as-iosched.c, we can see that the failure has a fairly stable
signature:

	[15280.813479] as_dispatch_request: ad=ffff810038704da8 ; reads=0 ; writes=1 ; dir=0 ;fifo_list[async]=ffff81003f4071a0 ad-new_batch=0 ad->change_batch=0.
	[15280.827032] as_dispatch_request: q=ffff81003fdae050 ; #_reqs=128 ; lmerge=ffff81003f4071a0 ;.

which means:

	static int as_dispatch_request(request_queue_t *q, int force)
	{
		...
		const int writes = !list_empty(&ad->fifo_list[REQ_ASYNC]);

'writes' is set, the batch_data_dir has not changed
(ad->batch_data_dir != REQ_SYNC) in the following segment:

		...
		if (writes) {
	dispatch_writes:
			BUG_ON(RB_EMPTY_ROOT(&ad->sort_list[REQ_ASYNC]));

			if (ad->batch_data_dir == REQ_SYNC) {
				ad->changed_batch = 1;

				/*
				 * new_batch might be 1 when the queue runs out of
				 * reads. A subsequent submission of a write might
				 * cause a change of batch before the read is finished.
				 */
				ad->new_batch = 0;
			}
			ad->batch_data_dir = REQ_ASYNC;
			ad->current_write_count = ad->write_batch_count;
			ad->write_batch_idled = 0;
			rq = ad->next_rq[ad->batch_data_dir];
			goto dispatch_request;
		}

ad->next_rq[ad->batch_data_dir] is NULL, and is then passed down to
as_move_to_dispatch() where the first dereference of rq:

	static void as_move_to_dispatch(struct as_data *ad, struct request *rq)
	{
		const int data_dir = rq_is_sync(rq);

borks the machine.  What's odd (perhaps it's just our rudimentary
understanding of AS) is that there are segments of code where
ad->next_rq[REQ_ASYNC] is checked against NULL (in 'writes' case it is
not).

Anyway, any ideas or hints?  Attached is the .config used.

Thanks,
Andrew Vasquez

View attachment "config.txt" of type "text/plain" (25518 bytes)