Message-ID: <20111107122327.GA2699@osiris.boeblingen.de.ibm.com>
Date: Mon, 7 Nov 2011 13:23:27 +0100
From: Heiko Carstens <heiko.carstens@...ibm.com>
To: Mike Snitzer <snitzer@...hat.com>
Cc: "Jun'ichi Nomura" <j-nomura@...jp.nec.com>,
James Bottomley <James.Bottomley@...senPartnership.com>,
Steffen Maier <maier@...ux.vnet.ibm.com>,
"linux-scsi@...r.kernel.org" <linux-scsi@...r.kernel.org>,
Jens Axboe <axboe@...nel.dk>, Hannes Reinecke <hare@...e.de>,
Linux Kernel <linux-kernel@...r.kernel.org>,
Alan Stern <stern@...land.harvard.edu>,
Thadeu Lima de Souza Cascardo <cascardo@...ux.vnet.ibm.com>,
"Taraka R. Bodireddy" <tarak.reddy@...ibm.com>,
"Seshagiri N. Ippili" <seshagiri.ippili@...ibm.com>,
"Manvanthara B. Puttashankar" <mputtash@...ibm.com>,
Jeff Moyer <jmoyer@...hat.com>,
Shaohua Li <shaohua.li@...el.com>, gmuelas@...ibm.com
Subject: Re: [GIT PULL] Queue free fix (was Re: [PATCH] block: Free queue
resources at blk_release_queue())
On Fri, Nov 04, 2011 at 09:30:52AM -0400, Mike Snitzer wrote:
> > FWIW, yet another use-after-free crash, this time however in multipath_end_io:
> >
> > [96875.870593] Unable to handle kernel pointer dereference at virtual kernel address 6b6b6b6b6b6b6000
> > [96875.870602] Oops: 0038 [#1]
> > [96875.870674] PREEMPT SMP DEBUG_PAGEALLOC
> > [96875.870683] Modules linked in: dm_round_robin sunrpc ipv6 qeth_l2 binfmt_misc dm_multipath scsi_dh dm_mod qeth ccwgroup [last unloaded: scsi_wait_scan]
> > [96875.870722] CPU: 2 Tainted: G W 3.0.7-50.x.20111024-s390xdefault #1
> > [96875.870728] Process udevd (pid: 36697, task: 0000000072c8a3a8, ksp: 0000000057c43868)
> > [96875.870732] Krnl PSW : 0704200180000000 000003e001347138 (multipath_end_io+0x50/0x140 [dm_multipath])
> > [96875.870746] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:2 PM:0 EA:3
> > [96875.870751] Krnl GPRS: 0000000000000000 000003e000000000 6b6b6b6b6b6b6b6b 00000000717ab940
> > [96875.870755] 0000000000000000 00000000717abab0 0000000000000002 0700000000000008
> > [96875.870759] 0000000000000002 0000000000000000 0000000058dd37a8 000000006f845478
> > [96875.870764] 000003e0012e1000 000000005613d1f0 000000007a737bf0 000000007a737ba0
> > [96875.870768] Krnl Code: 000003e00134712a: b90200dd ltgr %r13,%r13
> > [96875.870793] 000003e00134712e: a7840017 brc 8,3e00134715c
> > [96875.870800] 000003e001347132: e320d0100004 lg %r2,16(%r13)
> > [96875.870809] >000003e001347138: e31020180004 lg %r1,24(%r2)
> > [96875.870818] 000003e00134713e: e31010580004 lg %r1,88(%r1)
> > [96875.870827] 000003e001347144: b9020011 ltgr %r1,%r1
> > [96875.870835] 000003e001347148: a784000a brc 8,3e00134715c
> > [96875.870841] 000003e00134714c: 41202018 la %r2,24(%r2)
> > [96875.870889] Call Trace:
> > [96875.870892] ([<0700000000000008>] 0x700000000000008)
> > [96875.870897] [<000003e0012e3662>] dm_softirq_done+0x9a/0x140 [dm_mod]
> > [96875.870915] [<000000000040d29c>] blk_done_softirq+0xd4/0xf0
> > [96875.870925] [<00000000001587c2>] __do_softirq+0xda/0x398
> > [96875.870932] [<000000000010f47e>] do_softirq+0xe2/0xe8
> > [96875.870940] [<0000000000158e2c>] irq_exit+0xc8/0xcc
> > [96875.870945] [<00000000004ceb48>] do_IRQ+0x910/0x1bfc
> > [96875.870953] [<000000000061a164>] io_return+0x0/0x16
> > [96875.870961] [<000000000019c84e>] lock_acquire+0xd2/0x204
> > [96875.870969] ([<000000000019c836>] lock_acquire+0xba/0x204)
> > [96875.870974] [<0000000000615f8e>] mutex_lock_killable_nested+0x92/0x520
> > [96875.870983] [<0000000000292796>] vfs_readdir+0x8a/0xe4
> > [96875.870992] [<00000000002928e0>] SyS_getdents+0x60/0xe8
> > [96875.870999] [<0000000000619af2>] sysc_noemu+0x16/0x1c
> > [96875.871024] [<000003fffd1ec83e>] 0x3fffd1ec83e
> > [96875.871028] INFO: lockdep is turned off.
> > [96875.871031] Last Breaking-Event-Address:
> > [96875.871037] [<000003e0012e3660>] dm_softirq_done+0x98/0x140 [dm_mod]
[...]
> OK, thanks for the backstory.
>
> That is the same type of testing we've been doing with some partners
> for RHEL6.2 with the qla2xxx driver. They have seen the same crash that
> you originally reported here: https://lkml.org/lkml/2011/10/31/64
>
> The really interesting observation that was made is that the qla2xxx
> driver was made lockless in RHEL6.2. We've found that reverting the
> qla2xxx lockless changes eliminates the problems seen with it and I/O
> stress testing with multipath path failures.
>
> The zfcp driver was also made lockless upstream, via this commit:
> e55f875 [SCSI] zfcp: Issue FCP command without holding SCSI host_lock
>
> It would be great if you could try reverting e55f875 and see how your
> testing goes.
OK, we did that. As expected, testing now sometimes runs into deadlocks
(the reverted commit fixed a deadlock), but we still see the same crash
in multipath_end_io() as in the oops quoted above.
> If doing so resolves the crashes for you then the post mortem on why
> these lockless SCSI driver changes are causing such odd multipath
> completion failures is going to be "fun" ;)
Must be something different then...
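
For anyone triaging similar reports: the use-after-free reading of the oops above rests on the 0x6b byte pattern, which is the value the slab allocator writes into freed objects when slab poisoning is enabled (POISON_FREE in include/linux/poison.h). The following is only a rough sketch of how one might scan an s390 oops for registers holding that pattern. The function names and the parsing rules (a "Krnl GPRS:" line followed by continuation lines of four 16-digit hex words) are assumptions based on the dump quoted in this mail, not an existing tool.

```python
import re

# Byte written into freed slab objects when poisoning is enabled
# (POISON_FREE in include/linux/poison.h).
POISON_FREE = 0x6B

# One 64-bit register value as printed in the s390 register dump.
HEX64 = re.compile(r"\b[0-9a-f]{16}\b")


def is_free_poison(value, width_bytes=8):
    """Return True if every byte of `value` equals POISON_FREE."""
    return value == int.from_bytes(bytes([POISON_FREE]) * width_bytes, "big")


def poisoned_gprs(oops_text):
    """Return the indices of GPRs that hold the free-poison pattern.

    Assumed layout (taken from the quoted oops): the line containing
    "Krnl GPRS:" and the lines after it each carry four 16-digit hex
    words; the first line that does not ends the register dump.
    """
    regs = []
    collecting = False
    for line in oops_text.splitlines():
        if "Krnl GPRS:" in line:
            collecting = True
        elif not collecting:
            continue
        words = HEX64.findall(line)
        if len(words) != 4:
            break
        regs.extend(int(word, 16) for word in words)
    return [index for index, value in enumerate(regs) if is_free_poison(value)]
```

Fed the register dump quoted above, this flags register 2, which is the base register of the faulting instruction `lg %r1,24(%r2)`. That is consistent with a load through a pointer read from already freed memory.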