Message-ID: <alpine.LFD.2.02.1105041124080.3005@ionos>
Date: Wed, 4 May 2011 11:52:44 +0200 (CEST)
From: Thomas Gleixner <tglx@...utronix.de>
To: Ingo Molnar <mingo@...e.hu>
cc: Linus Torvalds <torvalds@...ux-foundation.org>,
Jens Axboe <axboe@...nel.dk>,
Andrew Morton <akpm@...ux-foundation.org>,
werner <w.landgraf@...ru>, "H. Peter Anvin" <hpa@...or.com>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: [block IO crash] Re: 2.6.39-rc5-git2 boot crashs
On Wed, 4 May 2011, Ingo Molnar wrote:
> 1415         if (!nr_sectors)
> 1416                 return 0;
> 1417
> 1418         /* Test device or partition size, when known. */
> 1419         maxsector = i_size_read(bio->bi_bdev->bd_inode) >> 9;   <==== [ **CRASH** ]
> 1420         if (maxsector) {
> 1421                 sector_t sector = bio->bi_sector;
> 1422
> 1423                 if (maxsector < nr_sectors || maxsector - nr_sectors < sector) {
>
> bio->bi_bdev has become NULL?
>
> I do not think the _cond_resched() was called, judging from stack contents. But
> we just had an IRQ:
>
> [<c1d74030>] ? common_interrupt+0x30/0x40
>
> So we might have raced with block IO IRQ queue-completion/submission activities.
>
> But maybe it was a reschedule after all, just the stack does not carry any
> traces of it anymore. IRQs do not clear ->bi_bdev, right? Unless the bio
> refcounts are wrong and an IRQ's completion actually frees the bio, right?

Looking at the call chain that's impossible:

    generic_make_request
    submit_bio
    submit_bh

submit_bh does:

    bio = bio_alloc()
    bio_get(bio)
    submit_bio(bio)
    bio_put(bio)

So that bio is not yet known to anything other than the calling
code.
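
To make the reference counting in that sequence concrete, here is a
minimal userspace model. It is a sketch, not kernel code: struct
fake_bio and all fake_* helpers are hypothetical names, and it assumes
that the allocation hands back one reference and that I/O completion
drops exactly one. Even with completion simulated at the earliest
possible point, the submitter's extra reference keeps the object (and
its device pointer) alive until the final put.

```c
#include <stdlib.h>

/* Hypothetical stand-in for struct bio: a refcount and a device pointer. */
struct fake_bio {
	int refcnt;
	const char *bdev;	/* stand-in for bi_bdev */
	int *freed;		/* caller-owned flag, set when the last ref is dropped */
};

/* Assumption: allocation returns the object holding one reference. */
static struct fake_bio *fake_bio_alloc(const char *bdev, int *freed)
{
	struct fake_bio *bio = malloc(sizeof(*bio));

	if (!bio)
		return NULL;
	bio->refcnt = 1;
	bio->bdev = bdev;
	bio->freed = freed;
	*freed = 0;
	return bio;
}

static void fake_bio_get(struct fake_bio *bio)
{
	bio->refcnt++;
}

static void fake_bio_put(struct fake_bio *bio)
{
	if (--bio->refcnt == 0) {
		*bio->freed = 1;
		free(bio);
	}
}

/* Assumption: completion drops the reference handed out by the allocation. */
static void fake_complete(struct fake_bio *bio)
{
	fake_bio_put(bio);
}

/*
 * Mirrors the alloc/get/submit/put sequence. Returns 1 if the object was
 * still alive with its device pointer set after the simulated completion,
 * 0 if not, -1 on allocation failure.
 */
static int fake_submit_bh(const char *bdev, int *freed)
{
	struct fake_bio *bio = fake_bio_alloc(bdev, freed);
	int alive;

	if (!bio)
		return -1;
	fake_bio_get(bio);	/* refcnt: 2 */
	fake_complete(bio);	/* refcnt: 1, worst case: completes immediately */
	alive = !*freed && bio->bdev != NULL;
	fake_bio_put(bio);	/* refcnt: 0, object freed here */
	return alive;
}
```

Calling fake_submit_bh("sda1", &freed) in this model returns 1 and
leaves freed set to 1 only after the last put.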

One possibility is that bh->b_bdev is NULL when submit_bh() is called,
which I think is rather unlikely, but it can be easily verified with:

--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2887,6 +2887,7 @@ int submit_bh(int rw, struct buffer_head * bh)
 	BUG_ON(!bh->b_end_io);
 	BUG_ON(buffer_delay(bh));
 	BUG_ON(buffer_unwritten(bh));
+	BUG_ON(!bh->b_bdev);
 
 	/*
 	 * Only clear out a write error when rewriting

But I rather suspect that CONFIG_SLUB=y is the thing we need to look
at. The lockless fastpath cmpxchg comes to mind.

Either we generate broken code with the options that ELAN selects, or
that combo triggers some hidden problem in SLUB.
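
For readers unfamiliar with the pattern, this is roughly what a
cmpxchg-based lockless freelist looks like. It is a generic userspace
sketch using C11 atomics, not SLUB's code: struct obj, struct freelist,
fl_push() and fl_pop() are hypothetical names. It is also deliberately
simplified; it has no protection against ABA, which real
implementations typically handle by pairing the pointer with some kind
of sequence tag. The point is only that correctness rests entirely on
the compare-and-exchange being atomic. If it is not, two callers can
pop the same object, and one of them then scribbles over memory the
other believes it owns.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical free object: the link lives inside the object itself. */
struct obj {
	struct obj *next;
};

/* Hypothetical freelist head, updated only via compare-and-exchange. */
struct freelist {
	_Atomic(struct obj *) head;
};

/* Push: link the object in front of the current head, retry on a race. */
static void fl_push(struct freelist *fl, struct obj *o)
{
	struct obj *old = atomic_load(&fl->head);

	do {
		o->next = old;
	} while (!atomic_compare_exchange_weak(&fl->head, &old, o));
}

/*
 * Pop: swing head to head->next if head is unchanged since it was read.
 * On failure 'old' is reloaded with the current head and we retry.
 * Returns NULL when the list is empty.
 */
static struct obj *fl_pop(struct freelist *fl)
{
	struct obj *old = atomic_load(&fl->head);

	while (old &&
	       !atomic_compare_exchange_weak(&fl->head, &old, old->next))
		;
	return old;
}
```

Single-threaded, two pushes followed by three pops return the objects
in LIFO order and then NULL.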
Thanks,
tglx