[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20070418123757.GC3796@kernel.dk>
Date: Wed, 18 Apr 2007 14:37:57 +0200
From: Jens Axboe <jens.axboe@...cle.com>
To: Neil Brown <neilb@...e.de>
Cc: Chuck Ebbert <cebbert@...hat.com>,
Brad Campbell <brad@...p.net.au>,
lkml <linux-kernel@...r.kernel.org>
Subject: Re: [OOPS] 2.6.21-rc6-git5 in cfq_dispatch_insert
On Tue, Apr 17 2007, Neil Brown wrote:
> On Monday April 16, cebbert@...hat.com wrote:
> >
> > cfq_dispatch_insert() was called with rq == 0. This one is getting really
> > annoying... and md is involved again (RAID0 this time.)
>
> Yeah... weird.
> RAID0 is so light-weight and so different from RAID1 or RAID5 that I
> feel fairly safe concluding that the problem isn't in or near md.
> But that doesn't help you.
Well the fact is that we have 3-4 distinct reports of this oops, and all
of them are using md. No reports have been filed where md isn't managing
the disks. While this doesn't conclusively put the finger of blame on
md, it is still likely. It could bug a CFQ bug too of course, or (more
likely), some bad interaction between md and CFQ.
> This really feels like a locking problem.
Very much.
> The problem occurs when ->next_rq is NULL, but ->sort_list.rb_node is
> not NULL. That happens plenty of times in the code (particularly as
> the first request is inserted) but always under ->queue_lock so it
> should never be visible to cfq_dispatch_insert..
>
> Except that drivers/scsi/ide-scsi.c:idescsi_eh_reset calls
> elv_next_request which could ultimately call __cfq_dispatch_requests
> without taking ->queue_lock (that I can see). But you probably aren't
> using ide-scsi (does anyone?).
>
> Given that interrupts are always disabled when queue_lock is taken, it
> might be useful to add
> WARN_ON(!irqs_disabled());
> every time ->next_rq is set.
> Something like the following.
>
> It might show something useful.... if we are lucky.
I had something similar for generic_unplug_request() as well, but didn't
see/hear any reports of it being tried out. Here's a complete debugging
patch for this and other potential dangers.
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index b6491c0..8f749aa 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -587,8 +587,11 @@ static void cfq_remove_request(struct request *rq)
{
struct cfq_queue *cfqq = RQ_CFQQ(rq);
- if (cfqq->next_rq == rq)
+ BUG_ON(!irqs_disabled());
+ if (cfqq->next_rq == rq) {
cfqq->next_rq = cfq_find_next_rq(cfqq->cfqd, cfqq, rq);
+ BUG_ON(!cfqq->next_rq && !RB_EMPTY_ROOT(&cfqq->sort_list));
+ }
list_del_init(&rq->queuelist);
cfq_del_rq_rb(rq);
@@ -1637,6 +1640,7 @@ cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
/*
* check if this request is a better next-serve candidate)) {
*/
+ BUG_ON(!irqs_disabled());
cfqq->next_rq = cfq_choose_req(cfqd, cfqq->next_rq, rq);
BUG_ON(!cfqq->next_rq);
diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c
index 3de0695..c16863e 100644
--- a/block/ll_rw_blk.c
+++ b/block/ll_rw_blk.c
@@ -1602,6 +1602,8 @@ EXPORT_SYMBOL(__generic_unplug_device);
**/
void generic_unplug_device(request_queue_t *q)
{
+ BUG_ON(irqs_disabled());
+
spin_lock_irq(q->queue_lock);
__generic_unplug_device(q);
spin_unlock_irq(q->queue_lock);
--
Jens Axboe
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists