Message-ID: <20120621203615.GE4642@google.com>
Date: Thu, 21 Jun 2012 13:36:15 -0700
From: Tejun Heo <tj@...nel.org>
To: Vivek Goyal <vgoyal@...hat.com>
Cc: Josh Hunt <joshhunt00@...il.com>, Jens Axboe <axboe@...nel.dk>,
linux-kernel@...r.kernel.org
Subject: Re: multi-second application stall in open()
Hey, Vivek.

On Thu, Jun 21, 2012 at 04:32:17PM -0400, Vivek Goyal wrote:
> Here we deleted queue 20720 and did nothing for .6 seconds and from
> previous logs it is visible that writes are pending and queued.
>
> For some reason cfq_schedule_dispatch() did not lead to kicking queue
> or queue was kicked but somehow write queue was not selected for
> dispatch (A case of corrupt data structures?).
>
> Are you able to reproduce this issue on latest kernels (3.5-rc2?). I would
> say put some logs in select_queue() and see where did it bail out. That
> will confirm that select queue was called and can also give some details
> why we did not select async queue for dispatch. (Note: select_queue is called
> multiple times so putting trace point there makes logs very verbose).

Some people are putting watchdog timers in the block layer to kick cfq
when it stalls with pending requests. The cfq code there has diverged
quite a bit from upstream, so I have no idea whether it's caused by the
same issue. The symptom sounds exactly the same, though. So, yeah, I
think it isn't too unlikely that we have a cfq logic bug leading to
stalls. :(
--
tejun