Date:	Thu, 21 Jun 2012 16:28:24 -0500
From:	Josh Hunt <joshhunt00@...il.com>
To:	Tejun Heo <tj@...nel.org>
Cc:	Vivek Goyal <vgoyal@...hat.com>, Jens Axboe <axboe@...nel.dk>,
	linux-kernel@...r.kernel.org
Subject: Re: multi-second application stall in open()

On Thu, Jun 21, 2012 at 3:36 PM, Tejun Heo <tj@...nel.org> wrote:
> Hey, Vivek.
>
> On Thu, Jun 21, 2012 at 04:32:17PM -0400, Vivek Goyal wrote:
>> Here we deleted queue 20720 and did nothing for .6 seconds and from
>> previous logs it is visible that writes are pending and queued.
>>
>> For some reason cfq_schedule_dispatch() did not lead to kicking queue
>> or queue was kicked but somehow write queue was not selected for
>> dispatch (A case of corrupt data structures?).
>>
>> Are you able to reproduce this issue on latest kernels (3.5-rc2?). I would
>> say put some logs in select_queue() and see where did it bail out. That
>> will confirm that select queue was called and can also give some details
>> why we did not select async queue for dispatch. (Note: select_queue is called
>> multiple times so putting trace point there makes logs very verbose).
>
> Some people are putting in watchdog timers in block layer to kick cfq
> when it stalls with pending requests.  The cfq code there has diverged
> quite a bit from upstream so I have no idea whether it's caused by the
> same issue.  The symptom sounds exactly the same tho.  So, yeah, I
> think it isn't too unlikely that we have a cfq logic bug leading to
> stalls.  :(
>
> --
> tejun
Tejun,

When you say the code has diverged from upstream, do you mean from 3.0
to 3.5? Or maybe I'm misunderstanding what you're getting at. Also, if
you have any links to the watchdog timer code you're referring to, I
would appreciate it.

Thanks
--
Josh
