lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Mon, 25 Jun 2012 18:05:57 -0500
From:	Josh Hunt <joshhunt00@...il.com>
To:	Vivek Goyal <vgoyal@...hat.com>
Cc:	Jens Axboe <axboe@...nel.dk>, linux-kernel@...r.kernel.org,
	tj@...nel.org
Subject: Re: multi-second application stall in open()

On Mon, Jun 25, 2012 at 4:18 PM, Vivek Goyal <vgoyal@...hat.com> wrote:
> On Mon, Jun 25, 2012 at 11:22:32AM -0500, Josh Hunt wrote:
>> On Mon, Jun 25, 2012 at 8:30 AM, Vivek Goyal <vgoyal@...hat.com> wrote:
>> > On Fri, Jun 22, 2012 at 04:34:07PM -0500, Josh Hunt wrote:
>> >
>> > [..]
>> >> Shouldn't the queue stay on the RR list until it is empty?
>> >
>> > This does look odd. cfqq should stay on service tree as long as it has
>> > requests.
>> >
>> > Can you attach the full log again. Also make sure that blktrace is not
>> > dropping any trace events.
>> >
>> > In slice_expire() we check following.
>> >
>> >        if (cfq_cfqq_on_rr(cfqq) && RB_EMPTY_ROOT(&cfqq->sort_list))
>> >                cfq_del_cfqq_rr(cfqd, cfqq);
>> >
>> > So for some reason RB_EMPTY_ROOT() is returning true. But we must have
>> > added the request and it should not have been empty.
>> >
>> > cfq_insert_request()
>> >  cfq_add_rq_rb()
>> >    elv_rb_add()
>> >
>> > So may be little more tracing after request addition will help. Just check
>> > that RB_EMPTY_ROOT() is not true after addition of request and also print
>> > number of requests queued.
>> >
>> > In slice_expired() we can probably put a BUG_ON() which checks following.
>> >
>> > BUG_ON(RB_EMPTY_ROOT(&cfqq->sort_list) && (cfqq->queued[0] || cfqq->queued[1]));
>> >
>> > Thanks
>> > Vivek
>>
>> Vivek
>>
>> First off thanks for all the time you've spent helping me on this :)
>>
>> I'm attaching the log. I will add more instrumentation based on your
>> mail and let you know the results.
>
> I suspect that cfq5917 are actually two queues. One sync and other async.
> In latest kernels, trace messages append "A" or "S" in queue name to
> represent async/sync queue. Why do I think so.
>
> - I see that cfq5917 has dispatched both "W" and "R" requests. "R" don't
>  get queued in sync queues and vice a versa.
>
> - Once cfq5917 has preempted cfq5917. A queue can't preempt itself.
>
>  8,0    1   296615  4847.030318652  5917  I   R 1572951 + 8 (    2793)
> [syslogd]
>  8,0    1        0  4847.030321446     0  m   N cfq5917 insert_request
>  8,0    1        0  4847.030323122     0  m   N cfq5917 add_to_rr
>  8,0    1        0  4847.030325636     0  m   N cfq5917 preempt
>  8,0    1        0  4847.030327033     0  m   N cfq5917 slice expired t=1
>  8,0    1        0  4847.030328710     0  m   N cfq5917 resid=39
>  8,0    1        0  4847.030331783     0  m   N cfq5917 sl_used=1 disp=7
> charge=1 iops=0 sect=96
>
> Hence I think that it is just a chance that both sync and async queues
> have got same cfqq->pid. If you just append anohter character to represent
> sync and async property of queue, that can cofirm it.
>
> Thanks
> Vivek

You are right about the sync/async queues :) The idea that the queue
was being deleted with requests still on it was incorrect. I've
attached an updated log with some more details. It's starting to seem
like the problem is that the kblockd thread is not kicking the queue.
I will try and look in that direction some more.

-- 
Josh

Download attachment "sda.parsed-stall" of type "application/octet-stream" (288725 bytes)

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ