Message-ID: <CAKA=qzYmh8DD7J__WWDyX_2fFQbSx+M=k4ttqRTD3M8nao-WVg@mail.gmail.com>
Date: Thu, 21 Jun 2012 16:11:18 -0500
From: Josh Hunt <joshhunt00@...il.com>
To: Vivek Goyal <vgoyal@...hat.com>
Cc: Jens Axboe <axboe@...nel.dk>, linux-kernel@...r.kernel.org,
tj@...nel.org
Subject: Re: multi-second application stall in open()
On Thu, Jun 21, 2012 at 3:32 PM, Vivek Goyal <vgoyal@...hat.com> wrote:
> On Thu, Jun 21, 2012 at 02:26:13PM -0500, Josh Hunt wrote:
>> On Thu, Mar 8, 2012 at 6:16 PM, Josh Hunt <joshhunt00@...il.com> wrote:
>> > On Thu, Mar 8, 2012 at 5:40 PM, Vivek Goyal <vgoyal@...hat.com> wrote:
>> >> On Thu, Mar 08, 2012 at 04:22:31PM -0600, Josh Hunt wrote:
>> >>
>> >> [..]
>> >>> A crude bisection seems to show that if I revert "blkio: Set
>> >>> must_dispatch only if we decided to not dispatch the request"
>> >>> (bf7919371025412978268efca4b09dd847acb395) I no longer see the stalls
>> >>> in 2.6.35. However this does not seem to solve the problem if I revert
>> >>> it in 2.6.38.
>> >>
>> >> Strange. If this was the problem it should have fixed it in 2.6.38 also.
>> >> BTW, was the blktrace you sent from the 2.6.38 or the 2.6.35 kernel?
>> >>
>> >
>> > The blktrace is from 2.6.35. I just attached a portion of it earlier
>> > b/c I thought it would be too much to send the whole thing. I've gone
>> > ahead and attached the full output (bzip'd), along with the same
>> > workload with slice_idle set to 0 for comparison.
>> >
>> >>>
>> >>> By setting slice_idle to 0, does this basically disable plugging?
>> >>
>> >> It disables idling and not plugging.
>> >
>> > OK that's what I thought. I'm just confused about the differences in
>> > the blktrace outputs wrt plugging with slice_idle set to 0.
>> >
>> >>
>> >>> Based on the blktrace info it seems that something is going wrong with
>> >>> plugging with my testcase. I'm just wondering why setting slice_idle
>> >>> to 0 seems to resolve my issue? Also, since we see unplugs in the
>> >>> blktrace how could the requests still not be getting sent to the disk?
>> >>
>> >> Unplug will just try to kick the queue. That does not mean that the
>> >> request will be dispatched. And that's the question: why are we not
>> >> dispatching requests?
>> >>
>> >> I had another look at the traces and I think it is not just async writes;
>> >> there was a sync write request queued as well, and we did not dispatch
>> >> that for a long time either.
>> >>
>> >> Added request here.
>> >>
>> >> 8,0 1 36921 5028.492019664 162 A WS 63 + 8 <- (8,1) 0
>> >> 8,0 1 36922 5028.492021620 162 Q WS 63 + 8 [sync_supers]
>> >> 8,0 1 36923 5028.492029721 162 G WS 63 + 8 [sync_supers]
>> >> 8,0 1 36924 5028.492040617 162 I W 63 + 8 ( 10896) [sync_supers]
>> >> 8,0 1 0 5028.492044807 0 m N cfq162 insert_request
>> >> 8,0 1 0 5028.492046763 0 m N cfq162 add_to_rr
>> >>
>> >> And after a long time we dispatched the request.
>> >>
>> >> 8,0 0 0 5050.116841906 0 m N cfq162 set_active wl_prio:0 wl_type:1
>> >> 8,0 0 0 5050.116844979 0 m N cfq162 fifo=ffff8800e8787aa0
>> >> 8,0 0 0 5050.116846655 0 m N cfq162 dispatch_insert
>> >> 8,0 0 0 5050.116849728 0 m N cfq162 dispatched a request
>> >> 8,0 0 0 5050.116851683 0 m N cfq162 activate rq, drv=1
>> >> 8,0 0 36518 5050.116853360 166 D W 63 + 8 (21624812743) [kblockd/0]
>> >> 8,0 0 36519 5050.117236650 9671 C W 63 + 8 ( 383290) [0]
>> >>
>> >> So most likely it is not a case of async requests being starved by a sync
>> >> request.
>> >>
>> >
>> > I wanted to point out again that this is a very small I/O load, on average
>> > only a few KB/s.
>> >
>> >> Are you using any of the blk cgroup stuff?
>> >
>> > No, none of that is enabled in the kernel.
>> >
>> >>
>> >> Can you put some more trace messages to figure out what's happening.
>> >> I think you can try putting some trace messages in following functions.
>> >>
>> >> __blk_run_queue()
>> >> cfq_select_queue()
>> >>
>> >> and try to narrow down why CFQ refuses to dispatch the request when
>> >> this happens.
>> >
>> > Sure. I'll add some more traces here and see if it sheds any light on the issue.
>> >
>> > Thanks again for your help.
>> >
>> >>
>> >> Thanks
>> >> Vivek
>> >
>> > --
>> > Josh
>>
>> It looks like I am hitting this issue again. I upgraded our system to
>> the 3.0.13 kernel and that appeared to resolve the stalls I was seeing.
>> Unfortunately they are still occurring, just less frequently now.
>>
>> I've gone ahead and instrumented cfq_select_queue and have some more
>> information. To quickly recap my issue: using CFQ we are seeing an
>> application's async writes stall for 5-15s. This is on a sata_svw
>> controller with a rotating drive which does not have NCQ enabled. If I
>> change slice_idle to 0 the stalls appear to go away. I've also tried
>> using the deadline scheduler and do not see the issue with that. I dug
>> through CFQ and identified the code which gets bypassed when slice_idle
>> is set to 0, allowing my workload to run without stalling (the
>> cfq_log_cfqq() call is something I added for my debugging):
>>
>> 	if (cfqq->dispatched && cfq_should_idle(cfqd, cfqq)) {
>> 		service_tree = cfqq->service_tree;
>> 		cfq_log_cfqq(cfqd, cfqq, "keep_queue st->count:%d, dispatch:%u",
>> 			     service_tree->count, cfqq->dispatched);
>> 		cfqq = NULL;
>> 		goto keep_queue;
>> 	}
>>
>> When I have slice_idle set to 0, this gets bypassed and my workload
>> runs fine. I've verified this by removing this code and running the
>> workload with that kernel. I'm not saying it should be removed, just
>> that bypassing this code helps my workload for whatever reason.
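
As an aside, the reason slice_idle=0 skips that branch appears to be that
cfq_should_idle() bails out early when the tunable is zero, so the
keep_queue path above is never taken. Roughly (this is paraphrased from the
3.0-era cfq-iosched.c, so the surrounding lines may differ):

	static bool cfq_should_idle(struct cfq_data *cfqd, struct cfq_queue *cfqq)
	{
		...
		/* idling is disabled entirely when slice_idle is set to 0 */
		if (!cfqd->cfq_slice_idle)
			return false;
		...
	}

which would be consistent with the "keep_queue" messages disappearing from
my slice_idle=0 runs.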
>>
>> Here are some snippets from my blktrace log (attached):
>>
>> Insert:
>> 8,0 1 0 4466.115833485 0 m N cfq5864 insert_request
>> 8,0 1 0 4466.115835161 0 m N cfq20720 Idling. st->count:2, hw_tag:0
>> 8,0 1 1499412 4466.115836278 6023 I W 7078615 + 256 ( 311495) [flush-8:0]
>> 8,0 1 0 4466.115837675 0 m N cfq5864 insert_request
>> 8,0 1 0 4466.115840190 0 m N cfq20720 Idling. st->count:2, hw_tag:0
>> 8,0 1 1499413 4466.115841307 6023 U N [flush-8:0] 8
>> 8,0 1 0 4466.115842983 0 m N cfq20720 Idling. st->count:2, hw_tag:0
>> 8,0 1 0 4466.115844660 0 m N cfq20720 keep_queue st->count:2, dispatch:1
>>
>> Dispatch:
>> 8,0 0 0 4478.127274376 0 m N cfq5864 complete rqnoidle 0
>> 8,0 0 0 4478.127318796 0 m N cfq5864 Not idling. st->count:1
>> 8,0 0 0 4478.127320472 0 m N cfq5864 dispatch_insert
>> 8,0 0 0 4478.127322148 0 m N cfq5864 dispatched a request
>> 8,0 0 0 4478.127324104 0 m N cfq5864 activate rq, drv=2
>> 8,0 0 2876000 4478.127325221 7232 D W 7078615 + 256 (12011488943) [testd]
>> 8,0 0 2876001 4478.128414478 7232 C W 7078135 + 264 ( 1250173) [0]
>> 8,0 0 0 4478.128445767 0 m N cfq5864 complete rqnoidle 0
>>
>> Also in the blktrace I see gaps of 3-4s where nothing happens:
>> 8,0 0 0 4466.893959719 0 m N cfq20723 del_from_rr
>> 8,0 0 0 4466.893961395 0 m N cfq schedule dispatch
>> 8,0 1 1499539 4470.000135552 174 A WS 63 + 8 <- (8,1) 0
>> 8,0 1 1499540 4470.000138066 174 Q WS 63 + 8 [sync_supers]
>>
>> Meanwhile the requests issued earlier are just sitting on the queue
>> waiting to be dispatched.
>>
>> To make the issue easier to reproduce I'm constantly dropping the cache
>> with "while :; do echo 1 > /proc/sys/vm/drop_caches; done". The issue
>> occurs without it, but this helps reproduce it in about an hour. I also
>> wanted to point out that I see a decent amount of syslogd activity in the
>> trace, and I know that it does fsync(). I'm not sure whether that
>> contributes to the stalling, but I wanted to mention it.
>
> This is strange. I can see in the logs that for some reason writes are not
> being dispatched and the queue is doing nothing.
>
> 8,0 0 0 4466.132851211 0 m N cfq schedule dispatch
> 8,0 0 0 4466.132854564 0 m N cfq20720 keep_queue timer_pending st->count:2, dispatch:0
> 8,0 0 0 4466.139951919 0 m N cfq idle timer fired
> 8,0 0 0 4466.139954713 0 m N cfq20720 slice expired t=0
> 8,0 0 0 4466.139958066 0 m N cfq20720 sl_used=17 disp=21 charge=17 iops=0 sect=176
> 8,0 0 0 4466.139959742 0 m N cfq20720 del_from_rr
> 8,0 0 0 4466.139963653 0 m N cfq schedule dispatch
> 8,0 1 1499521 4466.791207736 7570 A R 7258191 + 8 <- (8,1) 7258128
> 8,0 1 1499522 4466.791209692 7570 Q R 7258191 + 8 [httpdmon]
> 8,0 1 1499523 4466.791217514 7570 G R 7258191 + 8 [httpdmon]
> 8,0 1 1499524 4466.791221705 7570 I R 7258191 + 8 ( 4191) [httpdmon]
> 8,0 1 0 4466.791227572 0 m N cfq7570 insert_request
> 8,0 1 0 4466.791229248 0 m N cfq7570 add_to_rr
> 8,0 1 0 4466.791235953 0 m N cfq20719 set_active wl_prio:0 wl_type:2
>
> Here we deleted queue 20720 and then did nothing for 0.6 seconds, even
> though the previous logs show that writes are pending and queued.
>
> For some reason cfq_schedule_dispatch() did not lead to the queue being
> kicked, or the queue was kicked but somehow the write queue was not
> selected for dispatch (a case of corrupt data structures?).
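
To help tell those two cases apart, I can also add a trace on the kick side
so we can see whether the work item scheduled by cfq_schedule_dispatch()
actually runs. Something along these lines is what I have in mind; this is
only a rough sketch against my 3.0.13 tree (the unplug_work handler,
cfq_kick_queue() here), so the exact context may differ:

	/* block/cfq-iosched.c, the unplug_work handler */
	static void cfq_kick_queue(struct work_struct *work)
	{
		struct cfq_data *cfqd =
			container_of(work, struct cfq_data, unplug_work);
		struct request_queue *q = cfqd->queue;

		spin_lock_irq(q->queue_lock);
		/* extra trace: confirm the deferred kick really happens */
		cfq_log(cfqd, "kick queue, busy_queues:%d", cfqd->busy_queues);
		__blk_run_queue(q);
		spin_unlock_irq(q->queue_lock);
	}

If the "kick queue" message shows up during the gap but the async queue is
still not selected, that would point at the queue-selection side rather than
the dispatch work never running.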
>
> Are you able to reproduce this issue on latest kernels (3.5-rc2)?

I cannot do it with my current application, but will try doing it another way.

I'm attaching a similar run with no stalls when I set slice_idle to 0.

> I would say put some logs in select_queue() and see where it bails out.
> That will confirm that select_queue was called and can also give some
> details on why we did not select the async queue for dispatch. (Note:
> select_queue is called multiple times, so putting a trace point there
> makes the logs very verbose.)

Well, I did add some instrumentation in select_queue (the "keep_queue
st->count:%d, dispatch:%u" line I mentioned above), but I will add more
and retest. Roughly, the extra trace points I'm planning look like the
sketch below.
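
This is only a sketch; the surrounding cfq_select_queue() context is
paraphrased from my 3.0.13 tree and may not match line for line, and only
the cfq_log() calls are the additions:

	static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
	{
		struct cfq_queue *cfqq;

		cfqq = cfqd->active_queue;
		if (!cfqq) {
			cfq_log(cfqd, "select_queue: no active queue");
			goto new_queue;
		}

		if (!cfqd->rq_queued) {
			cfq_log(cfqd, "select_queue: no requests queued");
			return NULL;
		}

		/* ... plus similar cfq_log()/cfq_log_cfqq() calls at each of
		 * the other early-return/expire paths, so the blktrace shows
		 * which branch bails out while the writes sit queued ... */
	}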
>
> Thanks
> Vivek
--
Josh