linux-kernel - Re: multi-second application stall in open()

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAKA=qzaHJ0OcbvtT-mwKBUBvzPkxE4r-bW6gDvicKhoQeyFWEg@mail.gmail.com>
Date:	Thu, 8 Mar 2012 16:22:31 -0600
From:	Josh Hunt <joshhunt00@...il.com>
To:	Jens Axboe <axboe@...nel.dk>, Vivek Goyal <vgoyal@...hat.com>
Cc:	linux-kernel@...r.kernel.org, tj@...nel.org
Subject: Re: multi-second application stall in open()

On Wed, Mar 7, 2012 at 3:08 PM, Josh Hunt <joshhunt00@...il.com> wrote:
> On Wed, Mar 7, 2012 at 12:56 PM, Jens Axboe <axboe@...nel.dk> wrote:
>> On 03/07/2012 05:28 PM, Vivek Goyal wrote:
>>> On Wed, Mar 07, 2012 at 07:43:42AM -0600, Josh Hunt wrote:
>>>> On Tue, Mar 6, 2012 at 3:56 PM, Josh Hunt <joshhunt00@...il.com> wrote:
>>>>> I am trying to diagnose a stall in an application when making a call
>>>>> to open() with O_TRUNC. The issue was originally seen on 2.6.38 and
>>>>> I've since reproduced on 2.6.35. I believe the last kernel that I did
>>>>> not see the stall was 2.6.32. The system is under low to moderate cpu
>>>>> and memory load when the issue occurs. The filesystem is ext2, but I
>>>>> mounted it as ext4 and still saw the issue.
>>>>>
>>>>> To me it appears the O_TRUNC needs to truncate pages in the page
>>>>> cache, but maybe those are already out for writeback? This causes the
>>>>> open() to wait until that operation completes at which time it is then
>>>>> able to proceed. If that's true then it looks like the writeback is
>>>>> sitting on the queue forever waiting (tens of seconds) to be
>>>>> dispatched to the device.
>>>>>
>>>>> Here's some more info/data I collected:
>>>>> (Unfortunately the data here and attached are from separate runs so
>>>>> the #s will not match up file to file.)
>>>>>
>>>>> Snippet of an strace when it is hanging:
>>>>> 1329323286.589182 close(37)             = 0 <0.000010>
>>>>> 1329323286.589212 clock_gettime(0x5 /* CLOCK_??? */, {1329323286,
>>>>> 587579583}) = 0 <0.000007>
>>>>> 1329323286.589241 open("/usr/local/etc/abcd/known/test808",
>>>>> O_WRONLY|O_CREAT|O_TRUNC, 0644)                = 37 <76.841586>
>>>>> 1329323363.441841 fchmod(37, 0644)      = 0 <0.000012>
>>>>> 1329323363.441882 fcntl64(37, F_GETFL)  = 0x1 (flags O_WRONLY) <0.000011>
>>>>> 1329323363.441915 fcntl64(37, F_SETFL, O_WRONLY|O_NONBLOCK) = 0 <0.000010>
>>>>>
>>>>> To debug this more I enabled the hang check timer to dump stack when
>>>>> anything hung for longer than 10s. There I see the the hanging
>>>>> application appearing to be stuck in sync_page(). See attached dmesg
>>>>> for traces.
>>>>>
>>>>> I've also captured "iostat -xm 1" while the application was running
>>>>> and have attached a snippet of that when the problem is occurring. I
>>>>> see 100% utilization for many seconds and then eventually the
>>>>> operations complete, but the await time is extremely high. To me there
>>>>> doesn't look like much IO going on which could be starving out any
>>>>> requests.
>>>>>
>>>>> I have not been able to perform a full bisection to better understand
>>>>> what may be causing this, but do see this on 2.6.38, a little less
>>>>> frequently but still occurs on 2.6.35, and no longer see the issue on
>>>>> 2.6.32. If I switch slice_idle to 0 or change to the deadline
>>>>> scheduler the issue appears to go away in both 2.6.35 and 38.
>>>
>>> Is it happening on latest kernel (3.3-rc6) ?
>>>
>>> Can you change the IO scheduler to deadline on sda and see if the issue
>>> still happens. (echo deadline > /sys/block/sda/queue/scheduler).
>>>
>>> blktrace of cfq look odd. I see that some IO (async writes) are being
>>> submitted but CFQ did not dispatch it for a long time.  Even some unplugs
>>> came in still nothing happened. Also no completions are happening during
>>> that window. Not sure why CFQ refuses to dispatch queued writes.
>>>
>>> Request added by flusher.
>>>
>>>   8,0    1    36926  5028.546000122  2846  A   W 20147012 + 8 <- (8,3)
>>> 3375152
>>>   8,0    1    36927  5028.546001798  2846  Q   W 20147012 + 8 [flush-8:0]
>>>   8,0    1    36928  5028.546009900  2846  G   W 20147012 + 8 [flush-8:0]
>>>   8,0    1    36929  5028.546014649  2846  I   W 20147012 + 8 (    4749)
>>> [flush-8:0]
>>>
>>> And this request is dispatched after 22 seconds.
>>>
>>>   8,0    1    37056  5050.117337221   162  D   W 20147012 + 16 (21571322572) [sync_supers]
>>>
>>>
>>> And it completes fairly fast.
>>>
>>>  8,0    0    36522  5050.117686149  9657  C   W 20147012 + 16 (  348928)
>>> [0]
>>>
>>> So not sure why CFQ will hold that request for so long when other IO is
>>> not happening.
>>>
>>> Please try latest kernels and see if deadline has the same issue. If not,
>>> then we know somehow CFQ is related. If it still happens on latest
>>> kernels, can you try capturing blktrace again when you are experiencing
>>> the delays.
>>
>> I'm seeing something very similar here. While testing the gtk fio
>> client, I ran a job that issued a lot of random reads to my primary
>> disk. 64 ios in flight, direct, libaio, 512b random reads. Firefox
>> essentially froze, windows starting freezing up around me.
>>
>> I'll try and reproduce, but a quick guess would be that things starting
>> piling up in fsync() or stalling on writes in general, since we are
>> heavily starving those.
>>
>> --
>> Jens Axboe
>>
> The stack traces from the hang check timer do seem to always show
> syslogd calling fsync in my case. So this could be fsync-related. In
> my case there does not seem to be much IO going on though (at least
> from what I can tell.)
>
> --
> Josh

A crude bisection seems to show that if I revert "blkio: Set
must_dispatch only if we decided to not dispatch the request"
(bf7919371025412978268efca4b09dd847acb395) I no longer see the stalls
in 2.6.35. However this does not seem to solve the problem if I revert
it in 2.6.38.

By setting slice_idle to 0, does this basically disable plugging?
Based on the blktrace info it seems that something is going wrong with
plugging with my testcase. I'm just wondering why setting slice_idle
to 0 seems to resolve my issue? Also, since we see unplugs in the
blktrace how could the requests still not be getting sent to the disk?

Thanks for your help.
-- 
Josh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/