linux-kernel - Re: multi-second application stall in open()

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20120307195638.GI13430@redhat.com>
Date:	Wed, 7 Mar 2012 14:56:38 -0500
From:	Vivek Goyal <vgoyal@...hat.com>
To:	Jens Axboe <axboe@...nel.dk>
Cc:	Josh Hunt <joshhunt00@...il.com>, linux-kernel@...r.kernel.org,
	tj@...nel.org
Subject: Re: multi-second application stall in open()

On Wed, Mar 07, 2012 at 07:56:10PM +0100, Jens Axboe wrote:
[..]

> > 
> > blktrace of cfq look odd. I see that some IO (async writes) are being
> > submitted but CFQ did not dispatch it for a long time.  Even some unplugs
> > came in still nothing happened. Also no completions are happening during
> > that window. Not sure why CFQ refuses to dispatch queued writes.
> > 
> > Request added by flusher.
> > 
> >   8,0    1    36926  5028.546000122  2846  A   W 20147012 + 8 <- (8,3)
> > 3375152
> >   8,0    1    36927  5028.546001798  2846  Q   W 20147012 + 8 [flush-8:0]
> >   8,0    1    36928  5028.546009900  2846  G   W 20147012 + 8 [flush-8:0]
> >   8,0    1    36929  5028.546014649  2846  I   W 20147012 + 8 (    4749)
> > [flush-8:0]
> > 
> > And this request is dispatched after 22 seconds.
> > 
> >   8,0    1    37056  5050.117337221   162  D   W 20147012 + 16 (21571322572) [sync_supers]
> > 
> > 
> > And it completes fairly fast.
> > 
> >  8,0    0    36522  5050.117686149  9657  C   W 20147012 + 16 (  348928)
> > [0]
> > 
> > So not sure why CFQ will hold that request for so long when other IO is
> > not happening.
> > 
> > Please try latest kernels and see if deadline has the same issue. If not,
> > then we know somehow CFQ is related. If it still happens on latest
> > kernels, can you try capturing blktrace again when you are experiencing
> > the delays.
> 
> I'm seeing something very similar here. While testing the gtk fio
> client, I ran a job that issued a lot of random reads to my primary
> disk. 64 ios in flight, direct, libaio, 512b random reads. Firefox
> essentially froze, windows starting freezing up around me.
> 
> I'll try and reproduce, but a quick guess would be that things starting
> piling up in fsync() or stalling on writes in general, since we are
> heavily starving those.

Quite possible. Other people also had reported write starvation issues. I 
have got reports of "hung task timeout of 120 seconds" reports in presence
of sync IO happening on same disk/partition.

We probably need to do something about write starvation. I had posted one
patch to make sure we dispatch atleast one WRITE after we were waiting for
pending sync requests to finish.

https://lkml.org/lkml/2011/6/10/326

This might help a bit but might not prevent servere delays in dispatching
async writes as things are so heavily loaded in favor or sync IO.

BTW, in this case, I did not see any sync IO completions happening while
async was not being dispatched. That's little odd.

Thanks
Vivek
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/