linux-ext4 - Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-Id: <1C0A2FC8-620C-4AFE-A921-35EDAC377BD4@linaro.org>
Date:   Mon, 20 May 2019 12:45:58 +0200
From:   Paolo Valente <paolo.valente@...aro.org>
To:     Jan Kara <jack@...e.cz>
Cc:     Theodore Ts'o <tytso@....edu>,
        "Srivatsa S. Bhat" <srivatsa@...il.mit.edu>,
        linux-fsdevel@...r.kernel.org,
        linux-block <linux-block@...r.kernel.org>,
        linux-ext4@...r.kernel.org, cgroups@...r.kernel.org,
        linux-kernel@...r.kernel.org, axboe@...nel.dk, jmoyer@...hat.com,
        amakhalov@...are.com, anishs@...are.com, srivatsab@...are.com
Subject: Re: CFQ idling kills I/O performance on ext4 with blkio cgroup
 controller



> Il giorno 20 mag 2019, alle ore 11:15, Jan Kara <jack@...e.cz> ha scritto:
> 
> On Sat 18-05-19 15:28:47, Theodore Ts'o wrote:
>> On Sat, May 18, 2019 at 08:39:54PM +0200, Paolo Valente wrote:
>>> I've addressed these issues in my last batch of improvements for
>>> BFQ, which landed in the upcoming 5.2. If you give it a try, and
>>> still see the problem, then I'll be glad to reproduce it, and
>>> hopefully fix it for you.
>> 
>> Hi Paolo, I'm curious if you could give a quick summary about what you
>> changed in BFQ?
>> 
>> I was considering adding support so that if userspace calls fsync(2)
>> or fdatasync(2), to attach the process's CSS to the transaction, and
>> then charge all of the journal metadata writes the process's CSS.  If
>> there are multiple fsync's batched into the transaction, the first
>> process which forced the early transaction commit would get charged
>> the entire journal write.  OTOH, journal writes are sequential I/O, so
>> the amount of disk time for writing the journal is going to be
>> relatively small, and especially, the fact that work from other
>> cgroups is going to be minimal, especially if hadn't issued an
>> fsync().
> 
> But this makes priority-inversion problems with ext4 journal worse, doesn't
> it? If we submit journal commit in blkio cgroup of some random process, it
> may get throttled which then effectively blocks the whole filesystem. Or do
> you want to implement a more complex back-pressure mechanism where you'd
> just account to different blkio cgroup during journal commit and then
> throttle as different point where you are not blocking other tasks from
> progress?
> 
>> In the case where you have three cgroups all issuing fsync(2) and they
>> all landed in the same jbd2 transaction thanks to commit batching, in
>> the ideal world we would split up the disk time usage equally across
>> those three cgroups.  But it's probably not worth doing that...
>> 
>> That being said, we probably do need some BFQ support, since in the
>> case where we have multiple processes doing buffered writes w/o fsync,
>> we do charnge the data=ordered writeback to each block cgroup. Worse,
>> the commit can't complete until the all of the data integrity
>> writebacks have completed.  And if there are N cgroups with dirty
>> inodes, and slice_idle set to 8ms, there is going to be 8*N ms worth
>> of idle time tacked onto the commit time.
> 
> Yeah. At least in some cases, we know there won't be any more IO from a
> particular cgroup in the near future (e.g. transaction commit completing,
> or when the layers above IO scheduler already know which IO they are going
> to submit next) and in that case idling is just a waste of time.

Yep.  Issues like this are targeted exactly by the improvement I
mentioned in my previous reply.

> But so far
> I haven't decided how should look a reasonably clean interface for this
> that isn't specific to a particular IO scheduler implementation.
> 

That's an interesting point.  So far, I've assumed that nobody would
have told anything to BFQ.  But if you guys think that such a
communication may be acceptable at some degree, then I'd be glad to
try to come up with some solution.  For instance: some hook that any
I/O scheduler may export if meaningful.

Thanks,
Paolo

> 								Honza
> --
> Jan Kara <jack@...e.com>
> SUSE Labs, CR


Download attachment "signature.asc" of type "application/pgp-signature" (834 bytes)