linux-kernel - Re: [PATCH 9/9] ext3: do not throttle metadata and journal IO

Message-ID: <20090421083001.GA8441@linux>
Date:	Tue, 21 Apr 2009 10:30:02 +0200
From:	Andrea Righi <righi.andrea@...il.com>
To:	Theodore Tso <tytso@....edu>
Cc:	Jens Axboe <jens.axboe@...cle.com>,
	Paul Menage <menage@...gle.com>,
	Balbir Singh <balbir@...ux.vnet.ibm.com>,
	Gui Jianfeng <guijianfeng@...fujitsu.com>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>,
	agk@...rceware.org, akpm@...ux-foundation.org,
	baramsori72@...il.com, Carl Henrik Lunde <chlunde@...g.uio.no>,
	dave@...ux.vnet.ibm.com, Divyesh Shah <dpshah@...gle.com>,
	eric.rannaud@...il.com, fernando@....ntt.co.jp,
	Hirokazu Takahashi <taka@...inux.co.jp>,
	Li Zefan <lizf@...fujitsu.com>, matt@...ehost.com,
	dradford@...ehost.com, ngupta@...gle.com, randy.dunlap@...cle.com,
	roberto@...it.it, Ryo Tsuruta <ryov@...inux.co.jp>,
	Satoshi UCHIDA <s-uchida@...jp.nec.com>,
	subrata@...ux.vnet.ibm.com, yoshikawa.takuya@....ntt.co.jp,
	containers@...ts.linux-foundation.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 9/9] ext3: do not throttle metadata and journal IO

On Mon, Apr 20, 2009 at 08:18:22PM -0400, Theodore Tso wrote:
> On Fri, Apr 17, 2009 at 04:39:05PM +0200, Andrea Righi wrote:
> > 
> > Exactly, the purpose here is to prioritize the dispatching of journal
> > IO requests in the IO controller. I may have used an inappropriate flag
> > or a quick&dirty solution, but without this, any cgroup/process that
> > generates a lot of journal activity may be throttled and cause other
> > cgroups/processes to be incorrectly blocked when they try to write to
> > disk.
> 
> With ext3 and ext4, all journal I/O requests end up going through
> kjournald.  So the question is what I/O control group do you put
> kjournald in?  If you unrestrict it, it makes the problem go away
> entirely.  On the other hand, it is doing work on behalf of other
> processes, and there is no real way to separate out on whose behalf
> kjournald is doing said work.  So I'm not sure fundamentally you'll be
> able to do much with any filesystem journalling activity --- and ext3
> makes life especially bad because of data=ordered mode. 

OK, I've just removed the ext3/ext4 patch from io-throttle v14 and the
results are pretty much the same. BTW, I can't even prioritize all the
BIO_RW_SYNC requests, because then direct IO would never be limited at
all. Or at least I would have to add something like an
is_in_direct_io() check.
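
A rough sketch of that decision, written as a user-space Python model
rather than kernel code. The IoRequest type and should_throttle() are
hypothetical names used only for illustration; the "sync" field stands in
for a request carrying BIO_RW_SYNC and "direct_io" for the result of an
is_in_direct_io()-style check, which does not exist yet.

```python
from dataclasses import dataclass


@dataclass
class IoRequest:
    """Hypothetical stand-in for a block IO request (not the kernel's struct bio)."""
    sync: bool       # models a request flagged BIO_RW_SYNC
    direct_io: bool  # models the outcome of an is_in_direct_io()-style check


def should_throttle(req: IoRequest) -> bool:
    """Return True if the cgroup IO limit should be applied to this request."""
    # Sync requests are prioritized (not throttled), except when they come
    # from direct IO, which would otherwise never be limited at all.
    if req.sync and not req.direct_io:
        return False
    return True


if __name__ == "__main__":
    for req in (IoRequest(True, False), IoRequest(True, True),
                IoRequest(False, False), IoRequest(False, True)):
        print(req, "-> throttle" if should_throttle(req) else "-> pass through")
```

Only the sync, non-direct case bypasses the limit here; everything else
stays subject to the cgroup's IO limit.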

Anyway, I agree, and I think it's reasonable to always leave kjournald
in the root cgroup and not set any IO limit for that cgroup.

But I wouldn't add additional checks for this; in the end, we know that
"Unix gives you just enough rope to hang yourself".

> 
> > > I'm assuming it's the "usual" problem with lower priority IO getting
> > > access to fs exclusive data. It's quite trivial to cause problems with
> > > higher IO priority tasks then getting stuck waiting for the low priority
> > > process, since they also need to access that fs exclusive data.
> > 
> > Right. I thought about using the BIO_RW_SYNC flag instead, but as Ted
> > pointed out, some cgroups/processes might be able to evade the IO
> > control by issuing a lot of fsync()s. We could also limit the fsync() rate
> > in the IO controller, but it sounds like a dirty workaround...
> 
> Well, if you use data=writeback or Chris Mason's proposed data=guarded
> mode, then at least all of the data blocks will be written in the process
> context of the application, and not kjournald's process context.  So
> one solution that might be the best that we have for now is to treat
> kjournald as special from an I/O controller point of view (i.e., give
> it its own cgroup), and then use a filesystem mode which avoids data
> blocks getting written in kjournald (i.e., ext3 data=writeback or
> data=guarded, ext4's delayed allocation, etc.)

Agree.

> 
> One major form of leakage that you're still going to have is pdflush;
> which again, is more I/O happening in somebody else's process context.
> Ultimately I think trying to throttle I/O at write submission time
> whether at entry into block layer or in the elevators, is going to be
> highly problematic.  Suppose someone dirties a large number of pages?
> That's a system resource, and delaying the writes because a particular
> container has used more than its fair share will cause the entire
> system to run out of memory, which is not a good thing.
> 
> Ultimately, I think what you'll need to do is write throttling, and suspend
> processes that are dirtying too many pages, instead of trying to
> control the I/O.

We're also trying to address this issue by setting a max dirty pages limit
per cgroup and forcing a direct writeback when these limits are exceeded.

In this case dirty ratio throttling should happen automatically because
the process will be throttled by the IO controller when it tries to
write back the dirty pages and submit IO requests.
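
To make the idea concrete, here is a toy user-space model in Python. It
is only a sketch under simplifying assumptions, not the io-throttle
implementation: CgroupDirtyModel, its fields, and the units (pages and
pages per second) are all hypothetical, and writeback is assumed to be
paced at exactly the cgroup's bandwidth limit.

```python
class CgroupDirtyModel:
    """Toy model of one cgroup with a dirty-page limit and an IO bandwidth limit."""

    def __init__(self, max_dirty_pages: int, bandwidth_pages_per_sec: float):
        self.max_dirty_pages = max_dirty_pages    # assumed per-cgroup dirty limit
        self.bandwidth = bandwidth_pages_per_sec  # assumed IO controller limit
        self.dirty = 0                            # pages currently dirty
        self.stalled = 0.0                        # total seconds the writer was blocked

    def dirty_pages(self, n: int) -> float:
        """Dirty n pages and return how long the writer is stalled, in seconds."""
        self.dirty += n
        excess = self.dirty - self.max_dirty_pages
        if excess <= 0:
            return 0.0  # under the limit: no forced writeback, no stall
        # Over the limit: the writer itself writes back the excess pages.
        # The IO controller paces that writeback, so the writer is blocked
        # for as long as the excess takes at the allowed bandwidth.
        delay = excess / self.bandwidth
        self.dirty -= excess
        self.stalled += delay
        return delay


if __name__ == "__main__":
    cg = CgroupDirtyModel(max_dirty_pages=100, bandwidth_pages_per_sec=50)
    print(cg.dirty_pages(80))  # under the limit
    print(cg.dirty_pages(70))  # 50 pages over the limit
```

In this model the stall lands on the process that dirtied the pages,
because the writeback happens in its own context.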

What's your opinion?

Thanks,
-Andrea