linux-kernel - Re: [PATCH 9/9] ext3: do not throttle metadata and journal IO


Message-ID: <20090421001822.GB19186@mit.edu>
Date:	Mon, 20 Apr 2009 20:18:22 -0400
From:	Theodore Tso <tytso@....edu>
To:	Jens Axboe <jens.axboe@...cle.com>,
	Paul Menage <menage@...gle.com>,
	Balbir Singh <balbir@...ux.vnet.ibm.com>,
	Gui Jianfeng <guijianfeng@...fujitsu.com>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>,
	agk@...rceware.org, akpm@...ux-foundation.org,
	baramsori72@...il.com, Carl Henrik Lunde <chlunde@...g.uio.no>,
	dave@...ux.vnet.ibm.com, Divyesh Shah <dpshah@...gle.com>,
	eric.rannaud@...il.com, fernando@....ntt.co.jp,
	Hirokazu Takahashi <taka@...inux.co.jp>,
	Li Zefan <lizf@...fujitsu.com>, matt@...ehost.com,
	dradford@...ehost.com, ngupta@...gle.com, randy.dunlap@...cle.com,
	roberto@...it.it, Ryo Tsuruta <ryov@...inux.co.jp>,
	Satoshi UCHIDA <s-uchida@...jp.nec.com>,
	subrata@...ux.vnet.ibm.com, yoshikawa.takuya@....ntt.co.jp,
	containers@...ts.linux-foundation.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 9/9] ext3: do not throttle metadata and journal IO

On Fri, Apr 17, 2009 at 04:39:05PM +0200, Andrea Righi wrote:
> 
> Exactly, the purpose here is to prioritize the dispatching of journal
> IO requests in the IO controller. I may have used an inappropriate flag
> or a quick&dirty solution, but without this, any cgroup/process that
> generates a lot of journal activity may be throttled and cause other
> cgroups/processes to be incorrectly blocked when they try to write to
> disk.

With ext3 and ext4, all journal I/O requests end up going through
kjournald.  So the question is what I/O control group do you put
kjournald in?  If you unrestrict it, it makes the problem go away
entirely.  On the other hand, it is doing work on behalf of other
processes, and there is no real way to separate out on whose behalf
kjournald is doing said work.  So I'm not sure fundamentally you'll be
able to do much with any filesystem journalling activity --- and ext3
makes life especially bad because of data=ordered mode. 

> > I'm assuming it's the "usual" problem with lower priority IO getting
> > access to fs exclusive data. It's quite trivial to cause problems with
> > higher IO priority tasks then getting stuck waiting for the low priority
> > process, since they also need to access that fs exclusive data.
> 
> Right. I thought about using the BIO_RW_SYNC flag instead, but as Ted
> pointed out, some cgroups/processes might be able to evade the IO
> control by issuing a lot of fsync()s. We could also limit the fsync()
> rate in the IO controller, but it sounds like a dirty workaround...

Well, if you use data=writeback or Chris Mason's proposed data=guarded
mode, then at least all of the data blocks will be written in the process
context of the application, and not kjournald's process context.  So
one solution that might be the best that we have for now is to treat
kjournald as special from an I/O controller point of view (i.e., give
it its own cgroup), and then use a filesystem mode which avoids data
blocks getting written in kjournald (i.e., ext3 data=writeback or
data=guarded, ext4's delayed allocation, etc.)
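
To make the mode distinction concrete, here is a minimal Python sketch
(not kernel code) that classifies an ext3 mount-option string by whether
data blocks would be expected to reach disk in kjournald's context.  The
function names are hypothetical, the fallback to data=ordered is an
assumption about the usual ext3 default, and data=guarded is only a
proposed mode.

```python
# Illustrative sketch only: classify ext3 data= modes by who is expected
# to write the data blocks.  Names and the default are assumptions.

def data_mode(mount_options: str, default: str = "ordered") -> str:
    """Return the data= journalling mode named in a comma-separated
    mount-option string, or `default` if none is given."""
    for opt in mount_options.split(","):
        opt = opt.strip()
        if opt.startswith("data="):
            return opt[len("data="):]
    return default


def data_written_by_kjournald(mode: str) -> bool:
    """True if data blocks are expected to be flushed from kjournald's
    context (ordered: flushed before the metadata commit; journal: data
    goes through the journal itself).  writeback and the proposed
    guarded mode leave data writes outside kjournald."""
    return mode in ("ordered", "journal")


if __name__ == "__main__":
    for opts in ("rw,noatime", "rw,data=writeback", "rw,data=guarded"):
        mode = data_mode(opts)
        print(opts, "->", mode, data_written_by_kjournald(mode))
```

An I/O controller policy could use a check like this to decide whether
putting kjournald in its own cgroup is enough for a given mount.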

One major form of leakage that you're still going to have is pdflush,
which again is more I/O happening in somebody else's process context.
Ultimately I think trying to throttle I/O at write submission time,
whether at entry into the block layer or in the elevators, is going to
be highly problematic.  Suppose someone dirties a large number of pages.
That's a system resource, and delaying the writes because a particular
container has used more than its fair share will cause the entire
system to run out of memory, which is not a good thing.

Ultimately, I think what you'll need to do is write throttling, and
suspend processes that are dirtying too many pages, instead of trying
to control the I/O.
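
As a rough illustration of that idea, the toy Python model below
accounts dirty pages per group and tells the caller when a writer
should be suspended, rather than delaying the I/O itself.  It is a
sketch under stated assumptions, not how the kernel does it: the class
name, the per-group limit, and the method names are all hypothetical.

```python
# Toy model of write throttling: suspend the process that dirties too
# many pages instead of holding back its I/O.  All names are made up.

class DirtyThrottle:
    def __init__(self, limit_pages: int):
        self.limit = limit_pages      # assumed per-group dirty-page limit
        self.dirty = {}               # group name -> dirty page count

    def over_limit(self, group: str) -> bool:
        return self.dirty.get(group, 0) > self.limit

    def dirty_pages(self, group: str, n: int) -> bool:
        """Account n newly dirtied pages for `group`.  Returns True if
        the writer should be suspended until writeback catches up."""
        self.dirty[group] = self.dirty.get(group, 0) + n
        return self.over_limit(group)

    def writeback_done(self, group: str, n: int) -> bool:
        """Account n pages cleaned by writeback.  Returns True if the
        group is still over its limit (writer stays suspended)."""
        self.dirty[group] = max(0, self.dirty.get(group, 0) - n)
        return self.over_limit(group)


if __name__ == "__main__":
    t = DirtyThrottle(limit_pages=100)
    print(t.dirty_pages("batch", 60))      # under the limit
    print(t.dirty_pages("batch", 60))      # over the limit: suspend
    print(t.writeback_done("batch", 50))   # back under: resume
```

The point of the model is that the decision is made when pages are
dirtied, so writeback itself is never delayed and dirty memory stays
bounded per group.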

					- Ted