linux-kernel - Re: [PATCH 9/9] ext3: do not throttle metadata and journal IO

Message-ID: <20090422033032.GR19637@balbir.in.ibm.com>
Date:	Wed, 22 Apr 2009 09:00:32 +0530
From:	Balbir Singh <balbir@...ux.vnet.ibm.com>
To:	Theodore Tso <tytso@....edu>,
	Andrea Righi <righi.andrea@...il.com>,
	Jens Axboe <jens.axboe@...cle.com>,
	Paul Menage <menage@...gle.com>,
	Gui Jianfeng <guijianfeng@...fujitsu.com>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>,
	agk@...rceware.org, akpm@...ux-foundation.org,
	baramsori72@...il.com, Carl Henrik Lunde <chlunde@...g.uio.no>,
	dave@...ux.vnet.ibm.com, Divyesh Shah <dpshah@...gle.com>,
	eric.rannaud@...il.com, fernando@....ntt.co.jp,
	Hirokazu Takahashi <taka@...inux.co.jp>,
	Li Zefan <lizf@...fujitsu.com>, matt@...ehost.com,
	dradford@...ehost.com, ngupta@...gle.com, randy.dunlap@...cle.com,
	roberto@...it.it, Ryo Tsuruta <ryov@...inux.co.jp>,
	Satoshi UCHIDA <s-uchida@...jp.nec.com>,
	subrata@...ux.vnet.ibm.com, yoshikawa.takuya@....ntt.co.jp,
	containers@...ts.linux-foundation.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 9/9] ext3: do not throttle metadata and journal IO

* Theodore Tso <tytso@....edu> [2009-04-21 15:14:01]:

> On Tue, Apr 21, 2009 at 11:44:29PM +0530, Balbir Singh wrote:
> > 
> > That would be true in general, but only the process writing to the
> > file will dirty it. So dirty already accounts for the read/write
> > split. I'd assume that the cost is only for the dirty page, since we
> > do IO only on write in this case, unless I am missing something very
> > obvious.
> 
> Maybe I'm missing something, but the (in development) patches I saw
> seemed to use the existing infrastructure designed for RSS cost
> tracking (which is also not yet in mainline, unless I'm mistaken ---
> but I didn't see page_get_page_cgroup() in the mainline tree yet).
> 
> Right?  So if process A in cgroup A touches the file first by
> reading from it, then the pages read by process A will be assigned as
> being "owned" by cgroup A.   Then when the patch described at
> 
>       http://lkml.org/lkml/2008/9/9/245

That is correct, but on reclaim (on hitting the limit), a page that is
frequently used by B and not by A can get reclaimed from A and move to
B if B is using it heavily.

> 
> ... tries to charge a write done by process B in cgroup B, the code
> will call page_get_page_cgroup(), see that it is "owned" by cgroup A,
> and charge the dirty page to cgroup A.  If process A and all of the
> other processes in cgroup A only access this file read-only, and
> process B is updating this file very heavily --- and it is a large
> file --- then cgroup B will get a completely free pass as far as
> dirtying pages to this file, since it will be all charged 100% to
> cgroup A, incorrectly.
> 
> So what am I missing?

You are right. As long as A does not exceed its limit, B will get a
free pass on the page. From the memory controller's perspective,
though, the page will be inactive on A's LRU and active on the global
LRU. We'll need to find a way to fix this if it turns out to be a very
common scenario for the IO controller.
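
The scenario discussed above can be illustrated with a toy model. This is
only a sketch under stated assumptions, not kernel code and not the
implementation from the patches under discussion: the class ToyMemcg and
its methods read(), write() and reclaim() are hypothetical names, and the
model assumes simple first-touch ownership with dirty pages charged to
the owning cgroup rather than to the writer.

```python
from collections import defaultdict


class ToyMemcg:
    """Toy model of first-touch page ownership (hypothetical, not kernel code)."""

    def __init__(self):
        self.owner = {}                         # page index -> owning cgroup
        self.dirty = set()                      # pages currently dirty
        self.dirty_charged = defaultdict(int)   # cgroup -> cumulative dirty charges

    def read(self, cgroup, page):
        # Assumption: the first cgroup to touch a page becomes its owner.
        self.owner.setdefault(page, cgroup)

    def write(self, cgroup, page):
        # A write also touches the page, so it takes ownership if unowned.
        self.owner.setdefault(page, cgroup)
        if page not in self.dirty:
            self.dirty.add(page)
            # The dirty page is charged to the owner, not to the writer.
            self.dirty_charged[self.owner[page]] += 1

    def reclaim(self, page):
        # Reclaim drops the page and its ownership; the next toucher owns it.
        self.owner.pop(page, None)
        self.dirty.discard(page)


def demo():
    m = ToyMemcg()
    pages = range(4)
    for p in pages:
        m.read("A", p)            # cgroup A reads the file first
    for p in pages:
        m.write("B", p)           # cgroup B dirties the same pages
    before_reclaim = dict(m.dirty_charged)
    for p in pages:
        m.reclaim(p)              # e.g. A hits its limit and the pages go
    for p in pages:
        m.write("B", p)           # B touches them again and now owns them
    return before_reclaim, dict(m.dirty_charged)


if __name__ == "__main__":
    print(demo())
```

In this model, every dirty page is charged to A for as long as A keeps
ownership, and B is charged only once the pages have been reclaimed from
A and touched again by B.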

-- 
	Balbir