Message-ID: <20090421204905.GA5573@linux>
Date:	Tue, 21 Apr 2009 22:49:06 +0200
From:	Andrea Righi <righi.andrea@...il.com>
To:	Theodore Tso <tytso@....edu>
Cc:	Balbir Singh <balbir@...ux.vnet.ibm.com>,
	Jens Axboe <jens.axboe@...cle.com>,
	Paul Menage <menage@...gle.com>,
	Gui Jianfeng <guijianfeng@...fujitsu.com>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>,
	agk@...rceware.org, akpm@...ux-foundation.org,
	baramsori72@...il.com, Carl Henrik Lunde <chlunde@...g.uio.no>,
	dave@...ux.vnet.ibm.com, Divyesh Shah <dpshah@...gle.com>,
	eric.rannaud@...il.com, fernando@....ntt.co.jp,
	Hirokazu Takahashi <taka@...inux.co.jp>,
	Li Zefan <lizf@...fujitsu.com>, matt@...ehost.com,
	dradford@...ehost.com, ngupta@...gle.com, randy.dunlap@...cle.com,
	roberto@...it.it, Ryo Tsuruta <ryov@...inux.co.jp>,
	Satoshi UCHIDA <s-uchida@...jp.nec.com>,
	subrata@...ux.vnet.ibm.com, yoshikawa.takuya@....ntt.co.jp,
	containers@...ts.linux-foundation.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 9/9] ext3: do not throttle metadata and journal IO

On Tue, Apr 21, 2009 at 03:14:01PM -0400, Theodore Tso wrote:
> On Tue, Apr 21, 2009 at 11:44:29PM +0530, Balbir Singh wrote:
> > 
> > That would be true in general, but only the process writing to the
> > file will dirty it. So dirty already accounts for the read/write
> > split. I'd assume that the cost is only for the dirty page, since we
> > do IO only on write in this case, unless I am missing something very
> > obvious.
> 
> Maybe I'm missing something, but the (in development) patches I saw
> seemed to use the existing infrastructure designed for RSS cost
> tracking (which is also not yet in mainline, unless I'm mistaken ---
> but I didn't see page_get_page_cgroup() in the mainline tree yet).

page_get_page_cgroup() is the old page_cgroup interface; it has been
replaced by lookup_page_cgroup(), which is in mainline.
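
Just to make the point concrete, a charge path on current kernels would
look roughly like this (sketch only; mem_cgroup_from_pc() is a made-up
name for whatever accessor reaches the memcg from the page_cgroup):

#include <linux/mm_types.h>
#include <linux/page_cgroup.h>
#include <linux/memcontrol.h>

/*
 * Sketch: lookup_page_cgroup() is the mainline replacement for the old
 * page_get_page_cgroup().  mem_cgroup_from_pc() is a hypothetical
 * helper standing in for the real accessor.
 */
static struct mem_cgroup *sketch_page_owner(struct page *page)
{
	struct page_cgroup *pc = lookup_page_cgroup(page);

	if (!pc)
		return NULL;
	return mem_cgroup_from_pc(pc);	/* hypothetical helper */
}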

> 
> Right?  So if process A in cgroup A touches the file first by
> reading from it, then the pages read by process A will be assigned as
> being "owned" by cgroup A.   Then when the patch described at
> 
>       http://lkml.org/lkml/2008/9/9/245

And this patch must be completely reworked.

> 
> ... tries to charge a write done by process B in cgroup B, the code
> will call page_get_page_cgroup(), see that it is "owned" by cgroup A,
> and charge the dirty page to cgroup A.  If process A and all of the
> other processes in cgroup A only access this file read-only, and
> process B is updating this file very heavily --- and it is a large
> file --- then cgroup B will get a completely free pass as far as
> dirtying pages to this file, since it will be all charged 100% to
> cgroup A, incorrectly.

Yep, right. Anyway, it's not completely wrong to account dirty pages in
this way. The dirty pages actually belong to cgroup A, and providing
per-cgroup upper limits on dirty pages could help distribute dirty
pages, which are hard/slow to reclaim, equally among cgroups.

But this is definitely another problem.

And it doesn't help with the problem described by Ted, especially for
the IO controller. The only way I see to correctly handle that case is
to limit the rate of dirty pages per cgroup, accounting the dirty
activity to the cgroup that first touched the page (and not to the
owner, as intended by the memory controller).
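
Concretely, I'm thinking of something like the following at dirty time
(just a sketch, the iothrottle_* names are made up):

#include <linux/mm_types.h>
#include <linux/page_cgroup.h>
#include <linux/sched.h>

/*
 * Sketch: stamp the page with the id of the cgroup that first dirties
 * it, so that later writeback can be charged to that cgroup instead of
 * the memcg owner.  All three helpers below are hypothetical
 * placeholders.
 */
static void sketch_account_dirty(struct page *page)
{
	struct page_cgroup *pc = lookup_page_cgroup(page);

	if (pc && !page_cgroup_iothrottle_id(pc))		/* hypothetical */
		page_cgroup_set_iothrottle_id(pc,
				iothrottle_css_id(current));	/* hypothetical */
}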

And this should probably be strictly connected to the IO controller. If
we throttle or delay the dispatching/submission of some IO requests
without also throttling the dirty-page rate, a cgroup could completely
fill its own available memory with dirty (hard and slow to reclaim)
pages.

That is in part the approach I used in io-throttle v12, adding a hook
in balance_dirty_pages_ratelimited_nr() to throttle the current task
when the cgroup's IO limits are exceeded. Argh!

So, another proposal could be to re-add that old hook in
balance_dirty_pages_ratelimited_nr() in io-throttle v14.
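
The hook itself could be as simple as this (sketch only;
cgroup_io_throttle() is a placeholder name for the io-throttle entry
point, and the sleep policy is just an example):

#include <linux/fs.h>
#include <linux/sched.h>

/*
 * Sketch of the hook called from balance_dirty_pages_ratelimited_nr():
 * if the current task's cgroup has exceeded its IO limit, put the task
 * to sleep here instead of letting it keep dirtying pages.
 */
static void sketch_iothrottle_balance_dirty(struct address_space *mapping,
					    unsigned long nr_dirtied)
{
	unsigned long sleep = cgroup_io_throttle(mapping, nr_dirtied); /* hypothetical */

	if (sleep)
		schedule_timeout_killable(sleep);
}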

In this way io-throttle would:

- use the page_cgroup infrastructure and page_cgroup->flags to encode
  the id of the cgroup that first dirtied a given page
- account and, where appropriate, throttle sync and writeback IO
  requests in submit_bio() (see the sketch after this list)
- at the same time, throttle tasks in
  balance_dirty_pages_ratelimited_nr() if the cgroup they belong to has
  exhausted its IO bandwidth (or quota, share, etc. in the case of a
  proportional bandwidth limit)
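
For the submit_bio() part, the accounting would look roughly like this
(sketch only; all the iothrottle_* helpers are placeholders):

#include <linux/bio.h>
#include <linux/fs.h>
#include <linux/sched.h>

/*
 * Sketch of the submit_bio() side: charge writeback IO to the cgroup
 * that first dirtied the page, charge sync IO to the current task's
 * cgroup, and throttle when the cgroup is over its limit.
 */
static void sketch_throttle_bio(int rw, struct bio *bio)
{
	struct page *page = bio_page(bio);	/* first page of the bio */
	unsigned short id;

	if (rw & WRITE)
		id = iothrottle_page_id(page);		/* id stamped at dirty time */
	else
		id = iothrottle_css_id(current);	/* charge the sync reader */

	iothrottle_charge(id, bio->bi_size);		/* hypothetical */
	if (iothrottle_over_limit(id))			/* hypothetical */
		iothrottle_sleep(id);			/* hypothetical */
}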

-Andrea
