Message-ID: <20090925052911.GK4590@balbir.in.ibm.com>
Date: Fri, 25 Sep 2009 10:59:12 +0530
From: Balbir Singh <balbir@...ux.vnet.ibm.com>
To: KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
Vivek Goyal <vgoyal@...hat.com>, linux-kernel@...r.kernel.org,
jens.axboe@...cle.com, containers@...ts.linux-foundation.org,
dm-devel@...hat.com, nauman@...gle.com, dpshah@...gle.com,
lizf@...fujitsu.com, mikew@...gle.com, fchecconi@...il.com,
paolo.valente@...more.it, ryov@...inux.co.jp,
fernando@....ntt.co.jp, s-uchida@...jp.nec.com, taka@...inux.co.jp,
guijianfeng@...fujitsu.com, jmoyer@...hat.com,
dhaval@...ux.vnet.ibm.com, righi.andrea@...il.com,
m-ikeda@...jp.nec.com, agk@...hat.com, peterz@...radead.org,
jmarchan@...hat.com, torvalds@...ux-foundation.org, mingo@...e.hu,
riel@...hat.com
Subject: Re: IO scheduler based IO controller V10
* KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com> [2009-09-25 10:18:21]:
> On Fri, 25 Sep 2009 10:09:52 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com> wrote:
>
> > On Thu, 24 Sep 2009 14:33:15 -0700
> > Andrew Morton <akpm@...ux-foundation.org> wrote:
> > > > Test5 (Fairness for async writes, Buffered Write Vs Buffered Write)
> > > > ===================================================================
> > > > Fairness for async writes is tricky, and the biggest reason is that async
> > > > writes are cached in higher layers (page cache), and possibly in the file
> > > > system layer as well (btrfs, xfs etc.), and are not necessarily dispatched
> > > > to lower layers in a proportional manner.
> > > >
> > > > For example, consider two dd threads reading /dev/zero as their input and
> > > > writing out huge files. Very soon we cross vm_dirty_ratio and a dd thread is
> > > > forced to write out some pages to disk before more pages can be dirtied. But
> > > > the dirty pages picked are not necessarily those of the same thread. Writeout
> > > > can very well pick the inode of the lower-priority dd thread. So effectively
> > > > the higher-weight dd does writeouts of the lower-weight dd's pages, and we
> > > > see no service differentiation.
> > > >
> > > > IOW, the core problem with buffered write fairness is that the higher-weight
> > > > thread does not throw enough IO traffic at the IO controller to keep the
> > > > queue continuously backlogged. In my testing, there are many 0.2 to 0.8
> > > > second intervals where the higher-weight queue is empty, and in that time
> > > > the lower-weight queue gets a lot of work done, giving the impression that
> > > > there was no service differentiation.
> > > >
> > > > In summary, from the IO controller's point of view, async write support is
> > > > there. Because the page cache has not been designed so that a higher
> > > > prio/weight writer can do more writeout than a lower prio/weight writer,
> > > > getting service differentiation is hard, and it is visible in some cases
> > > > and not in others.
> > >
> > > Here's where it all falls to pieces.
> > >
> > > For async writeback we just don't care about IO priorities. Because
> > > from the point of view of the userspace task, the write was async! It
> > > occurred at memory bandwidth speed.
> > >
> > > It's only when the kernel's dirty memory thresholds start to get
> > > exceeded that we start to care about prioritisation. And at that time,
> > > all dirty memory (within a memcg?) is equal - a high-ioprio dirty page
> > > consumes just as much memory as a low-ioprio dirty page.
> > >
> > > So when balance_dirty_pages() hits, what do we want to do?
> > >
> > > I suppose that all we can do is to block low-ioprio processes more
> > > aggressively at the VFS layer, to reduce the rate at which they're
> > > dirtying memory so as to give high-ioprio processes more of the disk
> > > bandwidth.
> > >
> > > But you've gone and implemented all of this stuff at the io-controller
> > > level and not at the VFS level so you're, umm, screwed.
> > >
> >
> > I think I must support a dirty ratio in the memcg layer. But not yet.
>
We need to add this to the TODO list.
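Roughly, I would expect balance_dirty_pages() to consult a per-memcg dirty
limit in addition to the global one. A minimal sketch of that check, assuming
hypothetical mem_cgroup_dirty_pages() and mem_cgroup_dirty_limit() helpers
(nothing like them exists yet):

/*
 * Sketch only: a per-memcg dirty threshold check that
 * balance_dirty_pages() could call before deciding to throttle.
 * mem_cgroup_dirty_pages() and mem_cgroup_dirty_limit() are assumed
 * helpers here, not existing kernel interfaces.
 */
static bool mem_cgroup_over_dirty_limit(struct mem_cgroup *memcg)
{
	unsigned long dirty = mem_cgroup_dirty_pages(memcg);	/* assumed */
	unsigned long limit = mem_cgroup_dirty_limit(memcg);	/* assumed */

	return dirty > limit;
}

The throttling decision itself would still live in balance_dirty_pages(); the
memcg side only needs to answer whether the group is over its share of dirty
memory.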
> OR...I'll add a buffered-write-cgroup to track buffered writebacks.
> And add a control knob such as
> buffered_write.nr_dirty_thresh
> to limit the number of dirty pages generated via a cgroup.
>
> Because memcg just records the owner of a page but does not record who makes
> it dirty, this may be better. Maybe I can reuse page_cgroup and Ryo's blockio
> cgroup code.
Very good point, this is crucial for shared pages.
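If the dirtier is recorded when a page goes dirty, shared pages get charged to
whoever dirtied them rather than to the first-touch owner. A very rough sketch
of that idea (the pc->dirtier field and the blkio_cgroup_id() helper are my
assumptions, not existing code):

/*
 * Illustration only: remember which cgroup dirtied the page in
 * page_cgroup, so that later writeback can be charged to the dirtier
 * instead of the page's memcg owner.  pc->dirtier and
 * blkio_cgroup_id() are assumed, not existing interfaces.
 */
static void page_cgroup_record_dirtier(struct page *page)
{
	struct page_cgroup *pc = lookup_page_cgroup(page);

	if (pc)
		pc->dirtier = blkio_cgroup_id(current);	/* assumed */
}

A hook along these lines would presumably be called from the set_page_dirty()
paths, which is where the page_cgroup reuse you mention would go.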
>
> But I'm not sure how I should treat I/Os generated by kswapd.
>
Account them to process 0 :)
--
Balbir