Message-Id: <20090424092609.aa1da56a.kamezawa.hiroyu@jp.fujitsu.com>
Date: Fri, 24 Apr 2009 09:26:09 +0900
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>
To: Andrea Righi <righi.andrea@...il.com>
Cc: Theodore Tso <tytso@....edu>, akpm@...ux-foundation.org,
randy.dunlap@...cle.com, Carl Henrik Lunde <chlunde@...g.uio.no>,
Jens Axboe <jens.axboe@...cle.com>, eric.rannaud@...il.com,
Balbir Singh <balbir@...ux.vnet.ibm.com>,
fernando@....ntt.co.jp, dradford@...ehost.com,
Gui@...p1.linux-foundation.org, agk@...rceware.org,
subrata@...ux.vnet.ibm.com, Paul Menage <menage@...gle.com>,
containers@...ts.linux-foundation.org,
linux-kernel@...r.kernel.org, dave@...ux.vnet.ibm.com,
matt@...ehost.com, roberto@...it.it, ngupta@...gle.com
Subject: Re: [PATCH 9/9] ext3: do not throttle metadata and journal IO
On Thu, 23 Apr 2009 23:13:04 +0200
Andrea Righi <righi.andrea@...il.com> wrote:
> On Thu, Apr 23, 2009 at 08:17:45AM -0400, Theodore Tso wrote:
> > On Thu, Apr 23, 2009 at 11:44:24AM +0200, Andrea Righi wrote:
> > > This is true in part. Actually, io-throttle v12 has been extensively
> > > tested, also in production environments (Matt and David in cc can
> > > confirm this), with quite interesting results.
> > >
> > > I usually tested the previous versions with many parallel iozone and
> > > dd runs, using many different configurations.
> > >
> > > In v12 writeback IO is not actually limited; what io-throttle did was
> > > account and limit reads and direct IO in submit_bio(), and account and
> > > limit page cache writes in balance_dirty_pages_ratelimited_nr().
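
(For readers who did not follow the patchset, a rough user-space sketch of
that split: one charge point for reads/direct I/O at submission time, one
for buffered writes at dirtying time, both feeding a per-cgroup bandwidth
budget. Every name here is invented for illustration; this is not the
io-throttle code itself.)

/*
 * User-space sketch only: per-cgroup bandwidth accounting with two
 * charge points, mirroring the v12 scheme described above.  All names
 * are invented.
 */
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>

struct iot_cgroup {
        uint64_t limit_bps;     /* configured bandwidth limit (bytes/sec) */
        uint64_t charged;       /* bytes charged in the current window */
        time_t   window;        /* start of the current 1-second window */
};

/* charge 'bytes' to the cgroup; sleep if the limit is exceeded */
static void iot_charge(struct iot_cgroup *cg, uint64_t bytes)
{
        time_t now = time(NULL);

        if (now != cg->window) {                /* new accounting window */
                cg->window = now;
                cg->charged = 0;
        }
        cg->charged += bytes;
        if (cg->charged > cg->limit_bps) {
                sleep(1);       /* crude stand-in for the throttling sleep */
                cg->window = time(NULL);
                cg->charged = 0;
        }
}

/* this charge point stands in for the hook in submit_bio():
 * reads and direct I/O are charged at submission time */
static void charge_submit_bio(struct iot_cgroup *cg, uint64_t bytes)
{
        iot_charge(cg, bytes);
}

/* this one stands in for the hook in balance_dirty_pages_ratelimited_nr():
 * buffered writes are charged when the pages are dirtied */
static void charge_page_dirtying(struct iot_cgroup *cg, uint64_t bytes)
{
        iot_charge(cg, bytes);
}

int main(void)
{
        struct iot_cgroup cg = { .limit_bps = 50ull << 20 };    /* 50 MB/s */

        charge_submit_bio(&cg, 4096);           /* a direct read */
        charge_page_dirtying(&cg, 4096);        /* a buffered write */
        printf("charged %llu bytes this window\n",
               (unsigned long long)cg.charged);
        return 0;
}
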
> >
> > Did the testing include what happened if the system was also
> > simultaneously under memory pressure? What you might find happening
> > then is that the cgroups which have lots of dirty pages, which are not
> > getting written out, have their memory usage "protected", while
> > cgroups that have lots of clean pages have more of their pages
> > (unfairly) evicted from memory. The worst case, of course, would be
> > if the memory pressure is coming from an uncapped cgroup.
>
> This is an interesting case that should be considered, of course. The
> tests I did were mainly focused on distinct environments where each
> cgroup writes its own files and dirties its own memory. I'll add this
> case to the next round of tests I'll do with io-throttle.
>
> But it's a general problem IMHO and doesn't depend only on the presence
> of an IO controller. The same issue can happen if a cgroup reads a file
> from a slow device and another cgroup writes to all the pages of the
> other cgroup.
>
> Maybe this kind of cgroup unfairness should be addressed by the memory
> controller; in this particular case the IO controller should behave just
> like another slow device.
>
"soft limit"...for selecting victim at memory shortage is under development.
> >
> > So that's basically the same worry I have; which is we're looking at
> > things at a too-low-level basis, and not at the big picture.
> >
> > There wasn't discussion about the I/O controller on this thread at
> > all, at least as far as I could find; nor that splitting the problem
> > was the right way to solve the problem. Maybe somewhere there was a
> > call for someone to step back and take a look at the "big picture"
> > (what I've been calling the high level design), but I didn't see it in
> > the thread.
> >
> > It would seem to be much simpler if there was a single tuning knob for
> > the I/O controller and for dirty page writeback --- after all, why
> > *else* would you be trying to control the rate at which pages get
> > dirty? And if you have a cgroup which sometimes does a lot of writes
>
> Actually we do already control the rate at which dirty pages are
> generated. In balance_dirty_pages() we add a congestion_wait() when the
> bdi is congested.
>
> We do that when we write to a slow device for example. Slow because it
> is intrinsically slow or because it is limited by some IO controlling
> rules.
>
> It is a very similar issue IMHO.
>
I think so, too.
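
(A rough illustration of that backoff, not the real mm/page-writeback.c
code: the dirtier is made to wait whenever too many dirty pages are
outstanding, so a slow or throttled device automatically pushes back on
writers. The _stub names and numbers are invented.)

/*
 * Simplified user-space sketch, not the real mm/page-writeback.c:
 * the writer blocks in balance_dirty_pages_stub() while writeback
 * (here, congestion_wait_stub()) slowly retires dirty pages.
 */
#include <stdio.h>
#include <unistd.h>

#define DIRTY_LIMIT_PAGES 1024          /* stand-in for the dirty threshold */

static unsigned long nr_dirty;          /* pages dirtied, not yet written */
static const unsigned long writeback_rate = 64; /* pages retired per wait */

/* stand-in for congestion_wait(): give writeback time to make progress */
static void congestion_wait_stub(void)
{
        usleep(100 * 1000);             /* 100 ms */
        nr_dirty = nr_dirty > writeback_rate ? nr_dirty - writeback_rate : 0;
}

/* stand-in for balance_dirty_pages(): called after dirtying pages */
static void balance_dirty_pages_stub(unsigned long pages_dirtied)
{
        nr_dirty += pages_dirtied;
        while (nr_dirty > DIRTY_LIMIT_PAGES)
                congestion_wait_stub(); /* the writer is blocked here */
}

int main(void)
{
        int i;

        for (i = 0; i < 100; i++)
                balance_dirty_pages_stub(32);   /* a writer dirtying pages */
        printf("final nr_dirty = %lu\n", nr_dirty);
        return 0;
}
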
> > via direct I/O, and sometimes does a lot of writes through the page
> > cache, and sometimes does *both*, it would seem to me that if you want
> > to be able to smoothly limit the amount of I/O it does, you would want
> > to account and charge for direct I/O and page cache I/O under the same
> > "bucket". Is that what the user would want?
> >
> > Suppose you only have 200 MB/sec worth of disk bandwidth, and you
> > parcel it out in 50 MB/sec chunks to 4 cgroups. But you also parcel
> > out 50MB/sec of dirty writepages quota to each of the 4 cgroups.
"50MB/sec of dirty writepages" sounds strange. If we use logic like
dirty_ratio, it's just a "50MB dirty page limit", not 50MB/sec.
> > Now suppose one of the cgroups, which was normally doing not much of
> > anything, suddenly starts doing a database backup which does 50 MB/sec
> > of direct I/O reading from the database file, and 50 MB/sec dirtying
> > pages in the page cache as it writes the backup file. Suddenly that
> > one cgroup is using half of the system's I/O bandwidth!
>
Hmm, wouldn't buffered I/O tracking be a help here? Of course the I/O
controller should chase this. And dirty_ratio gives 50MB, not 50MB/sec.
Then reads will slow down very soon if the reads and writes are done by
one thread. (I'm not sure what happens with two threads, one doing only
reads and the other only writes.) BTW, can read bandwidth and write
bandwidth be handled under a single limit?
> Agreed. The bucket should be the same. For this case the dirty memory
> should probably be limited only in terms of "space" instead of bandwidth.
>
> And we should guarantee that a cgroup doesn't unfairly fill memory with
> dirty pages (system-wide or in other cgroups).
>
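
(A rough sketch of that "space, not bandwidth" idea: each cgroup gets a
cap on outstanding dirty pages and the dirtying path is blocked or
throttled once the cap is hit. memcg_stub and the numbers are invented,
purely illustrative.)

/*
 * Sketch of a per-cgroup dirty "space" cap.  memcg_stub and the
 * numbers are invented; only the shape of the check matters.
 */
#include <stdbool.h>
#include <stdio.h>

struct memcg_stub {
        const char   *name;
        unsigned long dirty_pages;      /* pages dirtied, not yet cleaned */
        unsigned long dirty_limit;      /* per-cgroup cap: space, not B/s */
};

/* would be consulted on the page-dirtying path; false means the
 * caller must wait for writeback before dirtying more pages */
static bool memcg_may_dirty(const struct memcg_stub *cg)
{
        return cg->dirty_pages < cg->dirty_limit;
}

int main(void)
{
        struct memcg_stub backup = { "backup", 0, 256 };
        unsigned long denied = 0;
        int i;

        for (i = 0; i < 1000; i++) {
                if (memcg_may_dirty(&backup))
                        backup.dirty_pages++;   /* page becomes dirty */
                else
                        denied++;               /* would block/throttle here */
        }
        printf("%s: %lu dirty, %lu throttled attempts\n",
               backup.name, backup.dirty_pages, denied);
        return 0;
}
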
> >
> > And before you say this is "correct" from a definitional point of
> > view, is it "correct" from what a system administrator would want to
> > control? Is it the right __feature__? If you just say, well, we
> > defined the problem that way, and we're doing things the way we
> > defined it, that's a case of garbage in, garbage out. You also have
> > to ask the question, "did we define the _problem_ in the right way?"
> > What does the user of this feature really want to do?
> >
> > It would seem to me that the system administrator would want a single
> > knob, saying "I don't know or care how the processes in a cgroup does
> > its I/O; I just want to limit things so that the cgroup can only hog
> > 25% of the I/O bandwidth."
>
> Agreed.
>
Agreed. That would be the best.
> >
> > And note this is completely separate from the question of what happens
> > if you throttle I/O in the page cache writeback loop, and you end up
> > with an imbalance in the clean/dirty ratios of the cgroups.
A dirty_ratio for memcg is planned, just delayed.
> > And
> > looking at this thread, life gets even *more* amusing on NUMA machines
> > if you do this; what if you end up starving a cpuset as a result of
> > this I/O balancing decision, so a particular cpuset doesn't have
> > enough memory? That's when you'll *definitely* start having OOM
> > problems.
> >
cpuset users shouldn't use I/O limiting, in general.
Or the I/O controller should have a switch like "don't apply the I/O limit
when the I/O comes from kswapd/vmscan.c" (or categorize it as kernel I/O).
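Something with roughly this shape, perhaps; io_throttle_should_charge()
and task_stub are invented stand-ins for the real hook and task_struct,
so this is only a sketch of the policy:

/*
 * Only a sketch of the policy: I/O issued by kswapd (memory reclaim)
 * bypasses the cgroup limit unless the administrator asked to account
 * kernel I/O too.  task_stub and PF_KSWAPD_STUB stand in for the
 * kernel's task_struct and PF_KSWAPD.
 */
#include <stdbool.h>
#include <stdio.h>

#define PF_KSWAPD_STUB  0x00040000      /* stand-in for PF_KSWAPD */

struct task_stub {
        unsigned long flags;
};

/* hypothetical policy hook, consulted before charging/delaying a bio */
static bool io_throttle_should_charge(const struct task_stub *tsk,
                                      bool limit_kernel_io)
{
        if ((tsk->flags & PF_KSWAPD_STUB) && !limit_kernel_io)
                return false;           /* reclaim I/O is never delayed */
        return true;
}

int main(void)
{
        struct task_stub app = { 0 };
        struct task_stub kswapd = { PF_KSWAPD_STUB };

        printf("app charged:    %d\n", io_throttle_should_charge(&app, false));
        printf("kswapd charged: %d\n", io_throttle_should_charge(&kswapd, false));
        return 0;
}
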
> Honestly, I had never considered the cgroup "interactions" and the
> unfair distribution of dirty pages among cgroups, for example, as
> correctly pointed out by Ted.
>
If we really want that, the scheduler cgroup should be considered, too.
Looking at it optimistically, 99% of cgroup users will use "containers",
and all the resource-control cgroups will be set up at once. Then
user-land container tools can tell users whether the container has a good
balance (of cpu, memory, I/O, etc.) or not.
_Interactions_ are important. But cgroup is designed to have many
independent subsystems because it's considered generic infrastructure.
I didn't read the cgroup design discussion, but it seems strange to say
"we need balance among subsystems in the kernel" only _now_.
A container, the user interface to cgroups that most people think of,
should know about that. If we can't do it in user land, we should find a
way to handle the _interactions_ in the kernel, of course.
Thanks,
-Kame