linux-kernel - Re: [RFC] writeback and cgroup

Message-ID: <20120418075814.GA3809@localhost>
Date:	Wed, 18 Apr 2012 15:58:14 +0800
From:	Fengguang Wu <fengguang.wu@...el.com>
To:	Jan Kara <jack@...e.cz>
Cc:	Tejun Heo <tj@...nel.org>, vgoyal@...hat.com,
	Jens Axboe <axboe@...nel.dk>, linux-mm@...ck.org,
	sjayaraman@...e.com, andrea@...terlinux.com, jmoyer@...hat.com,
	linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
	kamezawa.hiroyu@...fujitsu.com, lizefan@...wei.com,
	containers@...ts.linux-foundation.org, cgroups@...r.kernel.org,
	ctalbott@...gle.com, rni@...gle.com, lsf@...ts.linux-foundation.org
Subject: Re: [RFC] writeback and cgroup

On Wed, Apr 18, 2012 at 08:57:20AM +0200, Jan Kara wrote:
> On Fri 06-04-12 02:59:34, Wu Fengguang wrote:
> ...
> > > > > Let's please keep the layering clear.  IO limitations will be applied
> > > > > at the block layer and pressure will be formed there and then
> > > > > propagated upwards eventually to the originator.  Sure, exposing the
> > > > > whole information might result in better behavior for certain
> > > > > workloads, but down the road, say, in three or five years, devices
> > > > > which can be shared without worrying too much about seeks might be
> > > > > commonplace and we could be swearing at a disgusting structural mess,
> > > > > and sadly various cgroup support seems to be a prominent source of
> > > > > such design failures.
> > > > 
> > > > Super fast storage is coming, which will make us regret making the
> > > > IO path overly complex.  Spinning disks are not going away anytime soon.
> > > > I doubt Google is willing to afford the disk seek costs on its
> > > > millions of disks and has the patience to wait until switching all of
> > > > the spin disks to SSD years later (if it will ever happen).
> > > 
> > > This is new.  Let's keep the damn employer out of the discussion.
> > > While the area I work on is affected by my employment (writeback isn't
> > > even my area BTW), I'm not gonna do something adverse to upstream even
> > > if it's beneficial to google and I'm much more likely to do something
> > > which may hurt google a bit if it's gonna benefit upstream.
> > > 
> > > As for the faster / newer storage argument, that is *exactly* why we
> > > want to keep the layering proper.  Writeback works from the pressure
> > > from the IO stack.  If IO technology changes, we update the IO stack
> > > and writeback still works from the pressure.  It may need to be
> > > adjusted but the principles don't change.
> > 
> > To me, balance_dirty_pages() is *the* proper layer for buffered writes.
> > It's always there doing 1:1 proportional throttling. Then you try to
> > kick in to add *double* throttling in block/cfq layer. Now the low
> > layer may enforce 10:1 throttling and push balance_dirty_pages() away
> > from its balanced state, leading to large fluctuations and program
> > stalls.  This can be avoided by telling balance_dirty_pages(): "your
> > balance goal is no longer 1:1, but 10:1". With this information
> > balance_dirty_pages() will behave right. Then there is the question:
> > if balance_dirty_pages() will work just well provided the information,
> > why bother doing the throttling at low layer and "push back" the
> > pressure all the way up?
>   Fengguang, maybe we should first agree on some basics:
>   The two main goals of balance_dirty_pages() are (and always have been
> AFAIK) to limit amount of dirty pages in memory and keep enough dirty pages
> in memory to allow for efficient writeback. Secondary goals are to also
> keep amount of dirty pages somewhat fair among bdis and processes. Agreed?

Agreed. In fact, before the IO-less change, balance_dirty_pages() had
little explicit control over the dirty rate and fairness.

> Thus shift to trying to control *IO throughput* (or even just buffered
> write throughput) from balance_dirty_pages() is a fundamental shift in the
> goals of balance_dirty_pages(), not just some tweak (although technically,
> it might be relatively easy to do for buffered writes given the current
> implementation).

Yes, it has been a big shift to rate-based dirty control.
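
To make the "1:1" versus "10:1" balance goal from the quoted text
concrete, here is a toy sketch. It is not the kernel algorithm: the
function name, the per-task weights and the bandwidth figure are all
hypothetical, and it only shows what a weighted proportional split of
the writeout bandwidth would look like.

```python
def split_dirty_rate(write_bw, weights):
    """Split write_bw (e.g. MB/s) among dirtiers in proportion to weights.

    Toy model only: 'weights' is a hypothetical per-task (or per-cgroup)
    share, not an existing kernel interface.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return {name: write_bw * w / total for name, w in weights.items()}


# 1:1 goal: both dirtiers are allowed the same dirty rate.
even = split_dirty_rate(100, {"A": 1, "B": 1})

# 10:1 goal: A is allowed ten times the dirty rate of B.
skewed = split_dirty_rate(110, {"A": 10, "B": 1})
```

With equal weights each dirtier gets half of the bandwidth; with 10:1
weights the same function yields the skewed targets directly, instead
of having them emerge from a lower layer pushing back.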

> ...
> > > Well, I tried and I hope some of it got through.  I also wrote a lot
> > > of questions, mainly regarding how what you have in mind is supposed
> > > to work through what path.  Maybe I'm just not seeing what you're
> > > seeing but I just can't see where all the IOs would go through and
> > > come together.  Can you please elaborate more on that?
> > 
> > What I can see is, it looks pretty simple and natural to let
> > balance_dirty_pages() fill the gap towards a total solution :-)
> > 
> > - add direct IO accounting in some convenient point of the IO path
> >   IO submission or completion point, either is fine.
> > 
> > - change several lines of the buffered write IO controller to
> >   integrate the direct IO rate into the formula to fit the "total
> >   IO" limit
> > 
> > - in future, add more accounting as well as feedback control to make
> >   balance_dirty_pages() work with IOPS and disk time
>   Sorry Fengguang but I also think this is a wrong way to go.
> balance_dirty_pages() must primarily control the amount of dirty pages.
> Trying to bend it to control IO throughput by including direct IO and
> reads in the accounting will just make the logic even more complex than it
> already is.

Right, I have been adding too much complexity to balance_dirty_pages().
The control algorithms are pretty hard to understand and get right for
all cases.
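
For reference, the second item of the quoted list (folding the direct
IO rate into the buffered write limit) amounts to something like the
following toy sketch. The names and units are hypothetical and this is
not the actual controller code; it assumes a single "total IO" limit
and a measured direct IO rate in the same unit.

```python
def buffered_write_budget(total_io_limit, direct_io_rate):
    """Rate left for buffered writes once direct IO is accounted for.

    Toy model only: both arguments are hypothetical rates in the same
    unit (e.g. MB/s). The result is clamped at zero, so direct IO
    exceeding the limit leaves no budget for buffered writes.
    """
    return max(0.0, total_io_limit - direct_io_rate)


# 100 MB/s total limit with 30 MB/s of direct IO in flight.
remaining = buffered_write_budget(100.0, 30.0)

# Direct IO alone already exceeds the limit.
exhausted = buffered_write_budget(100.0, 150.0)
```

Even in this reduced form the buffered write target now depends on a
second measured rate, which hints at where the extra complexity comes
from.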

OK, I'll post results of my experiments up to now, answer some
questions and take a comfortable break. Phooo..

Thanks,
Fengguang
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/