Message-ID: <20120418065720.GA21485@quack.suse.cz>
Date:	Wed, 18 Apr 2012 08:57:20 +0200
From:	Jan Kara <jack@...e.cz>
To:	Fengguang Wu <fengguang.wu@...el.com>
Cc:	Tejun Heo <tj@...nel.org>, Jan Kara <jack@...e.cz>,
	vgoyal@...hat.com, Jens Axboe <axboe@...nel.dk>,
	linux-mm@...ck.org, sjayaraman@...e.com, andrea@...terlinux.com,
	jmoyer@...hat.com, linux-fsdevel@...r.kernel.org,
	linux-kernel@...r.kernel.org, kamezawa.hiroyu@...fujitsu.com,
	lizefan@...wei.com, containers@...ts.linux-foundation.org,
	cgroups@...r.kernel.org, ctalbott@...gle.com, rni@...gle.com,
	lsf@...ts.linux-foundation.org
Subject: Re: [RFC] writeback and cgroup

On Fri 06-04-12 02:59:34, Wu Fengguang wrote:
...
> > > > Let's please keep the layering clear.  IO limitations will be
> > > > applied at the block layer, and pressure will form there and then
> > > > propagate upwards, eventually reaching the originator.  Sure,
> > > > exposing the full information might result in better behavior for
> > > > certain workloads, but down the road, say, in three or five years,
> > > > devices which can be shared without worrying too much about seeks
> > > > might be commonplace, and we could be swearing at a disgusting
> > > > structural mess; sadly, various cgroup support seems to be a
> > > > prominent source of such design failures.
> > > 
> > > Super fast storage is coming, which will make us regret
> > > over-complicating the IO path.  But spinning disks are not going
> > > away anytime soon.  I doubt Google is willing to afford the disk
> > > seek costs on its millions of disks, or has the patience to wait
> > > years until all of its spinning disks are switched to SSDs (if
> > > that ever happens).
> > 
> > This is new.  Let's keep the damn employer out of the discussion.
> > While the area I work on is affected by my employment (writeback
> > isn't even my area, BTW), I'm not gonna do something adverse to
> > upstream even if it benefits Google, and I'm much more likely to do
> > something that hurts Google a bit if it benefits upstream.
> > 
> > As for the faster / newer storage argument, that is *exactly* why we
> > want to keep the layering proper.  Writeback works from the pressure
> > exerted by the IO stack.  If IO technology changes, we update the IO
> > stack and writeback still works from the pressure.  It may need to
> > be adjusted, but the principles don't change.
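
[A toy user-space model of this "pressure from below" principle;
block_layer_accepts_io() and submit_one_page() are hypothetical
stand-ins, not kernel API:

#include <stdbool.h>
#include <unistd.h>

/* Hypothetical lower layer: refuses work while its queue is full. */
extern bool block_layer_accepts_io(void);
extern void submit_one_page(void);

static void writeback_pages(int nr_pages)
{
	while (nr_pages > 0) {
		if (!block_layer_accepts_io()) {
			/* Back off: this is the pressure propagating
			 * upwards.  Writeback needs no knowledge of *why*
			 * the lower layer is slow (seeks, cgroup limits,
			 * a slow device, ...). */
			usleep(10 * 1000);
			continue;
		}
		submit_one_page();
		nr_pages--;
	}
}
]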
> 
> To me, balance_dirty_pages() is *the* proper layer for throttling
> buffered writes.  It is always there, doing 1:1 proportional
> throttling.  Then you kick in and add *double* throttling in the
> block/cfq layer.  Now the lower layer may enforce 10:1 throttling and
> push balance_dirty_pages() away from its balanced state, leading to
> large fluctuations and program stalls.  This can be avoided by telling
> balance_dirty_pages(): "your balance goal is no longer 1:1, but 10:1".
> Given that information, balance_dirty_pages() will behave correctly.
> Which raises the question: if balance_dirty_pages() works just fine
> once given the information, why bother throttling at the lower layer
> and "pushing back" the pressure all the way up?
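
[A minimal sketch of the arithmetic behind that point; task_ratelimit()
and the N:1 ratio parameter are hypothetical, not the kernel's actual
formula:

#include <stdio.h>

/* Hypothetical: scale the dirtier's allowed rate by the N:1 throttle
 * ratio the lower layer enforces.  At 1:1 the task keeps the full
 * rate; at 10:1 it must be given one tenth, or the two throttling
 * layers fight each other. */
static unsigned long task_ratelimit(unsigned long base_kbps,
				    unsigned int ratio)
{
	return base_kbps / ratio;
}

int main(void)
{
	printf("1:1  -> %lu KB/s\n", task_ratelimit(100000, 1));
	printf("10:1 -> %lu KB/s\n", task_ratelimit(100000, 10));
	return 0;
}
]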
  Fengguang, maybe we should first agree on some basics:
  The two main goals of balance_dirty_pages() are (and always have been,
AFAIK) to limit the amount of dirty pages in memory and to keep enough
dirty pages in memory to allow for efficient writeback. A secondary goal
is to keep the amount of dirty pages reasonably fair among bdis and
processes. Agreed?

Thus a shift toward controlling *IO throughput* (or even just buffered
write throughput) from balance_dirty_pages() is a fundamental change in
its goals, not just a tweak (although technically it might be
relatively easy to do for buffered writes given the current
implementation).
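
[Those primary goals reduce to keeping nr_dirty inside a window; a
minimal model with made-up thresholds follows (the kernel derives its
actual limits dynamically from the amount of dirtyable memory):

#include <stdbool.h>

/* Illustrative fixed thresholds, not the kernel's. */
#define DIRTY_BACKGROUND_PAGES	(16UL * 1024)	/* start writeback above this */
#define DIRTY_LIMIT_PAGES	(64UL * 1024)	/* throttle dirtiers here */

/* Goal 1: never let the number of dirty pages exceed the hard limit. */
static bool must_throttle(unsigned long nr_dirty)
{
	return nr_dirty >= DIRTY_LIMIT_PAGES;
}

/* Goal 2: keep enough dirty pages around that writeback can batch them
 * efficiently, i.e. do not start writing out too eagerly. */
static bool background_writeback_due(unsigned long nr_dirty)
{
	return nr_dirty > DIRTY_BACKGROUND_PAGES;
}
]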

...
> > Well, I tried and I hope some of it got through.  I also wrote a lot
> > of questions, mainly regarding how what you have in mind is supposed
> > to work through what path.  Maybe I'm just not seeing what you're
> > seeing but I just can't see where all the IOs would go through and
> > come together.  Can you please elaborate more on that?
> 
> From what I can see, it looks pretty simple and natural to let
> balance_dirty_pages() fill the gap towards a total solution :-)
> 
> - add direct IO accounting at some convenient point in the IO path;
>   either the submission or the completion point is fine.
> 
> - change several lines of the buffered write IO controller to fold
>   the direct IO rate into the formula so that it fits the "total IO"
>   limit (sketched below)
> 
> - in the future, add more accounting as well as feedback control to
>   make balance_dirty_pages() work with IOPS and disk time
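
[A sketch of what that second step might look like, assuming a
hypothetical dio_rate figure produced by the accounting in the first
step; the names and the formula are illustrative, not an existing
kernel interface:

/* Hypothetical: give buffered dirtiers whatever is left of the "total
 * IO" budget once the measured direct IO rate is subtracted. */
static unsigned long buffered_write_limit(unsigned long total_limit_kbps,
					  unsigned long dio_rate_kbps)
{
	if (dio_rate_kbps >= total_limit_kbps)
		return 0;	/* direct IO is consuming the whole budget */
	return total_limit_kbps - dio_rate_kbps;
}
]
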
  Sorry Fengguang, but I also think this is the wrong way to go.
balance_dirty_pages() must primarily control the amount of dirty pages.
Trying to bend it to control IO throughput by including direct IO and
reads in the accounting will only make the logic even more complex than
it already is.

								Honza
