[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20120926202247.GA20920@quack.suse.cz>
Date: Wed, 26 Sep 2012 22:22:47 +0200
From: Jan Kara <jack@...e.cz>
To: Fengguang Wu <fengguang.wu@...el.com>
Cc: Jan Kara <jack@...e.cz>, Namjae Jeon <linkinjeon@...il.com>,
Dave Chinner <david@...morbit.com>,
linux-kernel@...r.kernel.org,
Namjae Jeon <namjae.jeon@...sung.com>,
Vivek Trivedi <t.vivek@...sung.com>,
Linux Memory Management List <linux-mm@...ck.org>,
linux-fsdevel@...r.kernel.org
Subject: Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per
bdi variable
On Thu 27-09-12 00:56:02, Wu Fengguang wrote:
> On Tue, Sep 25, 2012 at 12:23:06AM +0200, Jan Kara wrote:
> > On Thu 20-09-12 16:44:22, Wu Fengguang wrote:
> > > On Sun, Sep 16, 2012 at 08:25:42AM -0400, Namjae Jeon wrote:
> > > > From: Namjae Jeon <namjae.jeon@...sung.com>
> > > >
> > > > This patch is based on suggestion by Wu Fengguang:
> > > > https://lkml.org/lkml/2011/8/19/19
> > > >
> > > > kernel has mechanism to do writeback as per dirty_ratio and dirty_background
> > > > ratio. It also maintains per task dirty rate limit to keep balance of
> > > > dirty pages at any given instance by doing bdi bandwidth estimation.
> > > >
> > > > Kernel also has max_ratio/min_ratio tunables to specify percentage of
> > > > writecache to control per bdi dirty limits and task throttling.
> > > >
> > > > However, there might be a usecase where user wants a per bdi writeback tuning
> > > > parameter to flush dirty data once per bdi dirty data reach a threshold
> > > > especially at NFS server.
> > > >
> > > > dirty_background_centisecs provides an interface where user can tune
> > > > background writeback start threshold using
> > > > /sys/block/sda/bdi/dirty_background_centisecs
> > > >
> > > > dirty_background_centisecs is used alongwith average bdi write bandwidth
> > > > estimation to start background writeback.
> > The functionality you describe, i.e. start flushing bdi when there's
> > reasonable amount of dirty data on it, looks sensible and useful. However
> > I'm not so sure whether the interface you propose is the right one.
> > Traditionally, we allow user to set amount of dirty data (either in bytes
> > or percentage of memory) when background writeback should start. You
> > propose setting the amount of data in centisecs-to-write. Why that
> > difference? Also this interface ties our throughput estimation code (which
> > is an implementation detail of current dirty throttling) with the userspace
> > API. So we'd have to maintain the estimation code forever, possibly also
> > face problems when we change the estimation code (and thus estimates in
> > some cases) and users will complain that the values they set originally no
> > longer work as they used to.
>
> Yes, that bandwidth estimation is not all that (and in theory cannot
> be made) reliable which may be a surprise to the user. Which make the
> interface flaky.
>
> > Also, as with each knob, there's a problem how to properly set its value?
> > Most admins won't know about the knob and so won't touch it. Others might
> > know about the knob but will have hard time figuring out what value should
> > they set. So if there's a new knob, it should have a sensible initial
> > value. And since this feature looks like a useful one, it shouldn't be
> > zero.
>
> Agreed in principle. There seems be no reasonable defaults for the
> centisecs-to-write interface, mainly due to its inaccurate nature,
> especially the initial value may be wildly wrong on fresh system
> bootup. This is also true for your proposed interfaces, see below.
>
> > So my personal preference would be to have bdi->dirty_background_ratio and
> > bdi->dirty_background_bytes and start background writeback whenever
> > one of global background limit and per-bdi background limit is exceeded. I
> > think this interface will do the job as well and it's easier to maintain in
> > future.
>
> bdi->dirty_background_ratio, if I understand its semantics right, is
> unfortunately flaky in the same principle as centisecs-to-write,
> because it relies on the (implicitly estimation of) writeout
> proportions. The writeout proportions for each bdi starts with 0,
> which is even worse than the 100MB/s initial value for
> bdi->write_bandwidth and will trigger background writeback on the
> first write.
Well, I meant bdi->dirty_backround_ratio wouldn't use writeout proportion
estimates at all. Limit would be
dirtiable_memory * bdi->dirty_backround_ratio.
After all we want to start writeout to bdi when we have enough pages to
reasonably load the device for a while which has nothing to do with how
much is written to this device as compared to other devices.
OTOH I'm not particularly attached to this interface. Especially since on a
lot of today's machines, 1% is rather big so people might often end up
using dirty_background_bytes anyway.
> bdi->dirty_background_bytes is, however, reliable, and gives users
> total control. If we export this interface alone, I'd imagine users
> who want to control centisecs-to-write could run a simple script to
> periodically get the write bandwith value out of the existing bdi
> interface and echo it into bdi->dirty_background_bytes. Which makes
> simple yet good enough centisecs-to-write controlling.
>
> So what do you think about exporting a really dumb
> bdi->dirty_background_bytes, which will effectively give smart users
> the freedom to do smart control over per-bdi background writeback
> threshold? The users are offered the freedom to do his own bandwidth
> estimation and choose not to rely on the kernel estimation, which will
> free us from the burden of maintaining a flaky interface as well. :)
That's fine with me. Just it would be nice if we gave
bdi->dirty_background_bytes some useful initial value. Maybe like
dirtiable_memory * dirty_background_ratio?
Honza
--
Jan Kara <jack@...e.cz>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists