Message-ID: <20110701145831.GB8475@localhost>
Date:	Fri, 1 Jul 2011 22:58:31 +0800
From:	Wu Fengguang <fengguang.wu@...el.com>
To:	Jan Kara <jack@...e.cz>
Cc:	"linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
	Dave Chinner <david@...morbit.com>,
	Christoph Hellwig <hch@...radead.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	"Li, Shaohua" <shaohua.li@...el.com>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 3/9] writeback: bdi write bandwidth estimation

Hi Jan,

On Fri, Jul 01, 2011 at 03:56:09AM +0800, Jan Kara wrote:
>   Hello,
> 
> On Wed 29-06-11 22:52:48, Wu Fengguang wrote:
> > The estimated value will start from 100MB/s and adapt to the real
> > bandwidth within seconds.
> > 
> > It tries to update the bandwidth only when the disk is fully utilized.
> > Any inactive period of more than one second will be skipped.
> > 
> > The estimated bandwidth will reflect how fast the device can write
> > out when _fully utilized_, and won't drop to 0 when it goes idle.
> > The value will remain constant at disk idle time. At busy write time,
> > if not considering fluctuations, it will also remain high unless
> > knocked down by possible concurrent reads that compete with async
> > writes for disk time and bandwidth.
> > 
> > The estimation is not done purely in the flusher because there is no
> > guarantee that write_cache_pages() will return in time to update the
> > bandwidth.
> > 
> > The bdi->avg_write_bandwidth smoothing is very effective at filtering
> > out sudden spikes; however, it may be a little biased in the long term.
> > 
> > The overhead is low because the bdi bandwidth update only occurs at
> > 200ms intervals.
> > 
> > The 200ms update interval is suitable, because it's not possible to
> > measure the real instantaneous bandwidth at all, due to large fluctuations.
> > 
> > The NFS commits can be as large as seconds' worth of data. One XFS
> > completion may be as large as half a second's worth of data if we are
> > going to increase the write chunk to half a second's worth of data. In
> > ext4, fluctuations with a period of around 5 seconds are observed. And
> > there is another pattern of irregular periods of up to 20 seconds in
> > SSD tests.
> > 
> > That's why we are not only doing the estimation at 200ms intervals, but
> > also averaging the results over a period of 3 seconds, and then going
> > further to apply another level of smoothing in avg_write_bandwidth.
>   I was thinking about your formulas and also observing how they behave
> when writeback happens while the disk carries other load as well (e.g.
> grep -r of a large tree or cp from another partition).
> 
> I agree that some kind of averaging is needed. But how we average depends
> on what we need the number for. My thought was that there is no such
> thing as *the* write bandwidth, since that depends on the background load
> on the disk and also on the type of writes we do (sequential, random), as
> you noted as well. What writeback in fact needs to estimate is "how fast
> can we write this bulk of data?". Since we should ultimately size dirty
> limits and other writeback tunables so that the bulk of data can be
> written in the order of seconds (but maybe much slower because of
> background load), I agree with your first choice that we should measure
> written pages in a window of several seconds - so your choice of 3
> seconds is OK - but this number should have the reasoning attached to it
> in a comment (something like my elaboration above ;)

Totally agree, and thanks for the reasoning in great detail ;)

> Now to your second level of smoothing - is it really useful? We already

It's useful for filtering out sudden disturbances. Oh, I forgot to show
the SSD case, which sees sudden drops in throughput:

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/1SSD-64G/ext4-1dd-1M-64p-64288M-20%25-2.6.38-rc6-dt6+-2011-03-01-16-19/balance_dirty_pages-bandwidth.png

It's also very effective for XFS (see the graph below).

> average over a several-second window, so that should really eliminate big
> spikes coming from bumpy IO completion from storage (NFS might be a
> separate question but we talked about that already). Your second level
> of smoothing just spreads the window even further, but if you find that
> necessary, why not make the window larger in the first place? Also my

Because they are two different types of smoothing. I employed the same
smoothing as avg_write_bandwidth for bdi_dirty, where I wrote this
comment to illustrate its unique usefulness:

+       /*
+        * This ...
+        * ... is most effective on XFS,
+        * whose pattern is
+        *                                                             .
+        *      [.] dirty       [-] avg                       .       .
+        *                                                   .       .
+        *              .         .         .         .     .       .
+        *      ---------------------------------------    .       .
+        *            .         .         .         .     .       .
+        *           .         .         .         .     .       .
+        *          .         .         .         .     .       .
+        *         .         .         .         .     .       .
+        *        .         .         .         .
+        *       .         .         .         .      (fluctuated)
+        *      .         .         .         .
+        *     .         .         .         .
+        *
+        * @avg will remain flat at the cost of being biased towards high. In
+        * practice the error tends to be much smaller: thanks to more coarse
+        * grained fluctuations, @avg becomes the real average number for the
+        * last two rising lines of @dirty.
+        */

> observations of avg-bandwidth and bandwidth when there's some background
> read load (e.g. from grep) show that in this case both bandwidth and
> avg-bandwidth fluctuate +-10%.

You can more easily see their fluctuation ranges in my graphs :)
For example,

(compare the YELLOW line to the RED dots)

NFS
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/NFS/nfs-1dd-1M-8p-2945M-20%25-2.6.38-rc6-dt6+-2011-02-22-21-09/balance_dirty_pages-bandwidth.png

XFS
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/4G/xfs-1dd-1M-8p-3927M-20%25-2.6.38-rc6-dt6+-2011-02-27-22-58/balance_dirty_pages-bandwidth.png

ext4
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v8/3G/ext4-1dd-4k-8p-2948M-20:10-3.0.0-rc2-next-20110610+-2011-06-12.21:57/balance_dirty_pages-bandwidth.png

> Finally, as I reasoned above, we are interested in how much we can write
> in the coming, say, 2 seconds, so I don't see much point in smoothing the
> value further (I'd even say it's undesirable). What do you think?

Yeah, we typically get a 20% ->write_bandwidth fluctuation range in the
above graphs (except for NFS), which seems reasonably good for guiding
the write chunk size.

However I would still like to favor a more stable value, and
->avg_write_bandwidth looks appealing with its much smaller 3%-10% range
in steady state. The fluctuations of the estimated write bandwidth will
directly become fluctuations in the write chunk sizes over time, as well
as in the application's throttled write speed in the future IO-less
balance_dirty_pages().
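
To make that concrete, below is a minimal user-space sketch of the
two-level smoothing (a standalone toy simulation, not the kernel patch;
HZ, the page size and the sample series are made-up values for
illustration):

/* bwsim.c - toy model of the two-level write bandwidth smoothing */
#include <stdio.h>

#define HZ      1000
#define PERIOD  4096                    /* roundup_pow_of_two(3 * HZ) */

static unsigned long write_bandwidth = 25600;  /* pages/s: 100MB/s at 4KB pages */
static unsigned long avg_write_bandwidth = 25600;

static void update_bandwidth(unsigned long written, unsigned long elapsed)
{
        unsigned long long bw = (unsigned long long)written * HZ;
        unsigned long old = write_bandwidth;
        unsigned long avg = avg_write_bandwidth;
        unsigned long cur;

        if (elapsed > PERIOD / 2) {     /* long quiet time: cap the weight */
                bw /= elapsed;          /* of the fresh number at 50% */
                elapsed = PERIOD / 2;
                bw *= elapsed;
        }
        bw += (unsigned long long)old * (PERIOD - elapsed);
        cur = bw / PERIOD;

        /*
         * Second level: move @avg by 1/8 toward @old only when @old lies
         * strictly between @avg and @cur, so that a lone spike in @cur
         * cannot drag @avg around.
         */
        if (avg > old && old > cur)
                avg -= (avg - old) >> 3;
        if (avg < old && old < cur)
                avg += (old - avg) >> 3;

        write_bandwidth = cur;
        avg_write_bandwidth = avg;
}

int main(void)
{
        /* fake 200ms samples: steady 10MB/s writeout plus one big spike */
        unsigned long pages[] = { 512, 512, 512, 512, 4096, 512, 512, 512 };
        unsigned int i;

        for (i = 0; i < sizeof(pages) / sizeof(pages[0]); i++) {
                update_bandwidth(pages[i], HZ / 5);
                printf("cur=%5lu avg=%5lu pages/s\n",
                       write_bandwidth, avg_write_bandwidth);
        }
        return 0;
}

Running it shows @cur jumping at the spike while @avg stays put and
keeps gliding smoothly toward the steady-state value.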
 
> Some code related comments are below:
> 
> > --- linux-next.orig/fs/fs-writeback.c	2011-06-23 10:31:40.000000000 +0800
> > +++ linux-next/fs/fs-writeback.c	2011-06-29 16:23:05.000000000 +0800
> > @@ -693,6 +693,16 @@ static inline bool over_bground_thresh(v
> >  }
> >  
> >  /*
> > + * Called under wb->list_lock. If there are multiple wb per bdi,
> > + * only the flusher working on the first wb should do it.
> > + */
> > +static void wb_update_bandwidth(struct bdi_writeback *wb,
> > +				unsigned long start_time)
> > +{
> > +	__bdi_update_bandwidth(wb->bdi, start_time);
> > +}
> > +
> > +/*
> >   * Explicit flushing or periodic writeback of "old" data.
> >   *
> >   * Define "old": the first time one of an inode's pages is dirtied, we mark the
> > @@ -710,6 +720,7 @@ static inline bool over_bground_thresh(v
> >  static long wb_writeback(struct bdi_writeback *wb,
> >  			 struct wb_writeback_work *work)
> >  {
> > +	unsigned long wb_start = jiffies;
> >  	long nr_pages = work->nr_pages;
> >  	unsigned long oldest_jif;
> >  	struct inode *inode;
> > @@ -758,6 +769,8 @@ static long wb_writeback(struct bdi_writ
> >  			progress = __writeback_inodes_wb(wb, work);
> >  		trace_writeback_written(wb->bdi, work);
> >  
> > +		wb_update_bandwidth(wb, wb_start);
> > +
> >  		/*
> >  		 * Did we write something? Try for more
> >  		 *
> > --- linux-next.orig/include/linux/backing-dev.h	2011-06-23 10:31:40.000000000 +0800
> > +++ linux-next/include/linux/backing-dev.h	2011-06-29 16:23:05.000000000 +0800
> > @@ -73,6 +73,11 @@ struct backing_dev_info {
> >  
> >  	struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
> >  
> > +	unsigned long bw_time_stamp;
> > +	unsigned long written_stamp;
> > +	unsigned long write_bandwidth;
> > +	unsigned long avg_write_bandwidth;
> > +
>   Some comments on what these are would be useful.

Sure:

        unsigned long bw_time_stamp;    /* last time write bw is updated */
        unsigned long written_stamp;    /* pages written at bw_time_stamp */
        unsigned long write_bandwidth;  /* the estimated write bandwidth */
        unsigned long avg_write_bandwidth; /* further smoothed write bw */

> >  	struct prop_local_percpu completions;
> >  	int dirty_exceeded;
> >  
> > --- linux-next.orig/mm/page-writeback.c	2011-06-23 10:31:40.000000000 +0800
> > +++ linux-next/mm/page-writeback.c	2011-06-29 16:23:44.000000000 +0800
> > @@ -37,6 +37,11 @@
> >  #include <trace/events/writeback.h>
> >  
> >  /*
> > + * Sleep at most 200ms at a time in balance_dirty_pages().
> > + */
> > +#define MAX_PAUSE	max(HZ/5, 1)
>   You use this for ratelimiting updates of bdi bandwidth. So we should
> really have a separate (and properly commented) constant for it.

Yeah, how about this?

/*
 * Estimate write bandwidth at 200ms intervals.
 */
#define BANDWIDTH_INTERVAL      max(HZ/5, 1)

> > +static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
> > +				       unsigned long elapsed,
> > +				       unsigned long written)
> > +{
> > +	const unsigned long period = roundup_pow_of_two(3 * HZ);
> > +	unsigned long avg = bdi->avg_write_bandwidth;
> > +	unsigned long old = bdi->write_bandwidth;
> > +	unsigned long cur;
> > +	u64 bw;
> > +
> > +	bw = written - bdi->written_stamp;
> > +	bw *= HZ;
> > +	if (unlikely(elapsed > period / 2)) {
> > +		do_div(bw, elapsed);
> > +		elapsed = period / 2;
> > +		bw *= elapsed;
> > +	}
>   What is the rationale behind this condition? Why can't we just use
> written/elapsed when we have a long enough period? Have you observed some

Yeah, in the normal case the formula is

(written * HZ / elapsed) * elapsed + bdi->write_bandwidth * (period - elapsed)
------------------------------------------------------------------------------
                                 period

and for elapsed times longer than half a period, the formula gives equal
weight to the new and the smoothed old bandwidth:

(written * HZ / elapsed) * (period/2) + bdi->write_bandwidth * (period/2)
------------------------------------------------------------------------------
                                 period

which normally happens when background writeout is going on without
any tasks being throttled (if throttled, they'll kick the bandwidth
updates), especially in the case of the initial write chunk for slow
devices.

For example, ->write_bandwidth is initialized to 100MB/s, resulting in
an initial 50MB write chunk size (in [PATCH 8/9]), which will take 5s
to complete when writing a large file to a 10MB/s USB stick.
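
To put numbers on that (an illustrative calculation assuming HZ=1000 and
4KB pages, so period = roundup_pow_of_two(3 * HZ) = 4096): the first
bandwidth update would then come after elapsed = 5000 jiffies with
written = 12800 pages, so

        bw  = (12800 * HZ / 5000) * (period/2) =  2560 * 2048
        bw += 25600 * (period - period/2)      = 25600 * 2048
        cur = bw / period = (2560 + 25600) / 2 = 14080 pages/s (~55MB/s)

i.e. one update only moves halfway from the initial 100MB/s toward the
real 10MB/s.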

In that case, we do need to quickly adapt to the real bandwidth. So
I'll take your approach. Thanks for pointing this out.

> problems with it (I estimate bandwidth in my patches like:
> static void update_bdi_throughput(struct backing_dev_info *bdi,
>                                 unsigned long written, unsigned long time)
> {
>        unsigned int deltams = jiffies_to_msecs(time - bdi->start_jiffies);
> 
>        written -= bdi->start_written;
>        if (deltams > WINDOW_MS) {
>                /* Add 1 to avoid 0 result */
>                bdi->pages_per_s = 1 + ((u64)written) * MSEC_PER_SEC / deltams;
>                return;
>        }
>        bdi->pages_per_s = 1 +
>                (((u64)bdi->pages_per_s) * (WINDOW_MS - deltams) +
>                 ((u64)written) * MSEC_PER_SEC) / WINDOW_MS;
> }
> 
> and I didn't observe any problems with it. Plus you need some comment
> explaining how you compute the throughput. I have something like:
> /*
>  * When the throughput is computed, we consider an imaginary WINDOW_MS
>  * milliseconds long window. In this window, we know that it took 'deltams'
>  * milliseconds to write 'written' pages, and for the rest of the window we
>  * assume the number of pages corresponding to the previously computed
>  * throughput to have been written. Thus we obtain the total number of
>  * pages written in the imaginary window, and from it the new throughput.
>  */

Oh, I thought the code was clear enough since it's the standard
running average technique... but let's write down the formula before
simplification:

        /*
         * bw = written * HZ / elapsed
         *
         *                   bw * elapsed + write_bandwidth * (period - elapsed)
         * write_bandwidth = ---------------------------------------------------
         *                                          period
         */
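
(Since period is rounded up to a power of two, the division by period in
the simplified code reduces to the bw >> ilog2(period) shift.)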

> > +	bw += (u64)bdi->write_bandwidth * (period - elapsed);
> > +	cur = bw >> ilog2(period);
> > +
> > +	/*
> > +	 * one more level of smoothing, for filtering out sudden spikes
> > +	 */
> > +	if (avg > old && old > cur)
> > +		avg -= (avg - old) >> 3;
> > +
> > +	if (avg < old && old < cur)
> > +		avg += (old - avg) >> 3;
> > +
> > +	bdi->write_bandwidth = cur;
> > +	bdi->avg_write_bandwidth = avg;
> > +}
> > +
> > +void __bdi_update_bandwidth(struct backing_dev_info *bdi,
> > +			    unsigned long start_time)
> > +{
> > +	unsigned long now = jiffies;
> > +	unsigned long elapsed = now - bdi->bw_time_stamp;
> > +	unsigned long written;
> > +
> > +	/*
> > +	 * rate-limit, only update once every 200ms.
> > +	 */
> > +	if (elapsed < MAX_PAUSE)
> > +		return;
> > +
> > +	written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]);
> > +
> > +	/*
> > +	 * Skip quiet periods when disk bandwidth is under-utilized.
> > +	 * (at least 1s idle time between two flusher runs)
> > +	 */
> > +	if (elapsed > HZ && time_before(bdi->bw_time_stamp, start_time))
> > +		goto snapshot;
>   Can't we just explicitly stamp the written_stamp and bw_time_stamp at
> the beginning of wb_writeback and be done with it? We wouldn't have to
> pass the time stamps, keep them in wb_writeback() and balance_dirty_pages()
> etc.

I'm afraid explicit stamping may unnecessarily disturb/invalidate valid
accounting periods. For example, if the flusher keeps working on tiny
works, each run would reset the stamps before 200ms have elapsed, which
would effectively disable bandwidth updates.

> > +
> > +	bdi_update_write_bandwidth(bdi, elapsed, written);
> > +
> > +snapshot:
> > +	bdi->written_stamp = written;
> > +	bdi->bw_time_stamp = now;
> > +}
> > +
> > +static void bdi_update_bandwidth(struct backing_dev_info *bdi,
> > +				 unsigned long start_time)
> > +{
> > +	if (jiffies - bdi->bw_time_stamp <= MAX_PAUSE + MAX_PAUSE / 10)
> > +		return;
> > +	if (spin_trylock(&bdi->wb.list_lock)) {
>   I'm not sure this is a good idea - list_lock could be quite busy, which
> could delay the bandwidth updates considerably. I'd just take it
> unconditionally.

OK, I'll do that. It's good to start out correct; we always have the
option to optimize later if it proves to be a hot spot.
 
Thanks,
Fengguang

> > +		__bdi_update_bandwidth(bdi, start_time);
> > +		spin_unlock(&bdi->wb.list_lock);
> > +	}
> > +}
> > +
> >  /*
> >   * balance_dirty_pages() must be called by processes which are generating dirty
> >   * data.  It looks at the number of dirty pages in the machine and will force
> > @@ -491,6 +569,7 @@ static void balance_dirty_pages(struct a
> >  	unsigned long pause = 1;
> >  	bool dirty_exceeded = false;
> >  	struct backing_dev_info *bdi = mapping->backing_dev_info;
> > +	unsigned long start_time = jiffies;
> >  
> >  	for (;;) {
> >  		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> > @@ -545,6 +624,8 @@ static void balance_dirty_pages(struct a
> >  		if (!bdi->dirty_exceeded)
> >  			bdi->dirty_exceeded = 1;
> >  
> > +		bdi_update_bandwidth(bdi, start_time);
> > +
> >  		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
> >  		 * Unstable writes are a feature of certain networked
> >  		 * filesystems (i.e. NFS) in which data may have been
> > 
> 
> 								Honza
> -- 
> Jan Kara <jack@...e.cz>
> SUSE Labs, CR
