[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CAPQyPG5JS+C-zHyXFW1EXMfRyRS=4oM2mijPL9dOn7=0ubD4Eg@mail.gmail.com>
Date: Sat, 12 Nov 2011 13:44:34 +0800
From: Nai Xia <nai.xia@...il.com>
To: Wu Fengguang <fengguang.wu@...el.com>
Cc: linux-fsdevel@...r.kernel.org,
Peter Zijlstra <a.p.zijlstra@...llo.nl>,
Jan Kara <jack@...e.cz>,
Andrew Morton <akpm@...ux-foundation.org>,
Christoph Hellwig <hch@....de>,
Dave Chinner <david@...morbit.com>,
Greg Thelen <gthelen@...gle.com>,
Minchan Kim <minchan.kim@...il.com>,
Vivek Goyal <vgoyal@...hat.com>,
Andrea Righi <arighi@...eler.com>,
linux-mm <linux-mm@...ck.org>,
LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 02/18] writeback: dirty position control
Hello Fengguang,
Is this the similar idea&algo behind TCP congestion control
since 2.6.19 ?
Same situation: Multiple tcp connections contending for network
bandwidth V.S. multiple process contending for BDI bandwidth.
Same solution: Per connection(v.s. process) speed control with cubic
algorithm controlled balancing.
:-)
Then the validness and efficiency in essence has been verified
in real world for years in another similar situation. Good to see we
are going to have it in write-back too!
Thanks,
Nai
On Sun, Sep 4, 2011 at 9:53 AM, Wu Fengguang <fengguang.wu@...el.com> wrote:
> bdi_position_ratio() provides a scale factor to bdi->dirty_ratelimit, so
> that the resulted task rate limit can drive the dirty pages back to the
> global/bdi setpoints.
>
> Old scheme is,
> |
> free run area | throttle area
> ----------------------------------------+---------------------------->
> thresh^ dirty pages
>
> New scheme is,
>
> ^ task rate limit
> |
> | *
> | *
> | *
> |[free run] * [smooth throttled]
> | *
> | *
> | *
> ..bdi->dirty_ratelimit..........*
> | . *
> | . *
> | . *
> | . *
> | . *
> +-------------------------------.-----------------------*------------>
> setpoint^ limit^ dirty pages
>
> The slope of the bdi control line should be
>
> 1) large enough to pull the dirty pages to setpoint reasonably fast
>
> 2) small enough to avoid big fluctuations in the resulted pos_ratio and
> hence task ratelimit
>
> Since the fluctuation range of the bdi dirty pages is typically observed
> to be within 1-second worth of data, the bdi control line's slope is
> selected to be a linear function of bdi write bandwidth, so that it can
> adapt to slow/fast storage devices well.
>
> Assume the bdi control line
>
> pos_ratio = 1.0 + k * (dirty - bdi_setpoint)
>
> where k is the negative slope.
>
> If targeting for 12.5% fluctuation range in pos_ratio when dirty pages
> are fluctuating in range
>
> [bdi_setpoint - write_bw/2, bdi_setpoint + write_bw/2],
>
> we get slope
>
> k = - 1 / (8 * write_bw)
>
> Let pos_ratio(x_intercept) = 0, we get the parameter used in code:
>
> x_intercept = bdi_setpoint + 8 * write_bw
>
> The global/bdi slopes are nicely complementing each other when the
> system has only one major bdi (indicated by bdi_thresh ~= thresh):
>
> 1) slope of global control line => scaling to the control scope size
> 2) slope of main bdi control line => scaling to the writeout bandwidth
>
> so that
>
> - in memory tight systems, (1) becomes strong enough to squeeze dirty
> pages inside the control scope
>
> - in large memory systems where the "gravity" of (1) for pulling the
> dirty pages to setpoint is too weak, (2) can back (1) up and drive
> dirty pages to bdi_setpoint ~= setpoint reasonably fast.
>
> Unfortunately in JBOD setups, the fluctuation range of bdi threshold
> is related to memory size due to the interferences between disks. In
> this case, the bdi slope will be weighted sum of write_bw and bdi_thresh.
>
> Given equations
>
> span = x_intercept - bdi_setpoint
> k = df/dx = - 1 / span
>
> and the extremum values
>
> span = bdi_thresh
> dx = bdi_thresh
>
> we get
>
> df = - dx / span = - 1.0
>
> That means, when bdi_dirty deviates bdi_thresh up, pos_ratio and hence
> task ratelimit will fluctuate by -100%.
>
> peter: use 3rd order polynomial for the global control line
>
> CC: Peter Zijlstra <a.p.zijlstra@...llo.nl>
> Acked-by: Jan Kara <jack@...e.cz>
> Signed-off-by: Wu Fengguang <fengguang.wu@...el.com>
> ---
> fs/fs-writeback.c | 2
> include/linux/writeback.h | 1
> mm/page-writeback.c | 213 +++++++++++++++++++++++++++++++++++-
> 3 files changed, 210 insertions(+), 6 deletions(-)
>
> --- linux-next.orig/mm/page-writeback.c 2011-08-26 15:57:18.000000000 +0800
> +++ linux-next/mm/page-writeback.c 2011-08-26 15:57:34.000000000 +0800
> @@ -46,6 +46,8 @@
> */
> #define BANDWIDTH_INTERVAL max(HZ/5, 1)
>
> +#define RATELIMIT_CALC_SHIFT 10
> +
> /*
> * After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited
> * will look to see if it needs to force writeback or throttling.
> @@ -409,6 +411,12 @@ int bdi_set_max_ratio(struct backing_dev
> }
> EXPORT_SYMBOL(bdi_set_max_ratio);
>
> +static unsigned long dirty_freerun_ceiling(unsigned long thresh,
> + unsigned long bg_thresh)
> +{
> + return (thresh + bg_thresh) / 2;
> +}
> +
> static unsigned long hard_dirty_limit(unsigned long thresh)
> {
> return max(thresh, global_dirty_limit);
> @@ -493,6 +501,197 @@ unsigned long bdi_dirty_limit(struct bac
> return bdi_dirty;
> }
>
> +/*
> + * Dirty position control.
> + *
> + * (o) global/bdi setpoints
> + *
> + * We want the dirty pages be balanced around the global/bdi setpoints.
> + * When the number of dirty pages is higher/lower than the setpoint, the
> + * dirty position control ratio (and hence task dirty ratelimit) will be
> + * decreased/increased to bring the dirty pages back to the setpoint.
> + *
> + * pos_ratio = 1 << RATELIMIT_CALC_SHIFT
> + *
> + * if (dirty < setpoint) scale up pos_ratio
> + * if (dirty > setpoint) scale down pos_ratio
> + *
> + * if (bdi_dirty < bdi_setpoint) scale up pos_ratio
> + * if (bdi_dirty > bdi_setpoint) scale down pos_ratio
> + *
> + * task_ratelimit = dirty_ratelimit * pos_ratio >> RATELIMIT_CALC_SHIFT
> + *
> + * (o) global control line
> + *
> + * ^ pos_ratio
> + * |
> + * | |<===== global dirty control scope ======>|
> + * 2.0 .............*
> + * | .*
> + * | . *
> + * | . *
> + * | . *
> + * | . *
> + * | . *
> + * 1.0 ................................*
> + * | . . *
> + * | . . *
> + * | . . *
> + * | . . *
> + * | . . *
> + * 0 +------------.------------------.----------------------*------------->
> + * freerun^ setpoint^ limit^ dirty pages
> + *
> + * (o) bdi control lines
> + *
> + * The control lines for the global/bdi setpoints both stretch up to @limit.
> + * The below figure illustrates the main bdi control line with an auxiliary
> + * line extending it to @limit.
> + *
> + * o
> + * o
> + * o [o] main control line
> + * o [*] auxiliary control line
> + * o
> + * o
> + * o
> + * o
> + * o
> + * o
> + * o--------------------- balance point, rate scale = 1
> + * | o
> + * | o
> + * | o
> + * | o
> + * | o
> + * | o
> + * | o------- connect point, rate scale = 1/2
> + * | .*
> + * | . *
> + * | . *
> + * | . *
> + * | . *
> + * | . *
> + * | . *
> + * [--------------------+-----------------------------.--------------------*]
> + * 0 bdi_setpoint x_intercept limit
> + *
> + * The auxiliary control line allows smoothly throttling bdi_dirty down to
> + * normal if it starts high in situations like
> + * - start writing to a slow SD card and a fast disk at the same time. The SD
> + * card's bdi_dirty may rush to many times higher than bdi_setpoint.
> + * - the bdi dirty thresh drops quickly due to change of JBOD workload
> + */
> +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> + unsigned long thresh,
> + unsigned long bg_thresh,
> + unsigned long dirty,
> + unsigned long bdi_thresh,
> + unsigned long bdi_dirty)
> +{
> + unsigned long write_bw = bdi->avg_write_bandwidth;
> + unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
> + unsigned long limit = hard_dirty_limit(thresh);
> + unsigned long x_intercept;
> + unsigned long setpoint; /* dirty pages' target balance point */
> + unsigned long bdi_setpoint;
> + unsigned long span;
> + long long pos_ratio; /* for scaling up/down the rate limit */
> + long x;
> +
> + if (unlikely(dirty >= limit))
> + return 0;
> +
> + /*
> + * global setpoint
> + *
> + * setpoint - dirty 3
> + * f(dirty) := 1.0 + (----------------)
> + * limit - setpoint
> + *
> + * it's a 3rd order polynomial that subjects to
> + *
> + * (1) f(freerun) = 2.0 => rampup dirty_ratelimit reasonably fast
> + * (2) f(setpoint) = 1.0 => the balance point
> + * (3) f(limit) = 0 => the hard limit
> + * (4) df/dx <= 0 => negative feedback control
> + * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
> + * => fast response on large errors; small oscillation near setpoint
> + */
> + setpoint = (freerun + limit) / 2;
> + x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
> + limit - setpoint + 1);
> + pos_ratio = x;
> + pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> + pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> + pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
> +
> + /*
> + * We have computed basic pos_ratio above based on global situation. If
> + * the bdi is over/under its share of dirty pages, we want to scale
> + * pos_ratio further down/up. That is done by the following mechanism.
> + */
> +
> + /*
> + * bdi setpoint
> + *
> + * f(bdi_dirty) := 1.0 + k * (bdi_dirty - bdi_setpoint)
> + *
> + * x_intercept - bdi_dirty
> + * := --------------------------
> + * x_intercept - bdi_setpoint
> + *
> + * The main bdi control line is a linear function that subjects to
> + *
> + * (1) f(bdi_setpoint) = 1.0
> + * (2) k = - 1 / (8 * write_bw) (in single bdi case)
> + * or equally: x_intercept = bdi_setpoint + 8 * write_bw
> + *
> + * For single bdi case, the dirty pages are observed to fluctuate
> + * regularly within range
> + * [bdi_setpoint - write_bw/2, bdi_setpoint + write_bw/2]
> + * for various filesystems, where (2) can yield in a reasonable 12.5%
> + * fluctuation range for pos_ratio.
> + *
> + * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
> + * own size, so move the slope over accordingly and choose a slope that
> + * yields 100% pos_ratio fluctuation on suddenly doubled bdi_thresh.
> + */
> + if (unlikely(bdi_thresh > thresh))
> + bdi_thresh = thresh;
> + /*
> + * scale global setpoint to bdi's:
> + * bdi_setpoint = setpoint * bdi_thresh / thresh
> + */
> + x = div_u64((u64)bdi_thresh << 16, thresh + 1);
> + bdi_setpoint = setpoint * (u64)x >> 16;
> + /*
> + * Use span=(8*write_bw) in single bdi case as indicated by
> + * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case.
> + *
> + * bdi_thresh thresh - bdi_thresh
> + * span = ---------- * (8 * write_bw) + ------------------- * bdi_thresh
> + * thresh thresh
> + */
> + span = (thresh - bdi_thresh + 8 * write_bw) * (u64)x >> 16;
> + x_intercept = bdi_setpoint + span;
> +
> + span >>= 1;
> + if (unlikely(bdi_dirty > bdi_setpoint + span)) {
> + if (unlikely(bdi_dirty > limit))
> + return 0;
> + if (x_intercept < limit) {
> + x_intercept = limit; /* auxiliary control line */
> + bdi_setpoint += span;
> + pos_ratio >>= 1;
> + }
> + }
> + pos_ratio *= x_intercept - bdi_dirty;
> + do_div(pos_ratio, x_intercept - bdi_setpoint + 1);
> +
> + return pos_ratio;
> +}
> +
> static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
> unsigned long elapsed,
> unsigned long written)
> @@ -591,6 +790,7 @@ static void global_update_bandwidth(unsi
>
> void __bdi_update_bandwidth(struct backing_dev_info *bdi,
> unsigned long thresh,
> + unsigned long bg_thresh,
> unsigned long dirty,
> unsigned long bdi_thresh,
> unsigned long bdi_dirty,
> @@ -627,6 +827,7 @@ snapshot:
>
> static void bdi_update_bandwidth(struct backing_dev_info *bdi,
> unsigned long thresh,
> + unsigned long bg_thresh,
> unsigned long dirty,
> unsigned long bdi_thresh,
> unsigned long bdi_dirty,
> @@ -635,8 +836,8 @@ static void bdi_update_bandwidth(struct
> if (time_is_after_eq_jiffies(bdi->bw_time_stamp + BANDWIDTH_INTERVAL))
> return;
> spin_lock(&bdi->wb.list_lock);
> - __bdi_update_bandwidth(bdi, thresh, dirty, bdi_thresh, bdi_dirty,
> - start_time);
> + __bdi_update_bandwidth(bdi, thresh, bg_thresh, dirty,
> + bdi_thresh, bdi_dirty, start_time);
> spin_unlock(&bdi->wb.list_lock);
> }
>
> @@ -677,7 +878,8 @@ static void balance_dirty_pages(struct a
> * catch-up. This avoids (excessively) small writeouts
> * when the bdi limits are ramping up.
> */
> - if (nr_dirty <= (background_thresh + dirty_thresh) / 2)
> + if (nr_dirty <= dirty_freerun_ceiling(dirty_thresh,
> + background_thresh))
> break;
>
> bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
> @@ -721,8 +923,9 @@ static void balance_dirty_pages(struct a
> if (!bdi->dirty_exceeded)
> bdi->dirty_exceeded = 1;
>
> - bdi_update_bandwidth(bdi, dirty_thresh, nr_dirty,
> - bdi_thresh, bdi_dirty, start_time);
> + bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
> + nr_dirty, bdi_thresh, bdi_dirty,
> + start_time);
>
> /* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
> * Unstable writes are a feature of certain networked
> --- linux-next.orig/fs/fs-writeback.c 2011-08-26 15:57:18.000000000 +0800
> +++ linux-next/fs/fs-writeback.c 2011-08-26 15:57:20.000000000 +0800
> @@ -675,7 +675,7 @@ static inline bool over_bground_thresh(v
> static void wb_update_bandwidth(struct bdi_writeback *wb,
> unsigned long start_time)
> {
> - __bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, start_time);
> + __bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, 0, start_time);
> }
>
> /*
> --- linux-next.orig/include/linux/writeback.h 2011-08-26 15:57:18.000000000 +0800
> +++ linux-next/include/linux/writeback.h 2011-08-26 15:57:20.000000000 +0800
> @@ -141,6 +141,7 @@ unsigned long bdi_dirty_limit(struct bac
>
> void __bdi_update_bandwidth(struct backing_dev_info *bdi,
> unsigned long thresh,
> + unsigned long bg_thresh,
> unsigned long dirty,
> unsigned long bdi_thresh,
> unsigned long bdi_dirty,
>
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@...ck.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
> Don't email: <a href=mailto:"dont@...ck.org"> email@...ck.org </a>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists