[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20120920084422.GA5697@localhost>
Date: Thu, 20 Sep 2012 16:44:22 +0800
From: Fengguang Wu <fengguang.wu@...el.com>
To: Namjae Jeon <linkinjeon@...il.com>
Cc: Jan Kara <jack@...e.cz>, Dave Chinner <david@...morbit.com>,
linux-kernel@...r.kernel.org,
Namjae Jeon <namjae.jeon@...sung.com>,
Vivek Trivedi <t.vivek@...sung.com>,
Linux Memory Management List <linux-mm@...ck.org>,
linux-fsdevel@...r.kernel.org
Subject: Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi
variable
[ CC FS and MM lists ]
Patch looks good to me, however we need to be careful because it's
introducing a new interface. So it's desirable to get some acks from
the FS/MM developers.
Thanks,
Fengguang
On Sun, Sep 16, 2012 at 08:25:42AM -0400, Namjae Jeon wrote:
> From: Namjae Jeon <namjae.jeon@...sung.com>
>
> This patch is based on suggestion by Wu Fengguang:
> https://lkml.org/lkml/2011/8/19/19
>
> kernel has mechanism to do writeback as per dirty_ratio and dirty_background
> ratio. It also maintains per task dirty rate limit to keep balance of
> dirty pages at any given instance by doing bdi bandwidth estimation.
>
> Kernel also has max_ratio/min_ratio tunables to specify percentage of
> writecache to control per bdi dirty limits and task throttling.
>
> However, there might be a usecase where user wants a per bdi writeback tuning
> parameter to flush dirty data once per bdi dirty data reach a threshold
> especially at NFS server.
>
> dirty_background_centisecs provides an interface where user can tune
> background writeback start threshold using
> /sys/block/sda/bdi/dirty_background_centisecs
>
> dirty_background_centisecs is used alongwith average bdi write bandwidth
> estimation to start background writeback.
>
> One of the use case to demonstrate the patch functionality can be
> on NFS setup:-
> We have a NFS setup with ethernet line of 100Mbps, while the USB
> disk is attached to server, which has a local speed of 25MBps. Server
> and client both are arm target boards.
>
> Now if we perform a write operation over NFS (client to server), as
> per the network speed, data can travel at max speed of 100Mbps. But
> if we check the default write speed of USB hdd over NFS it comes
> around to 8MB/sec, far below the speed of network.
>
> Reason being is as per the NFS logic, during write operation, initially
> pages are dirtied on NFS client side, then after reaching the dirty
> threshold/writeback limit (or in case of sync) data is actually sent
> to NFS server (so now again pages are dirtied on server side). This
> will be done in COMMIT call from client to server i.e if 100MB of data
> is dirtied and sent then it will take minimum 100MB/10Mbps ~ 8-9 seconds.
>
> After the data is received, now it will take approx 100/25 ~4 Seconds to
> write the data to USB Hdd on server side. Hence making the overall time
> to write this much of data ~12 seconds, which in practically comes out to
> be near 7 to 8MB/second. After this a COMMIT response will be sent to NFS
> client.
>
> However we may improve this write performace by making the use of NFS
> server idle time i.e while data is being received from the client,
> simultaneously initiate the writeback thread on server side. So instead
> of waiting for the complete data to come and then start the writeback,
> we can work in parallel while the network is still busy in receiving the
> data. Hence in this way overall performace will be improved.
>
> If we tune dirty_background_centisecs, we can see there
> is increase in the performace and it comes out to be ~ 11MB/seconds.
> Results are:-
>
> Write test(create a 1 GB file) result at 'NFS client' after changing
> /sys/block/sda/bdi/dirty_background_centisecs
> on *** NFS Server only - not on NFS Client ****
>
> ---------------------------------------------------------------------
> |WRITE Test with various 'dirty_background_centisecs' at NFS Server |
> ---------------------------------------------------------------------
> | | default = 0 | 300 centisec| 200 centisec| 100 centisec |
> ---------------------------------------------------------------------
> |RecSize | WriteSpeed | WriteSpeed | WriteSpeed | WriteSpeed |
> ---------------------------------------------------------------------
> |10485760 | 8.44MB/sec | 8.60MB/sec | 9.30MB/sec | 10.27MB/sec |
> | 1048576 | 8.48MB/sec | 8.87MB/sec | 9.31MB/sec | 10.34MB/sec |
> | 524288 | 8.37MB/sec | 8.42MB/sec | 9.84MB/sec | 10.47MB/sec |
> | 262144 | 8.16MB/sec | 8.51MB/sec | 9.52MB/sec | 10.62MB/sec |
> | 131072 | 8.48MB/sec | 8.81MB/sec | 9.42MB/sec | 10.55MB/sec |
> | 65536 | 8.38MB/sec | 9.09MB/sec | 9.76MB/sec | 10.53MB/sec |
> | 32768 | 8.65MB/sec | 9.00MB/sec | 9.57MB/sec | 10.54MB/sec |
> | 16384 | 8.27MB/sec | 8.80MB/sec | 9.39MB/sec | 10.43MB/sec |
> | 8192 | 8.52MB/sec | 8.70MB/sec | 9.40MB/sec | 10.50MB/sec |
> | 4096 | 8.20MB/sec | 8.63MB/sec | 9.80MB/sec | 10.35MB/sec |
> ---------------------------------------------------------------------
>
> we can see, average write speed is increased to ~10-11MB/sec.
> ============================================================
>
> This patch provides the changes per block devices. So that we may modify the
> dirty_background_centisecs as per the device and overall system is not impacted
> by the changes and we get improved perforamace in certain use cases.
>
> NOTE: dirty_background_centisecs is used alongwith average bdi write bandwidth
> estimation to start background writeback. But, bdi write bandwidth estimation
> is an _estimation_ and may become wildly wrong. dirty_background_centisecs
> tuning may not always work to the user expectations. dirty_background_centisecs
> will require careful tuning by users on NFS Server.
> As a good use case, dirty_background_time should be set around 100 (1 sec).
> It should not be set to very small value, otherwise it will start
> flushing for small I/O size dirty data.
>
> Changes since v1:
> * make default value of 'dirty_background_centisecs = 0' sothat there is no change
> in default writeback behaviour.
> * Add description of dirty_background_centisecs in documentation.
>
> Original-patch-by: Wu Fengguang <fengguang.wu@...el.com>
> Signed-off-by: Namjae Jeon <namjae.jeon@...sung.com>
> Signed-off-by: Vivek Trivedi <t.vivek@...sung.com>
> ---
> fs/fs-writeback.c | 21 +++++++++++++++++++--
> include/linux/backing-dev.h | 1 +
> include/linux/writeback.h | 1 +
> mm/backing-dev.c | 21 +++++++++++++++++++++
> mm/page-writeback.c | 3 ++-
> 5 files changed, 44 insertions(+), 3 deletions(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index fd255c0..c427130 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -769,6 +769,22 @@ static bool over_bground_thresh(struct backing_dev_info *bdi)
> return false;
> }
>
> +bool over_dirty_bground_time(struct backing_dev_info *bdi)
> +{
> + unsigned long background_thresh;
> +
> + if (!bdi->dirty_background_centisecs)
> + return false;
> +
> + background_thresh = bdi->avg_write_bandwidth *
> + bdi->dirty_background_centisecs / 100;
> +
> + if (bdi_stat(bdi, BDI_RECLAIMABLE) > background_thresh)
> + return true;
> +
> + return false;
> +}
> +
> /*
> * Called under wb->list_lock. If there are multiple wb per bdi,
> * only the flusher working on the first wb should do it.
> @@ -828,7 +844,8 @@ static long wb_writeback(struct bdi_writeback *wb,
> * For background writeout, stop when we are below the
> * background dirty threshold
> */
> - if (work->for_background && !over_bground_thresh(wb->bdi))
> + if (work->for_background && !over_bground_thresh(wb->bdi) &&
> + !over_dirty_bground_time(wb->bdi))
> break;
>
> /*
> @@ -920,7 +937,7 @@ static unsigned long get_nr_dirty_pages(void)
>
> static long wb_check_background_flush(struct bdi_writeback *wb)
> {
> - if (over_bground_thresh(wb->bdi)) {
> + if (over_bground_thresh(wb->bdi) || over_dirty_bground_time(wb->bdi)) {
>
> struct wb_writeback_work work = {
> .nr_pages = LONG_MAX,
> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index 2a9a9ab..43d2e15 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -95,6 +95,7 @@ struct backing_dev_info {
>
> unsigned int min_ratio;
> unsigned int max_ratio, max_prop_frac;
> + unsigned int dirty_background_centisecs;
>
> struct bdi_writeback wb; /* default writeback info for this bdi */
> spinlock_t wb_lock; /* protects work_list */
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index 50c3e8f..6dc2abe 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -96,6 +96,7 @@ long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages,
> long wb_do_writeback(struct bdi_writeback *wb, int force_wait);
> void wakeup_flusher_threads(long nr_pages, enum wb_reason reason);
> void inode_wait_for_writeback(struct inode *inode);
> +bool over_dirty_bground_time(struct backing_dev_info *bdi);
>
> /* writeback.h requires fs.h; it, too, is not included from here. */
> static inline void wait_on_inode(struct inode *inode)
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index d3ca2b3..b1b2fd2 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -221,12 +221,32 @@ static ssize_t max_ratio_store(struct device *dev,
> }
> BDI_SHOW(max_ratio, bdi->max_ratio)
>
> +static ssize_t dirty_background_centisecs_store(struct device *dev,
> + struct device_attribute *attr, const char *buf, size_t count)
> +{
> + struct backing_dev_info *bdi = dev_get_drvdata(dev);
> + unsigned int centisecs;
> + ssize_t ret;
> +
> + ret = kstrtouint(buf, 10, ¢isecs);
> + if (ret < 0)
> + return ret;
> +
> + bdi->dirty_background_centisecs = centisecs;
> + if (over_dirty_bground_time(bdi))
> + bdi_start_background_writeback(bdi);
> +
> + return count;
> +}
> +BDI_SHOW(dirty_background_centisecs, bdi->dirty_background_centisecs)
> +
> #define __ATTR_RW(attr) __ATTR(attr, 0644, attr##_show, attr##_store)
>
> static struct device_attribute bdi_dev_attrs[] = {
> __ATTR_RW(read_ahead_kb),
> __ATTR_RW(min_ratio),
> __ATTR_RW(max_ratio),
> + __ATTR_RW(dirty_background_centisecs),
> __ATTR_NULL,
> };
>
> @@ -628,6 +648,7 @@ int bdi_init(struct backing_dev_info *bdi)
> bdi->min_ratio = 0;
> bdi->max_ratio = 100;
> bdi->max_prop_frac = FPROP_FRAC_BASE;
> + bdi->dirty_background_centisecs = 0;
> spin_lock_init(&bdi->wb_lock);
> INIT_LIST_HEAD(&bdi->bdi_list);
> INIT_LIST_HEAD(&bdi->work_list);
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 5ad5ce2..8c1530d 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1403,7 +1403,8 @@ pause:
> if (laptop_mode)
> return;
>
> - if (nr_reclaimable > background_thresh)
> + if (nr_reclaimable > background_thresh ||
> + over_dirty_bground_time(bdi))
> bdi_start_background_writeback(bdi);
> }
>
> --
> 1.7.9.5
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@...r.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists