[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAFPAmTQ3jN8RF5-7E92AoGAGMz5H0GrPxkgJ0O6u_MViGC6KnQ@mail.gmail.com>
Date: Thu, 18 Aug 2011 16:58:55 +0530
From: Kautuk Consul <consul.kautuk@...il.com>
To: Wu Fengguang <fengguang.wu@...el.com>
Cc: Mel Gorman <mel@...ux.vnet.ibm.com>,
KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>,
linux-mm@...ck.org, linux-kernel@...r.kernel.org,
linux-fsdevel@...r.kernel.org
Subject: Re: [PATCH] writeback: Per-block device bdi->dirty_writeback_interval
and bdi->dirty_expire_interval.
Hi Wu,
Thanks for responding.
Please find my comments inline in your email below.
On Thu, Aug 18, 2011 at 3:18 PM, Wu Fengguang <fengguang.wu@...el.com> wrote:
> Hi Kautuk,
>
> Add CC to fsdevel and Mel and KOSAKI.
>
> When submitting patches you can find the relevant mailing list and
> developers to CC with this command under the kernel source tree:
>
> scripts/get_maintainer.pl YOUR-PATCH-FILE
>
> On Thu, Aug 11, 2011 at 05:50:56PM +0530, Kautuk Consul wrote:
>> Hi,
>>
>> Currently the /proc/sys/vm/dirty_writeback_centisecs and
>> /proc/sys/vm/dirty_expire_centisecs values are
>> global to the system.
>> All the BDI flush-* threads are controlled by these central values.
>
> Yes.
>
>> However, the user/admin might want to set different writeback speeds
>> for different block devices based on
>> their page write-back performance.
>
> How can the above two sysctl values impact "writeback speeds"?
> In particular, what's the "speed" you mean?
>
By writeback speed, I meant writeback interval, i.e. the maximum
interval after which the BDI
thread for a particular block device can wake up and try to sync pages
with disk.
>> For example, the user might want to write-back pages in smaller
>> intervals to a block device which has a
>> faster known writeback speed.
>
> That's not a complete rational. What does the user ultimately want by
> setting a smaller interval? What would be the problems to the other
> slow devices if the user does so by simply setting a small value
> _globally_?
>
I think that the user might want to set a smaller interval for faster block
devices so that the dirty pages are synced with that block device/disk sooner.
This will unset the dirty bit of the page-cache pages sooner, which
will increase the
possibility of those pages getting reclaimed quickly in high memory
usage scenarios.
For a system that writes to disk very frequently and runs a lot of
memory intensive user-mode
applications, this might be crucial for their performance as they
would possibly have to sleep
comparitively lesser during page allocation.
For example, an server handling a database needs frequent disk access
as well as
anonymous memory. In such a case it would be nice to keep the
write-back interval for a USB pen
drive BDI thread as more than that of a SATA/SCSI disk.
> We need strong use cases for doing such user interface changes.
> Would you detail the problem and the pains that can only (or best)
> be addressed by this patch?
>
Overall, I think that ever since there have been different BDI threads
for different block devices,
it seems quite rational to provide the user an option to set different
writeback intervals to different
block devices due to the reasons/examples I have mentioned above.
I do not fully theoretically understand the way your patches are
controlling the dirty rate and estimating
the future bandwidth.
But, when I looked through them I did not see any place where the
writeback interval for a BDI was being
changed.
So, I felt that my patch was more like an additional feature for the
user rather than a conflict with your
writeback patches.
> Thanks,
> Fengguang
>
>> This patch creates 3 new counters (in centisecs) for all the BDI
>> threads that were controlled centrally by these
>> 2 counters:
>> i) /sys/block/<block_dev>/bdi/dirty_writeback_interval,
>> ii) /sys/block/<block_dev>/bdi/dirty_expire_interval,
>> iii) /proc/sys/vm/sync_supers_centisecs.
>>
>> Although these new counters can be tuned individually, I have taken
>> care that they be centrally reset by changes
>> to the /proc/sys/vm/dirty_expire_centisecs and
>> /proc/sys/vm/dirty_writeback_centisecs so that the earlier
>> functionality is not broken by distributions using these central values.
>> After resetting all values centrally, these values can be tuned
>> individually without altering the central values.
>>
>> Signed-off-by: Kautuk Consul <consul.kautuk@...il.com>
>> ---
>>
>> diff -uprN a/fs/fs-writeback.c b/fs/fs-writeback.c
>> --- a/fs/fs-writeback.c 2011-08-05 10:29:21.000000000 +0530
>> +++ b/fs/fs-writeback.c 2011-08-09 09:15:37.093041675 +0530
>> @@ -638,8 +638,8 @@ static inline bool over_bground_thresh(v
>> * just walks the superblock inode list, writing back any inodes which are
>> * older than a specific point in time.
>> *
>> - * Try to run once per dirty_writeback_interval. But if a writeback event
>> - * takes longer than a dirty_writeback_interval interval, then leave a
>> + * Try to run once per bdi->dirty_writeback_interval. But if a writeback event
>> + * takes longer than a bdi->dirty_writeback_interval interval, then leave a
>> * one-second gap.
>> *
>> * older_than_this takes precedence over nr_to_write. So we'll only write back
>> @@ -663,7 +663,7 @@ static long wb_writeback(struct bdi_writ
>> if (wbc.for_kupdate) {
>> wbc.older_than_this = &oldest_jif;
>> oldest_jif = jiffies -
>> - msecs_to_jiffies(dirty_expire_interval * 10);
>> + msecs_to_jiffies(wb->bdi->dirty_expire_interval * 10);
>> }
>> if (!wbc.range_cyclic) {
>> wbc.range_start = 0;
>> @@ -811,15 +811,16 @@ static long wb_check_old_data_flush(stru
>> {
>> unsigned long expired;
>> long nr_pages;
>> + struct backing_dev_info *bdi = wb->bdi;
>>
>> /*
>> * When set to zero, disable periodic writeback
>> */
>> - if (!dirty_writeback_interval)
>> + if (!bdi->dirty_writeback_interval)
>> return 0;
>>
>> expired = wb->last_old_flush +
>> - msecs_to_jiffies(dirty_writeback_interval * 10);
>> + msecs_to_jiffies(bdi->dirty_writeback_interval * 10);
>> if (time_before(jiffies, expired))
>> return 0;
>>
>> @@ -923,8 +924,8 @@ int bdi_writeback_thread(void *data)
>> continue;
>> }
>>
>> - if (wb_has_dirty_io(wb) && dirty_writeback_interval)
>> - schedule_timeout(msecs_to_jiffies(dirty_writeback_interval * 10));
>> + if (wb_has_dirty_io(wb) && bdi->dirty_writeback_interval)
>> + schedule_timeout(msecs_to_jiffies(bdi->dirty_writeback_interval * 10));
>> else {
>> /*
>> * We have nothing to do, so can go sleep without any
>> diff -uprN a/include/linux/backing-dev.h b/include/linux/backing-dev.h
>> --- a/include/linux/backing-dev.h 2011-08-05 10:29:21.000000000 +0530
>> +++ b/include/linux/backing-dev.h 2011-08-09 09:15:37.094041619 +0530
>> @@ -76,6 +76,8 @@ struct backing_dev_info {
>>
>> unsigned int min_ratio;
>> unsigned int max_ratio, max_prop_frac;
>> + unsigned int dirty_writeback_interval;
>> + unsigned int dirty_expire_interval;
>>
>> struct bdi_writeback wb; /* default writeback info for this bdi */
>> spinlock_t wb_lock; /* protects work_list */
>> @@ -333,4 +335,5 @@ static inline int bdi_sched_wait(void *w
>> return 0;
>> }
>>
>> +extern unsigned int shortest_dirty_writeback_interval;
>> #endif /* _LINUX_BACKING_DEV_H */
>> diff -uprN a/include/linux/writeback.h b/include/linux/writeback.h
>> --- a/include/linux/writeback.h 2011-08-05 10:29:21.000000000 +0530
>> +++ b/include/linux/writeback.h 2011-08-09 10:09:23.581268260 +0530
>> @@ -100,6 +100,7 @@ extern unsigned long dirty_background_by
>> extern int vm_dirty_ratio;
>> extern unsigned long vm_dirty_bytes;
>> extern unsigned int dirty_writeback_interval;
>> +extern unsigned int sync_supers_interval;
>> extern unsigned int dirty_expire_interval;
>> extern int vm_highmem_is_dirtyable;
>> extern int block_dump;
>> @@ -123,6 +124,10 @@ extern int dirty_bytes_handler(struct ct
>> struct ctl_table;
>> int dirty_writeback_centisecs_handler(struct ctl_table *, int,
>> void __user *, size_t *, loff_t *);
>> +int sync_supers_centisecs_handler(struct ctl_table *, int,
>> + void __user *, size_t *, loff_t *);
>> +int dirty_expire_centisecs_handler(struct ctl_table *, int,
>> + void __user *, size_t *, loff_t *);
>>
>> void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty);
>> unsigned long bdi_dirty_limit(struct backing_dev_info *bdi,
>> diff -uprN a/kernel/sysctl.c b/kernel/sysctl.c
>> --- a/kernel/sysctl.c 2011-08-05 10:29:21.000000000 +0530
>> +++ b/kernel/sysctl.c 2011-08-09 12:39:43.453087554 +0530
>> @@ -1076,12 +1076,19 @@ static struct ctl_table vm_table[] = {
>> .mode = 0644,
>> .proc_handler = dirty_writeback_centisecs_handler,
>> },
>> + {
>> + .procname = "sync_supers_centisecs",
>> + .data = &sync_supers_interval,
>> + .maxlen = sizeof(sync_supers_interval),
>> + .mode = 0644,
>> + .proc_handler = sync_supers_centisecs_handler,
>> + },
>> {
>> .procname = "dirty_expire_centisecs",
>> .data = &dirty_expire_interval,
>> .maxlen = sizeof(dirty_expire_interval),
>> .mode = 0644,
>> - .proc_handler = proc_dointvec_minmax,
>> + .proc_handler = dirty_expire_centisecs_handler,
>> .extra1 = &zero,
>> },
>> {
>> diff -uprN a/mm/backing-dev.c b/mm/backing-dev.c
>> --- a/mm/backing-dev.c 2011-08-05 10:29:21.000000000 +0530
>> +++ b/mm/backing-dev.c 2011-08-09 12:08:06.287079027 +0530
>> @@ -39,6 +39,10 @@ DEFINE_SPINLOCK(bdi_lock);
>> LIST_HEAD(bdi_list);
>> LIST_HEAD(bdi_pending_list);
>>
>> +/* Same value as the dirty_writeback_interval as this is what our
>> + * initial shortest_dirty_writeback_interval. */
>> +unsigned int shortest_dirty_writeback_interval = 5 * 100;
>> +
>> static struct task_struct *sync_supers_tsk;
>> static struct timer_list sync_supers_timer;
>>
>> @@ -204,12 +208,50 @@ static ssize_t max_ratio_store(struct de
>> }
>> BDI_SHOW(max_ratio, bdi->max_ratio)
>>
>> +static ssize_t dirty_writeback_interval_store(struct device *dev,
>> + struct device_attribute *attr, const char *buf, size_t count)
>> +{
>> + struct backing_dev_info *bdi = dev_get_drvdata(dev);
>> + char *end;
>> + unsigned int interval;
>> + ssize_t ret = -EINVAL;
>> +
>> + interval = simple_strtoul(buf, &end, 10);
>> + if (*buf && (end[0] == '\0' || (end[0] == '\n' && end[1] == '\0'))) {
>> + bdi->dirty_writeback_interval = interval;
>> + shortest_dirty_writeback_interval =
>> + min(shortest_dirty_writeback_interval,interval);
>> + ret = count;
>> + }
>> + return ret;
>> +}
>> +BDI_SHOW(dirty_writeback_interval, bdi->dirty_writeback_interval)
>> +
>> +static ssize_t dirty_expire_interval_store (struct device *dev,
>> + struct device_attribute *attr, const char *buf, size_t count)
>> +{
>> + struct backing_dev_info *bdi = dev_get_drvdata(dev);
>> + char *end;
>> + unsigned int interval;
>> + ssize_t ret = -EINVAL;
>> +
>> + interval = simple_strtoul(buf, &end, 10);
>> + if (*buf && (end[0] == '\0' || (end[0] == '\n' && end[1] == '\0'))) {
>> + bdi->dirty_expire_interval = interval;
>> + ret = count;
>> + }
>> + return ret;
>> +}
>> +BDI_SHOW(dirty_expire_interval, bdi->dirty_expire_interval)
>> +
>> #define __ATTR_RW(attr) __ATTR(attr, 0644, attr##_show, attr##_store)
>>
>> static struct device_attribute bdi_dev_attrs[] = {
>> __ATTR_RW(read_ahead_kb),
>> __ATTR_RW(min_ratio),
>> __ATTR_RW(max_ratio),
>> + __ATTR_RW(dirty_writeback_interval),
>> + __ATTR_RW(dirty_expire_interval),
>> __ATTR_NULL,
>> };
>>
>> @@ -291,7 +333,7 @@ void bdi_arm_supers_timer(void)
>> if (!dirty_writeback_interval)
>> return;
>>
>> - next = msecs_to_jiffies(dirty_writeback_interval * 10) + jiffies;
>> + next = msecs_to_jiffies(sync_supers_interval* 10) + jiffies;
>> mod_timer(&sync_supers_timer, round_jiffies_up(next));
>> }
>>
>> @@ -336,7 +378,7 @@ void bdi_wakeup_thread_delayed(struct ba
>> {
>> unsigned long timeout;
>>
>> - timeout = msecs_to_jiffies(dirty_writeback_interval * 10);
>> + timeout = msecs_to_jiffies(bdi->dirty_writeback_interval * 10);
>> mod_timer(&bdi->wb.wakeup_timer, jiffies + timeout);
>> }
>>
>> @@ -348,7 +390,19 @@ static unsigned long bdi_longest_inactiv
>> {
>> unsigned long interval;
>>
>> - interval = msecs_to_jiffies(dirty_writeback_interval * 10);
>> + interval = msecs_to_jiffies(shortest_dirty_writeback_interval * 10);
>> + return max(5UL * 60 * HZ, interval);
>> +}
>> +
>> +/*
>> + * Calculate the longest interval (jiffies) this bdi thread is allowed to be
>> + * inactive.
>> + */
>> +static unsigned long bdi_longest_inactive_this(struct backing_dev_info *bdi)
>> +{
>> + unsigned long interval;
>> +
>> + interval = msecs_to_jiffies(bdi->dirty_writeback_interval * 10);
>> return max(5UL * 60 * HZ, interval);
>> }
>>
>> @@ -422,7 +476,7 @@ static int bdi_forker_thread(void *ptr)
>> */
>> if (bdi->wb.task && !have_dirty_io &&
>> time_after(jiffies, bdi->wb.last_active +
>> - bdi_longest_inactive())) {
>> + bdi_longest_inactive_this(bdi))) {
>> task = bdi->wb.task;
>> bdi->wb.task = NULL;
>> spin_unlock(&bdi->wb_lock);
>> @@ -469,7 +523,7 @@ static int bdi_forker_thread(void *ptr)
>> break;
>>
>> case NO_ACTION:
>> - if (!wb_has_dirty_io(me) || !dirty_writeback_interval)
>> + if (!wb_has_dirty_io(me) || !me->bdi->dirty_writeback_interval)
>> /*
>> * There are no dirty data. The only thing we
>> * should now care about is checking for
>> @@ -479,7 +533,7 @@ static int bdi_forker_thread(void *ptr)
>> */
>> schedule_timeout(bdi_longest_inactive());
>> else
>> - schedule_timeout(msecs_to_jiffies(dirty_writeback_interval * 10));
>> + schedule_timeout(msecs_to_jiffies(me->bdi->dirty_writeback_interval * 10));
>> try_to_freeze();
>> /* Back to the main loop */
>> continue;
>> @@ -641,6 +695,8 @@ int bdi_init(struct backing_dev_info *bd
>> bdi->min_ratio = 0;
>> bdi->max_ratio = 100;
>> bdi->max_prop_frac = PROP_FRAC_BASE;
>> + bdi->dirty_writeback_interval = dirty_writeback_interval;
>> + bdi->dirty_expire_interval = dirty_expire_interval;
>> spin_lock_init(&bdi->wb_lock);
>> INIT_LIST_HEAD(&bdi->bdi_list);
>> INIT_LIST_HEAD(&bdi->work_list);
>> diff -uprN a/mm/page-writeback.c b/mm/page-writeback.c
>> --- a/mm/page-writeback.c 2011-08-05 10:29:21.000000000 +0530
>> +++ b/mm/page-writeback.c 2011-08-09 13:09:37.985919961 +0530
>> @@ -92,6 +92,11 @@ unsigned long vm_dirty_bytes;
>> unsigned int dirty_writeback_interval = 5 * 100; /* centiseconds */
>>
>> /*
>> + * The interval between sync_supers thread writebacks
>> + */
>> +unsigned int sync_supers_interval = 5 * 100; /* centiseconds */
>> +
>> +/*
>> * The longest time for which data is allowed to remain dirty
>> */
>> unsigned int dirty_expire_interval = 30 * 100; /* centiseconds */
>> @@ -686,8 +691,60 @@ void throttle_vm_writeout(gfp_t gfp_mask
>> int dirty_writeback_centisecs_handler(ctl_table *table, int write,
>> void __user *buffer, size_t *length, loff_t *ppos)
>> {
>> + struct backing_dev_info *bdi;
>> +
>> + proc_dointvec(table, write, buffer, length, ppos);
>> +
>> + if (write) {
>> + /* Traverse all the BDIs registered to the BDI list and reset their
>> + * bdi->dirty_writeback_interval to this value. */
>> + spin_lock_bh(&bdi_lock);
>> + list_for_each_entry(bdi, &bdi_list, bdi_list)
>> + bdi->dirty_writeback_interval = dirty_writeback_interval;
>> + spin_unlock_bh(&bdi_lock);
>> +
>> + sync_supers_interval =
>> + shortest_dirty_writeback_interval = dirty_writeback_interval;
>> +
>> + }
>> +
>> + bdi_arm_supers_timer();
>> +
>> + return 0;
>> +}
>> +
>> +/*
>> + * sysctl handler for /proc/sys/vm/sync_supers_centisecs
>> + */
>> +int sync_supers_centisecs_handler(ctl_table *table, int write,
>> + void __user *buffer, size_t *length, loff_t *ppos)
>> +{
>> proc_dointvec(table, write, buffer, length, ppos);
>> +
>> bdi_arm_supers_timer();
>> +
>> + return 0;
>> +}
>> +
>> +/*
>> + * sysctl handler for /proc/sys/vm/dirty_expire_centisecs
>> + */
>> +int dirty_expire_centisecs_handler(ctl_table *table, int write,
>> + void __user *buffer, size_t *length, loff_t *ppos)
>> +{
>> + struct backing_dev_info *bdi;
>> +
>> + proc_dointvec_minmax(table, write, buffer, length, ppos);
>> +
>> + if (write) {
>> + /* Traverse all the BDIs registered to the BDI list and reset their
>> + * bdi->dirty_expire_interval to this value. */
>> + spin_lock_bh(&bdi_lock);
>> + list_for_each_entry(bdi, &bdi_list, bdi_list)
>> + bdi->dirty_expire_interval = dirty_expire_interval;
>> + spin_unlock_bh(&bdi_lock);
>> + }
>> +
>> return 0;
>> }
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to majordomo@...r.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at http://www.tux.org/lkml/
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists