Date:	Thu, 1 Oct 2009 16:22:43 +0200
From:	Jan Kara <jack@...e.cz>
To:	Wu Fengguang <fengguang.wu@...el.com>
Cc:	Jan Kara <jack@...e.cz>, Peter Zijlstra <peterz@...radead.org>,
	Chris Mason <chris.mason@...cle.com>,
	Artem Bityutskiy <dedekind1@...il.com>,
	Jens Axboe <jens.axboe@...cle.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
	"david@...morbit.com" <david@...morbit.com>,
	"hch@...radead.org" <hch@...radead.org>,
	"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
	Theodore Ts'o <tytso@....edu>
Subject: Re: [PATCH 8/8] vm: Add a tuning knob for vm.max_writeback_mb

On Thu 01-10-09 21:36:10, Wu Fengguang wrote:
> > > --- linux.orig/fs/fs-writeback.c	2009-09-28 18:57:51.000000000 +0800
> > > +++ linux/fs/fs-writeback.c	2009-09-28 19:02:45.000000000 +0800
> > > @@ -25,6 +25,7 @@
> > >  #include <linux/blkdev.h>
> > >  #include <linux/backing-dev.h>
> > >  #include <linux/buffer_head.h>
> > > +#include <linux/completion.h>
> > >  #include "internal.h"
> > >  
> > >  #define inode_to_bdi(inode)	((inode)->i_mapping->backing_dev_info)
> > > @@ -136,14 +137,14 @@ static void wb_work_complete(struct bdi_
> > >  		call_rcu(&work->rcu_head, bdi_work_free);
> > >  }
> > >  
> > > -static void wb_clear_pending(struct bdi_writeback *wb, struct bdi_work *work)
> > > +static void wb_clear_pending(struct backing_dev_info *bdi,
> > > +			     struct bdi_work *work)
> > >  {
> > >  	/*
> > >  	 * The caller has retrieved the work arguments from this work,
> > >  	 * drop our reference. If this is the last ref, delete and free it
> > >  	 */
> > >  	if (atomic_dec_and_test(&work->pending)) {
> > > -		struct backing_dev_info *bdi = wb->bdi;
> > >  
> > >  		spin_lock(&bdi->wb_lock);
> > >  		list_del_rcu(&work->list);
> > > @@ -275,6 +276,81 @@ void bdi_start_writeback(struct backing_
> > >  	bdi_alloc_queue_work(bdi, &args);
> > >  }
> > >  
> > > +struct dirty_throttle_task {
> > > +	long			nr_pages;
> > > +	struct list_head	list;
> > > +	struct completion	complete;
> > > +};
> > > +
> > > +void bdi_writeback_wait(struct backing_dev_info *bdi, long nr_pages)
> > > +{
> > > +	struct dirty_throttle_task tt = {
> > > +		.nr_pages = nr_pages,
> > > +		.complete = COMPLETION_INITIALIZER_ONSTACK(tt.complete),
> > > +	};
> > > +	struct wb_writeback_args args = {
> > > +		.sync_mode	= WB_SYNC_NONE,
> > > +		.nr_pages	= LONG_MAX,
> > > +		.range_cyclic	= 1,
> > > +		.for_background	= 1,
> > > +	};
> > > +	struct bdi_work work;
> > > +
> > > +	bdi_work_init(&work, &args);
> > > +	work.state |= WS_ONSTACK;
> > > +
> > > +	/*
> > > +	 * make sure we will be waken up by someone
> > > +	 */
> > > +	bdi_queue_work(bdi, &work);
> >   This is wrong; you shouldn't submit the work like this, because you'll
> > have to wait for its completion (the wb_clear_pending() below is just
> > bogus). You should rather do bdi_start_writeback(bdi, NULL, 0).
> 
> No, I don't intend to wait for completion of this work (that would
> wait too long). This bdi work is there to ensure that writeback IO
> submissions are in progress, so that __bdi_writeout_inc() gets called
> to decrease bdi->throttle_pages; when that counts down to 0, it wakes
> up this process.
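
  For reference, the countdown path described above is the hunk Wu quotes
further down: in __bdi_writeout_inc(), with the usual per-page accounting
elided, it amounts to

	/* charge one completed page against the waiters' quota; once
	 * the quota is used up, wake the head waiter */
	if (atomic_read(&bdi->throttle_pages) < DIRTY_THROTTLE_PAGES_STOP &&
	    atomic_dec_and_test(&bdi->throttle_pages))
		bdi_writeback_wakeup(bdi);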
> 
> The alternative way is to do
> 
>         if (no background work queued)
>                 bdi_start_writeback(bdi, NULL, 0)
> 
> It looks like a saner solution, thanks for the suggestion :)
  Yes, but you'll have a hard time finding out whether there's background
work queued or not. So I suggest you just queue the background writeout
unconditionally.
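
  Concretely, I mean something like the following sketch of
bdi_writeback_wait() (it drops the on-stack bdi_work and the WS_ONSTACK
handling entirely, relying on bdi_start_writeback() to allocate and queue
its own work item):

	void bdi_writeback_wait(struct backing_dev_info *bdi, long nr_pages)
	{
		struct dirty_throttle_task tt = {
			.nr_pages = nr_pages,
			.complete = COMPLETION_INITIALIZER_ONSTACK(tt.complete),
		};

		/* always kick background writeback; nothing on our
		 * stack needs to outlive this function */
		bdi_start_writeback(bdi, NULL, 0);

		/* register as a throttled task, then sleep until
		 * bdi_writeback_wakeup() completes us */
		spin_lock(&bdi->throttle_lock);
		if (list_empty(&bdi->throttle_list))
			atomic_set(&bdi->throttle_pages, nr_pages);
		list_add(&tt.list, &bdi->throttle_list);
		spin_unlock(&bdi->throttle_lock);

		wait_for_completion(&tt.complete);
	}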

> > > +
> > > +	/*
> > > +	 * register throttle pages
> > > +	 */
> > > +	spin_lock(&bdi->throttle_lock);
> > > +	if (list_empty(&bdi->throttle_list))
> > > +		atomic_set(&bdi->throttle_pages, nr_pages);
> > > +	list_add(&tt.list, &bdi->throttle_list);
> > > +	spin_unlock(&bdi->throttle_lock);
> > > +
> > > +	wait_for_completion(&tt.complete);
> 
> > > +	wb_clear_pending(bdi, &work); /* XXX */
> 
> For the above reason, I remove the work here and don't care whether it
> has been executed, is still running, or was never seen at all. We have
> already been woken up.
> 
> Sorry, I definitely "misused" wb_clear_pending() for a slightly
> different purpose...
> 
> This doesn't really cancel the work if it is already running.
> So bdi_writeback_wait() achieves another goal: starting background
> writeback if the bdi-flush thread was previously idle.
> 
> > > +}
> > > +
> > > +/*
> > > + * return 1 if there are more waiting tasks.
> > > + */
> > > +int bdi_writeback_wakeup(struct backing_dev_info *bdi)
> > > +{
> > > +	struct dirty_throttle_task *tt;
> > > +
> > > +	spin_lock(&bdi->throttle_lock);
> > > +	/*
> > > +	 * remove and wakeup head task
> > > +	 */
> > > +	if (!list_empty(&bdi->throttle_list)) {
> > > +		tt = list_entry(bdi->throttle_list.prev,
> > > +				struct dirty_throttle_task, list);
> > > +		list_del(&tt->list);
> > > +		complete(&tt->complete);
> > > +	}
> > > +	/*
> > > +	 * update throttle pages
> > > +	 */
> > > +	if (!list_empty(&bdi->throttle_list)) {
> > > +		tt = list_entry(bdi->throttle_list.prev,
> > > +				struct dirty_throttle_task, list);
> > > +		atomic_set(&bdi->throttle_pages, tt->nr_pages);
> > > +	} else {
> > > +		tt = NULL;
> > > +		atomic_set(&bdi->throttle_pages, DIRTY_THROTTLE_PAGES_STOP * 2);
> >   Why is the * 2 here?
>  
> Because we do a racy test in another place:
> 
>     +	if (atomic_read(&bdi->throttle_pages) < DIRTY_THROTTLE_PAGES_STOP &&
>     +	    atomic_dec_and_test(&bdi->throttle_pages))
>     +		bdi_writeback_wakeup(bdi);
> 
> The *2 is there to reduce the chance of the race. It might still be racy,
> but that's OK, because it's mainly an optimization. It's perfectly correct
> if we simply do
  Ah, I see. OK, then it deserves at least a comment...

>     +	if (atomic_dec_and_test(&bdi->throttle_pages))
>     +		bdi_writeback_wakeup(bdi);
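
  Such a comment might read, for example (illustrative wording only):

	} else {
		tt = NULL;
		/*
		 * No waiters left: park the counter well above
		 * DIRTY_THROTTLE_PAGES_STOP.  Completions that raced past
		 * the atomic_read() check may still decrement it a few
		 * times, and the extra headroom keeps those stragglers
		 * from dragging it below the STOP threshold.  This is
		 * only an optimization; the scheme is correct without it.
		 */
		atomic_set(&bdi->throttle_pages, DIRTY_THROTTLE_PAGES_STOP * 2);
	}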
> 
> > > +	}
> > > +	spin_unlock(&bdi->throttle_lock);
> > > +
> > > +	return tt != NULL;
> > > +}
> > > +
> > >  /*
> > >   * Redirty an inode: set its when-it-was dirtied timestamp and move it to the
> > >   * furthest end of its superblock's dirty-inode list.
> > > @@ -788,8 +864,11 @@ static long wb_writeback(struct bdi_writ
> > >  		 * For background writeout, stop when we are below the
> > >  		 * background dirty threshold
> > >  		 */
> > > -		if (args->for_background && !over_bground_thresh())
> > > +		if (args->for_background && !over_bground_thresh()) {
> > > +			while (bdi_writeback_wakeup(wb->bdi))
> > > +				;  /* unthrottle all tasks */
> > >  			break;
> > > +		}
> >   You probably didn't understand my comment in the previous email. This is
> > too late to wake up all the tasks. There are two limits: background_limit
> > (set to 5%) and dirty_limit (set to 10%). When the amount of dirty data is
> > above background_limit, we start the writeback but we don't throttle tasks
> > yet. We start throttling tasks only when the amount of dirty data on the
> > bdi exceeds the part of the dirty limit belonging to that bdi. In the case
> > of a single bdi, this means we start throttling threads only when 10% of
> > memory is dirty. To keep this behavior, we have to wake up waiting threads
> > as soon as their BDI gets below its dirty limit, or when the global number
> > of dirty pages gets below (background_limit + dirty_limit) / 2.
> 
> Sure, but the design goal is to wake up the throttled tasks in the
> __bdi_writeout_inc() path instead of here. As long as some (background)
> writeback is running, __bdi_writeout_inc() will be called to wake up
> the tasks.  This "unthrottle all on exit of background writeback" is
> merely a safeguard, since once background writeback (which could have
> been queued by the throttled task itself, in bdi_writeback_wait()) exits,
> the calls to __bdi_writeout_inc() are likely to stop.
  The thing is: in the old code, tasks returned from balance_dirty_pages()
as soon as we got below dirty_limit, regardless of how much they managed to
write. So we want to wake them up from waiting as soon as we get below the
dirty limit (maybe a bit later, so that they don't immediately block again,
but I hope you get the point).
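
  In code form, the wakeup criterion I have in mind would be something
like this sketch (bdi_nr_dirty, bdi_thresh, nr_dirty, background_thresh
and dirty_thresh stand for the values balance_dirty_pages() already
computes; they are not names from this patch):

	/* wake the throttled tasks once this bdi is back under its share
	 * of the dirty limit, or once the global dirty count has fallen
	 * halfway between the background and dirty limits */
	if (bdi_nr_dirty < bdi_thresh ||
	    nr_dirty < (background_thresh + dirty_thresh) / 2)
		while (bdi_writeback_wakeup(bdi))
			;	/* unthrottle all waiting tasks */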

							Honza
-- 
Jan Kara <jack@...e.cz>
SUSE Labs, CR