linux-kernel - Re: [patch 1/3] fix illogical behavior in balance_dirty

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <20070325153508.10922ebd.akpm@linux-foundation.org>
Date:	Sun, 25 Mar 2007 15:35:08 -0800
From:	Andrew Morton <akpm@...ux-foundation.org>
To:	Miklos Szeredi <miklos@...redi.hu>
Cc:	dgc@....com, linux-kernel@...r.kernel.org
Subject: Re: [patch 1/3] fix illogical behavior in balance_dirty_pages()

On Sat, 24 Mar 2007 22:55:29 +0100 Miklos Szeredi <miklos@...redi.hu> wrote:

> This is a slightly different take on the fix for the deadlock in fuse
> with dirty balancing.  David Chinner convinced me, that per-bdi
> counters are too expensive, and that it's not worth trying to account
> the number of pages under writeback, as they will be limited by the
> queue anyway.
> 
> ----
> From: Miklos Szeredi <mszeredi@...e.cz>
> 
> Current behavior of balance_dirty_pages() is to try to start writeout
> into the specified queue for at least "write_chunk" number of pages.
> If "write_chunk" pages have been submitted, then return.
> 
> However if there are less than "write_chunk" dirty pages for this
> queue, then it doesn't return, waiting for the global dirty+writeback
> counters to subside, but without doing any actual work.
> 
> This is illogical behavior: it allows more dirtyings while there are
> dirty pages, but stops further dirtying completely if there are no
> more dirty pages.

That behaviour is perfectly logical.  It prevents the number of
dirty+writeback pages from exceeding dirty_ratio.

> It also makes a deadlock possible when one filesystem is writing data
> through another, and the balance_dirty_pages() for the lower
> filesystem is stalling the writeback for the upper filesystem's
> data (*).

I still don't understand this one.  I got lost when belatedly told that
i_mutex had something to do with it.

> So the exit condition should instead be:
> 
>   submitted at least "write_chunk" number of pages
> OR
>     submitted ALL the dirty pages destined for this backing dev
>   AND
>     backing dev is not congested
> 
> To do this, introduce a new counter in writeback_control, which counts
> the number of dirty pages encountered during writeback.  This includes
> all dirty pages, even those which are already under writeback but have
> been dirtied again, and those which have been skipped due to having
> locked buffers.
> 
> If this counter is zero after trying to submit some pages for
> writeback, and the backing dev is uncongested, then don't wait any
> more.  After this, newly dirtied pages can quickly be written back to
> this backing dev.
> 
> If there are globally no more pages to submit for writeback
> (nr_reclaimable == 0), then also don't wait for ever, only while this
> backing dev is congested.
> 
> (*) For more info on this deadlock, see the following discussions:
> 
>   http://lkml.org/lkml/2007/3/1/9
>   http://lkml.org/lkml/2007/3/12/16
> 
> Signed-off-by: Miklos Szeredi <mszeredi@...e.cz>
> ---
> 
> Index: linux/include/linux/writeback.h
> ===================================================================
> --- linux.orig/include/linux/writeback.h	2007-03-24 22:06:56.000000000 +0100
> +++ linux/include/linux/writeback.h	2007-03-24 22:29:02.000000000 +0100
> @@ -44,6 +44,7 @@ struct writeback_control {
>  	long nr_to_write;		/* Write this many pages, and decrement
>  					   this for each page written */
>  	long pages_skipped;		/* Pages which were not written */
> +	long nr_dirty;			/* Number of dirty pages encountered */
>  
>  	/*
>  	 * For a_ops->writepages(): is start or end are non-zero then this is
> Index: linux/mm/page-writeback.c
> ===================================================================
> --- linux.orig/mm/page-writeback.c	2007-03-24 22:06:56.000000000 +0100
> +++ linux/mm/page-writeback.c	2007-03-24 22:29:02.000000000 +0100
> @@ -207,7 +207,15 @@ static void balance_dirty_pages(struct a
>  		 * written to the server's write cache, but has not yet
>  		 * been flushed to permanent storage.
>  		 */
> -		if (nr_reclaimable) {
> +		if (!nr_reclaimable) {
> +			/*
> +			 * If there's nothing more to write back and this queue
> +			 * is uncongested,  then it is possible to quickly
> +			 * write out some more data, so let's not wait
> +			 */
> +			if (!bdi_write_congested(bdi))
> +				break;
> +		} else {

This says "if there are no dirty pages in the machine at all, then go back
and dirty some more pages, regardless of the present number of
under-writeback pages".

>  			writeback_inodes(&wbc);
>  			get_dirty_limits(&background_thresh,
>  					 	&dirty_thresh, mapping);
> @@ -220,6 +228,14 @@ static void balance_dirty_pages(struct a
>  			pages_written += write_chunk - wbc.nr_to_write;
>  			if (pages_written >= write_chunk)
>  				break;		/* We've done our duty */
> +
> +			/*
> +			 * If there are no more dirty pages for this backing
> +			 * backing dev, and the queue is not congested, then
> +			 * it is possible to quickly write out some more data
> +			 */
> +			if (!wbc.nr_dirty && !bdi_write_congested(bdi))
> +				break;

This says "if there are no pages dirty againt this device then go back and
dirty some more pages, regardless of the present number of under-writeback
pages".



IOW: this change will allow us to 100% fill all memory with under-writeback
pages.

The VM _should_ cope with that - it kinda used to.  But it is an untested
region of operation and the chances of bogus oom-killings are excellent.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/