linux-kernel - Re: [PATCH] writeback: permit through good bdi even when global dirty exceeded


Message-ID: <20111202101606.GA1158@localhost>
Date:	Fri, 2 Dec 2011 18:16:06 +0800
From:	Wu Fengguang <fengguang.wu@...el.com>
To:	Andrew Morton <akpm@...ux-foundation.org>
Cc:	Matthew Wilcox <matthew@....cx>, Jan Kara <jack@...e.cz>,
	LKML <linux-kernel@...r.kernel.org>,
	"linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Theodore Ts'o <tytso@....edu>,
	Christoph Hellwig <hch@...radead.org>
Subject: Re: [PATCH] writeback: permit through good bdi even when global
 dirty exceeded

On Fri, Dec 02, 2011 at 04:29:50PM +0800, Wu Fengguang wrote:
> On Fri, Dec 02, 2011 at 03:03:59PM +0800, Andrew Morton wrote:
> > On Fri, 2 Dec 2011 14:36:03 +0800 Wu Fengguang <fengguang.wu@...el.com> wrote:
> > 
> > > --- linux-next.orig/mm/page-writeback.c	2011-12-02 10:16:21.000000000 +0800
> > > +++ linux-next/mm/page-writeback.c	2011-12-02 14:28:44.000000000 +0800
> > > @@ -1182,6 +1182,14 @@ pause:
> > >  		if (task_ratelimit)
> > >  			break;
> > >  
> > > +		/*
> > > +		 * In the case of an unresponding NFS server and the NFS dirty
> > > +		 * pages exceeds dirty_thresh, give the other good bdi's a pipe
> > > +		 * to go through, so that tasks on them still remain responsive.
> > > +		 */
> > > +		if (bdi_dirty < 8)
> > > +			break;
> > 
> > What happens if the local disk has nine dirty pages?
> 
> The 9 dirty pages will be cleaned by the flusher (likely in one shot),
> so after a while the dirtier task can dirty 8 pages more. This
> consumer-producer work flow can keep going on as long as the magic
> number chosen is >= 1.
> 
> > Also: please, no more magic numbers.  We have too many in there already.
> 
> Good point. Let's add some comment on the number chosen?

I ran a dd test against the local disk (with a stalled NFS mount) and
found that it always idles for several seconds before making a little
progress. The trace confirms that bdi_dirty remains at 8 even after the
flusher has done its work.

So the number is raised to bdi_stat_error to cover the accounting errors
in bdi_dirty. Here is the updated patch.
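
Before the patch itself, the reasoning above can be modelled in a few
lines of user-space C. This is only a sketch, not kernel code: the helper
names bdi_may_pass() and approx_read() are hypothetical, and the error
bound of 32 pages used in the note below is an assumed value chosen for
illustration, not what bdi_stat_error() returns on any particular machine.

```c
#include <stdbool.h>

/*
 * Illustration only: a user-space model of the check, not kernel code.
 * Both helpers are hypothetical names introduced for this sketch.
 */

/*
 * What a reader of an approximate counter may observe: the true number
 * of dirty pages plus a residue that has not been folded in yet.
 */
static unsigned long approx_read(unsigned long true_dirty,
				 unsigned long residue)
{
	return true_dirty + residue;
}

/*
 * The shape of the check added by the patch: let the dirtier through
 * when the bdi appears to hold fewer than thresh dirty pages.
 */
static bool bdi_may_pass(unsigned long bdi_dirty_read, unsigned long thresh)
{
	return bdi_dirty_read < thresh;
}
```

With the flusher done (true_dirty = 0) but the counter still reading 8,
bdi_may_pass(approx_read(0, 8), 8) is false, so the task keeps sleeping,
which matches the stalls seen in the dd test. With a threshold at the
assumed error bound, bdi_may_pass(approx_read(0, 8), 32) is true and the
task proceeds.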

---
Subject: writeback: permit through good bdi even when global dirty exceeded
Date: Fri Dec 02 10:21:33 CST 2011

On a system with 1 local mount and 1 NFS mount, if the NFS server
stops responding while dd is writing to the NFS mount, the NFS dirty
pages may exceed the global dirty limit and _every_ task that writes
will be blocked. The whole system appears unresponsive.

The workaround is to let through the bdi's that have only a small
number of dirty pages. The number chosen (bdi_stat_error pages) is not
enough for the local disk to run at optimal throughput, but is enough
to keep the system responsive on a broken NFS mount. The user can then
kill the dirtiers on the NFS mount and increase the global dirty limit
to bring up the local disk's throughput.

This risks allowing dirty pages to grow much larger than the global dirty
limit when there are 1000+ mounts; however, that is very unlikely to
happen, especially in low memory profiles.

Signed-off-by: Wu Fengguang <fengguang.wu@...el.com>
---
 mm/page-writeback.c |   13 +++++++++++++
 1 file changed, 13 insertions(+)

--- linux-next.orig/mm/page-writeback.c	2011-12-02 17:01:09.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-12-02 18:07:01.000000000 +0800
@@ -1182,6 +1182,19 @@ pause:
 		if (task_ratelimit)
 			break;
 
+		/*
+		 * In the case of an unresponding NFS server and the NFS dirty
+		 * pages exceeds dirty_thresh, give the other good bdi's a pipe
+		 * to go through, so that tasks on them still remain responsive.
+		 *
+		 * In theory 1 page is enough to keep the consumer-producer
+		 * pipe going: the flusher cleans 1 page => the task dirties 1
+		 * more page. However bdi_dirty has accounting errors.  So use
+		 * the larger and more IO friendly bdi_stat_error.
+		 */
+		if (bdi_dirty < bdi_stat_error(bdi))
+			break;
+
 		if (fatal_signal_pending(current))
 			break;
 	}