Message-ID: <20100425033315.GC667@thunk.org>
Date:	Sat, 24 Apr 2010 23:33:15 -0400
From:	tytso@....edu
To:	Dave Chinner <david@...morbit.com>
Cc:	linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
	xfs@....sgi.com
Subject: Re: [PATCH 3/4] writeback: pay attention to wbc->nr_to_write in
 write_cache_pages
On Tue, Apr 20, 2010 at 12:41:53PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@...hat.com>
> 
> If a filesystem writes more than one page in ->writepage, write_cache_pages
> fails to notice this and continues to attempt writeback when wbc->nr_to_write
> has gone negative - this trace was captured from XFS:
> 
> 
>     wbc_writeback_start: towrt=1024
>     wbc_writepage: towrt=1024
>     wbc_writepage: towrt=0
>     wbc_writepage: towrt=-1
>     wbc_writepage: towrt=-5
>     wbc_writepage: towrt=-21
>     wbc_writepage: towrt=-85
> 
> This has adverse effects on filesystem writeback behaviour. write_cache_pages()
> needs to terminate after a certain number of pages are written, not after a
> certain number of calls to ->writepage are made. Make it observe the current
> value of wbc->nr_to_write and treat a value of <= 0 as though it is either a
> termination condition or a trigger to reset to MAX_WRITEBACK_PAGES for data
> integrity syncs.

Be careful here.  If you are going to write more pages than what the
writeback code has requested (the stupid no more than 1024 pages
restriction in the writeback code before it jumps to start writing
some other inode), you actually need to let the returned
wbc->nr_to_write go negative, so that wb_writeback() knows how many
pages it has written.

In other words, the writeback code assumes that

  <original value of nr_to_write> - <returned wbc->nr_to_write>

is

  <number of pages actually written>

If you don't let wbc->nr_to_write go negative, the writeback code will
be confused about how many pages were _actually_ written, and the
writeback code ends up writing too much.  See commit 2faf2e1.
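
To make the accounting concrete, here is a minimal user-space sketch in
C.  It is not kernel code: struct wbc_sketch, fs_writepage_sketch() and
run_pass() are hypothetical stand-ins, and the cluster size used below
is an invented number.  It only illustrates how clamping nr_to_write at
zero hides pages from a caller that computes "original minus returned".

```c
/* Hypothetical stand-in for the single writeback_control field used here. */
struct wbc_sketch {
	long nr_to_write;	/* remaining budget; allowed to go negative */
};

struct pass_result {
	long actually_written;	/* pages the filesystem really wrote */
	long accounted;		/* pages the caller believes were written */
};

/*
 * Hypothetical ->writepage that clusters: each call writes 'cluster' pages.
 * With 'clamp' set, it refuses to let the budget drop below zero.
 */
static void fs_writepage_sketch(struct wbc_sketch *wbc, long cluster,
				int clamp, long *written)
{
	*written += cluster;
	wbc->nr_to_write -= cluster;
	if (clamp && wbc->nr_to_write < 0)
		wbc->nr_to_write = 0;	/* the overshoot is lost here */
}

/* One writeback pass: call writepage until the budget is used up. */
static struct pass_result run_pass(long budget, long cluster, int clamp)
{
	struct wbc_sketch wbc = { .nr_to_write = budget };
	struct pass_result res = { 0, 0 };

	while (wbc.nr_to_write > 0)
		fs_writepage_sketch(&wbc, cluster, clamp,
				    &res.actually_written);

	/* What the caller infers: original budget minus what is left. */
	res.accounted = budget - wbc.nr_to_write;
	return res;
}
```

With a budget of 1024 and a single 1500-page cluster, run_pass(1024,
1500, 0) leaves nr_to_write at -476 and accounts for all 1500 pages,
while run_pass(1024, 1500, 1) accounts for only 1024 of them.
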

All of this is a crock of course.  The file system shouldn't be
second-guessing the writeback code.  Instead the writeback code should
be adaptively measuring how long it takes to write out N pages
to a particular block device, and then decide what's the appropriate
setting for nr_to_write.  What makes sense for a USB stick, or a 4200
RPM laptop drive, may not make sense for a massive RAID array....

But since we don't have that, both XFS and ext4 have workarounds for
brain-damaged writeback behaviour.  (I did some testing, and even for
standard laptop drives the cap of 1024 pages is just Way Too Small;
that limit was set something like a decade ago, and everyone has been
afraid to change it, even though disks have gotten a wee bit faster
since those days.)
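
For illustration only, the adaptive sizing idea mentioned above could
look roughly like the C sketch below.  Nothing like this exists in the
writeback code discussed here: next_chunk_pages(), the target time per
chunk, and both limits are assumptions made up for the example.  The
idea is simply to scale the next nr_to_write by the throughput observed
on the previous chunk.

```c
#include <stdint.h>

/* Assumed limits: the existing 1024-page cap as a floor, arbitrary ceiling. */
#define MIN_CHUNK_PAGES 1024
#define MAX_CHUNK_PAGES (1 << 20)

/*
 * Hypothetical helper: given how many pages the last chunk wrote and how
 * long that took, pick a chunk size that should take about target_ms on
 * the same device.
 */
static int64_t next_chunk_pages(int64_t pages_written, int64_t elapsed_ms,
				int64_t target_ms)
{
	int64_t chunk;

	/* No usable measurement yet: fall back to the floor. */
	if (pages_written <= 0 || elapsed_ms <= 0)
		return MIN_CHUNK_PAGES;

	/* pages per ms observed, multiplied by the time we want to spend. */
	chunk = pages_written * target_ms / elapsed_ms;

	if (chunk < MIN_CHUNK_PAGES)
		chunk = MIN_CHUNK_PAGES;
	if (chunk > MAX_CHUNK_PAGES)
		chunk = MAX_CHUNK_PAGES;
	return chunk;
}
```

A device that wrote 1024 pages in 100 ms would get 5120 pages for a
500 ms target, while one that needed a full second for the same 1024
pages would stay at the floor.
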
    	   	      	       	     	 - Ted