linux-kernel - Re: regression in page writeback

Message-ID: <20091006131840.GA14111@localhost>
Date:	Tue, 6 Oct 2009 21:18:40 +0800
From:	Wu Fengguang <fengguang.wu@...el.com>
To:	Jan Kara <jack@...e.cz>
Cc:	Theodore Tso <tytso@....edu>,
	Christoph Hellwig <hch@...radead.org>,
	Dave Chinner <david@...morbit.com>,
	Chris Mason <chris.mason@...cle.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	"Li, Shaohua" <shaohua.li@...el.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"richard@....demon.co.uk" <richard@....demon.co.uk>,
	"jens.axboe@...cle.com" <jens.axboe@...cle.com>
Subject: Re: regression in page writeback

On Tue, Oct 06, 2009 at 08:55:19PM +0800, Jan Kara wrote:
> On Fri 02-10-09 11:27:14, Wu Fengguang wrote:
> > On Fri, Oct 02, 2009 at 06:17:39AM +0800, Jan Kara wrote:
> > > On Wed 30-09-09 13:32:23, Wu Fengguang wrote:
> > > > writeback: bump up writeback chunk size to 128MB
> > > > 
> > > > Adjust the writeback call stack to support larger writeback chunk size.
> > > > 
> > > > - make wbc.nr_to_write a per-file parameter
> > > > - init wbc.nr_to_write with MAX_WRITEBACK_PAGES=128MB
> > > >   (proposed by Ted)
> > > > - add wbc.nr_segments to limit seeks inside sparsely dirtied file
> > > >   (proposed by Chris)
> > > > - add wbc.timeout which will be used to control IO submission time
> > > >   either per-file or globally.
> > > >   
> > > > The wbc.nr_segments is now determined purely by logical page index
> > > > distance: if two pages are 1MB apart, it makes a new segment.
> > > > 
> > > > Filesystems could do this better with real extent knowledge.
> > > > One possible scheme is to record the previous page index in
> > > > wbc.writeback_index, and let ->writepage compare if the current and
> > > > previous pages lie in the same extent, and decrease wbc.nr_segments
> > > > accordingly. Care should be taken to avoid double decreases in writepage
> > > > and write_cache_pages.
> > > > 
> > > > The wbc.timeout (when used per-file) is mainly a safeguard against slow
> > > > devices, which may take too long to sync 128MB of data.
> > > > 
> > > > The wbc.timeout (when used globally) could be useful when we decide to
> > > > do two sync scans on dirty pages and dirty metadata. XFS could say:
> > > > please return to sync dirty metadata after 10s. Would need another
> > > > b_io_metadata queue, but that's possible.
> > > > 
> > > > This work depends on the balance_dirty_pages() wait queue patch.
> > >   I don't know, I think it gets too complicated... I'd either use the
> > > segments idea or the timeout idea but not both (unless you can find real
> > > world tests in which both help).
>   I'm sorry for a delayed reply but I had to work on something else.
> 
> > Maybe complicated, but nr_segments and timeout each has their target
> > application.  nr_segments serves two major purposes:
> > - fairness between two large files, one is continuously dirtied,
> >   another is sparsely dirtied. Given the same amount of dirty pages,
> >   it could take vastly different time to sync them to the _same_
> >   device. The nr_segments check helps to favor continuous data.
> > - avoid seeks/fragmentation. To give each file a fair chance of
> >   writeback, we have to abort a file when some nr_to_write or timeout
> >   is reached. However, neither of them is a good abort condition.
> >   The best is for the filesystem to abort earlier, at seek boundaries,
> >   and treat nr_to_write/timeout as large enough bottom lines.
> > timeout is mainly a safeguard in case nr_to_write is too large for
> > slow devices. It is not necessary if nr_to_write is auto-computed,
> > however timeout in itself serves as a simple throughput adapting
> > scheme.
>   I understand why you have introduced both segments and timeout value
> and I completely agree with your reasons to introduce them. I just think
> that when the system gets too complex (there will be several independent
> methods of determining when writeback should be terminated, and even
> though each method is simple on its own, their interactions needn't be
> simple...) it will be hard to debug all the corner cases - even more
> because they will manifest "just" by slow or unfair writeback. So I'd

I definitely agree on the complications. There are some known issues
as well as possibly some corner cases to be discovered. One problem I
noticed now is, what if all the files are sparsely dirtied? Then
a small nr_segments can only hurt.  Another problem is, the block
device file tends to have sparsely dirtied pages (with metadata on
them).  Not sure how to detect/handle such conditions.
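
To make the sparse-file concern concrete, here is a minimal userspace
sketch of the index-distance rule described in the patch text (two pages
1MB apart start a new segment). It is not code from the patch: the
function names, the 4KB page size and the sorted-array input are all
assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 4KB pages, so a 1MB gap equals 256 page indexes. */
#define SEGMENT_GAP_PAGES 256UL

/*
 * Count the segments in a sorted array of dirty page indexes.
 * A new segment starts whenever the next dirty page lies at least
 * 'gap' pages beyond the previous dirty page.
 */
static size_t count_segments(const unsigned long *idx, size_t n,
			     unsigned long gap)
{
	size_t segments, i;

	if (n == 0)
		return 0;

	segments = 1;
	for (i = 1; i < n; i++) {
		assert(idx[i] >= idx[i - 1]);	/* input must be sorted */
		if (idx[i] - idx[i - 1] >= gap)
			segments++;
	}
	return segments;
}

/*
 * Return how many pages get written before a budget of 'nr_segments'
 * segments runs out (hypothetical abort rule: stop at the seek that
 * would start one segment too many).
 */
static size_t pages_before_abort(const unsigned long *idx, size_t n,
				 unsigned long gap, size_t nr_segments)
{
	size_t used, i;

	if (n == 0 || nr_segments == 0)
		return 0;

	used = 1;	/* the first page opens the first segment */
	for (i = 1; i < n; i++) {
		if (idx[i] - idx[i - 1] >= gap) {
			if (used == nr_segments)
				return i;	/* budget exhausted at this seek */
			used++;
		}
	}
	return n;
}
```

With a budget of two segments, a file whose eight dirty pages are
contiguous is written in full, while a file whose eight dirty pages are
each 1MB apart is aborted after two pages. If every file looks like the
second one, the small budget only adds aborts without saving any seeks.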

> prefer a single metric to determine when to stop writeback of an inode
> even though it might be a bit more complicated.
>   For example terminating on writeout does not really get a file fair
> chance of writeback because it might have been blocked just because we were
> writing some heavily fragmented file just before. And your nr_segments

You mean timeout? I've dropped that idea in favor of an nr_to_write
adaptive to the bdi write speed :)
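
One way such an adaptive nr_to_write might be derived is sketched below.
This is only an illustration, not the actual patch: the function name,
the idea of a target submission time, and the 4MB floor are assumptions;
only the 128MB ceiling comes from the MAX_WRITEBACK_PAGES value quoted
above.

```c
#include <stddef.h>

/* Assumption: 4KB pages. */
#define PAGE_SIZE_BYTES	4096UL
/* Hypothetical floor of 4MB so slow devices still get a useful chunk. */
#define MIN_CHUNK_PAGES	1024UL
/* Ceiling of 128MB, the MAX_WRITEBACK_PAGES value from the patch text. */
#define MAX_CHUNK_PAGES	32768UL

/*
 * Pick a per-file writeback chunk (in pages) so that writing it takes
 * roughly 'target_ms' at the device's estimated write bandwidth,
 * clamped to [MIN_CHUNK_PAGES, MAX_CHUNK_PAGES].
 */
static unsigned long adaptive_nr_to_write(unsigned long write_bw_bytes_per_sec,
					  unsigned long target_ms)
{
	/* 64-bit intermediate avoids overflow where long is 32 bits */
	unsigned long long bytes =
		(unsigned long long)write_bw_bytes_per_sec * target_ms / 1000;
	unsigned long long pages = bytes / PAGE_SIZE_BYTES;

	if (pages < MIN_CHUNK_PAGES)
		return MIN_CHUNK_PAGES;
	if (pages > MAX_CHUNK_PAGES)
		return MAX_CHUNK_PAGES;
	return (unsigned long)pages;
}
```

With a one second target, a 1MB/s device gets the 4MB floor, a 64MB/s
disk gets 64MB (16384 pages), and a 1GB/s device is capped at 128MB.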

> check is just a rough guess of whether a writeback is going to be
> fragmented or not.

It could be made accurate if btrfs decreases it in its own writepages,
based on the extent info. Should also be possible for ext4.

>   So I'd rather implement in mpage_ functions a proper detection of how
> fragmented the writeback is and give each inode a limit on number of
> fragments which mpage_ functions would obey. We could even use a queue's
> NONROT flag (set for solid state disks) to detect whether we should expect
> higher or lower seek times.

Yes, mpage_* can also utilize nr_segments.

Anyway nr_segments is not perfect, I'll post a patch and let fs
developers decide whether it is convenient/useful :) 

Thanks,
Fengguang