Message-ID: <20090930141158.GG24383@mit.edu>
Date:	Wed, 30 Sep 2009 10:11:58 -0400
From:	Theodore Tso <tytso@....edu>
To:	Wu Fengguang <fengguang.wu@...el.com>
Cc:	Christoph Hellwig <hch@...radead.org>,
	Dave Chinner <david@...morbit.com>,
	Chris Mason <chris.mason@...cle.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	"Li, Shaohua" <shaohua.li@...el.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"richard@....demon.co.uk" <richard@....demon.co.uk>,
	"jens.axboe@...cle.com" <jens.axboe@...cle.com>
Subject: Re: regression in page writeback

On Wed, Sep 30, 2009 at 01:26:57PM +0800, Wu Fengguang wrote:
> It's good to increase MAX_WRITEBACK_PAGES, however I'm afraid
> max_contig_writeback_mb may be a burden in future: either it is not
> necessary, or a per-bdi counterpart must be introduced for all
> filesystems.

The per-filesystem tunable was just a short-term hack; the reason why
I did it that way was it was clear that a global tunable wouldn't fly,
and rightly so --- what might be suitable for a slow USB stick might
be very different than a super-fast RAID array, and someone might very
well have both on the same system.

> And it's preferred to automatically handle slow devices well with the
> increased chunk size, instead of adding another parameter.

Agreed; long-term what we probably need is something which is
automatically tunable.  My thinking was that we should tune the
initial nr_to_write parameter based on how many blocks could be
written in some time interval, which is tunable.  So if we decide that
1 second is a suitable time period to be writing out one inode's dirty
pages, then for a fast server-class SATA disk, we might want to set
nr_to_write to be around 128MB worth of pages.  For a laptop SATA
disk, it might be around 64MB, and for a really slow USB stick, it
might be more like 16MB.  For a super-fast enterprise RAID array,
128MB might be too small!
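
To make the arithmetic concrete, here is a rough sketch of deriving
nr_to_write from a device's write bandwidth and a target time
interval.  The 4 KiB page size, the helper name nr_to_write_for(), and
the bandwidth figures (implied by the 1-second examples above) are
illustrative assumptions, not kernel code.

```python
# Illustrative only: PAGE_SIZE, the helper name, and the bandwidth
# figures are assumptions, not taken from any kernel source.
PAGE_SIZE = 4096          # assumed page size in bytes
MB = 1024 * 1024

def nr_to_write_for(bandwidth_bytes_per_sec, interval_sec=1.0,
                    page_size=PAGE_SIZE):
    """Number of pages the device can write in interval_sec."""
    pages = int(bandwidth_bytes_per_sec * interval_sec) // page_size
    return max(pages, 1)  # always allow at least one page of progress

if __name__ == "__main__":
    devices = [
        ("server-class SATA", 128 * MB),
        ("laptop SATA", 64 * MB),
        ("slow USB stick", 16 * MB),
    ]
    for name, bandwidth in devices:
        print(name, nr_to_write_for(bandwidth), "pages")
```

The interval stays the one tunable; the per-device page count then
falls out of whatever bandwidth is measured.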

If we get timing and/or congestion information from the block layer,
it wouldn't be hard to figure out the optimal number of pages that
should be sent down to the filesystem, and to tune this automatically.
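
One possible shape for that auto-tuning, as a sketch only: keep a
smoothed estimate of the device's write bandwidth from observed
completions and derive the page count from it.  The class name, the
exponential smoothing, the weight of 0.25, and the 4 KiB page size are
all assumptions made for illustration; nothing here reflects an
existing block layer interface.

```python
# Hypothetical sketch: names, smoothing weight, and page size are
# assumptions, not an existing kernel or block layer API.
class BandwidthEstimator:
    def __init__(self, initial_bytes_per_sec, weight=0.25):
        self.bytes_per_sec = float(initial_bytes_per_sec)
        self.weight = weight  # how strongly a new sample moves the estimate

    def update(self, bytes_written, elapsed_sec):
        """Fold one completed write (size and duration) into the estimate."""
        if elapsed_sec <= 0:
            return  # ignore samples with no usable timing information
        sample = bytes_written / elapsed_sec
        self.bytes_per_sec += self.weight * (sample - self.bytes_per_sec)

    def pages_for(self, interval_sec, page_size=4096):
        """Pages the device is expected to write in interval_sec."""
        return max(int(self.bytes_per_sec * interval_sec) // page_size, 1)

if __name__ == "__main__":
    est = BandwidthEstimator(16 * 1024 * 1024)
    est.update(64 * 1024 * 1024, 1.0)  # device turned out to be faster
    print(est.pages_for(1.0), "pages per second of writeback")
```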

> I scratched up a patch to demo the ideas collected in recent discussions.
> Can you check if it serves your needs? Thanks.

Sure, I'll definitely play with it, thanks.

> The wbc.timeout (when used per-file) is mainly a safeguard against slow
> devices, which may take too long time to sync 128MB data.

Maybe I'm missing something, but I don't think the wbc.timeout
approach is sufficient.  Consider the scenario of someone who is
ripping a DVD disc to an 8 gig USB stick.  The USB stick will be very
slow, but since the file is contiguous the filesystem will very
happily try to push it out there 128MB at a time, and the wbc.timeout
value isn't really going to help since a single call to writepages
could easily cause 128MB worth of data to be streamed out to the USB
stick.
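
A toy model of the problem, under stated assumptions: the timeout is
only checked between writepages-style calls, each call writes one full
chunk, and the USB stick sustains 4MB/s.  The function name and all of
the numbers are made up for illustration.

```python
# Toy model only: assumes the timeout is checked between calls and
# that the stick writes at 4MB/s; neither figure comes from a real system.
MB = 1024 * 1024

def writeback(total_bytes, chunk_bytes, device_bytes_per_sec, timeout_sec):
    """Return (bytes_written, elapsed_sec) for one timed writeback pass."""
    written = 0
    elapsed = 0.0
    while written < total_bytes and elapsed < timeout_sec:
        n = min(chunk_bytes, total_bytes - written)
        # One call streams the whole chunk; it cannot stop part way.
        elapsed += n / device_bytes_per_sec
        written += n
    return written, elapsed

if __name__ == "__main__":
    # 128MB chunks: the 1 second timeout is overshot by a wide margin.
    print(writeback(1024 * MB, 128 * MB, 4 * MB, timeout_sec=1.0))
    # 4MB chunks: the same timeout is honoured.
    print(writeback(1024 * MB, 4 * MB, 4 * MB, timeout_sec=1.0))
```

With the 128MB chunk the pass takes 32 seconds despite the 1 second
timeout; only a smaller chunk keeps it within bounds.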

This is why MAX_WRITEBACK_PAGES really needs to be tuned on a
per-bdi basis; either manually, via a sysfs tunable, or automatically,
by auto-tuning based on how fast the storage device is or by some kind
of congestion-based approach.  This is certainly the best long-term
solution; my concern was that it might take a long time for us to get
the auto-tunable just right, so in the meantime I added a
per-mounted-filesystem tunable and put the hack in the filesystem
layer.  I would like nothing better than to rip it out, once we have a
long-term solution.

Regards,

							- Ted
