Date:	Tue, 16 Feb 2010 23:30:09 -0500
From:	tytso@....edu
To:	Linus Torvalds <torvalds@...ux-foundation.org>
Cc:	Jan Kara <jack@...e.cz>, Jens Axboe <jens.axboe@...cle.com>,
	Linux Kernel <linux-kernel@...r.kernel.org>,
	jengelh@...ozas.de, stable@...nel.org, gregkh@...e.de
Subject: Re: [PATCH] writeback: Fix broken sync writeback

On Tue, Feb 16, 2010 at 07:35:35PM -0800, Linus Torvalds wrote:
> >   writeback_single_inode()
> >     ...writes 1024 pages.
> >     if we haven't written everything in the inode (more than 1024 dirty
> >     pages) we end up doing either requeue_io() or redirty_tail(). In the
> >     first case the inode is put to b_more_io list, in the second case to
> >     the tail of b_dirty list. In either case it will not receive further
> >     writeout until we go through all other members of current b_io list.
> > 
> >   So I claim we currently *do* switch to another inode after 4 MB. That
> > is a fact.
> 
> Ok, I think that's the bug. I do agree that it may well be intentional, 
> but considering the performance impact, I suspect it's been "intentional 
> without any performance numbers".
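
For reference, the behaviour Jan describes can be sketched like this (a simplified illustration, not the actual kernel source): each visit to an inode writes at most MAX_WRITEBACK_PAGES, then the inode is requeued behind every other inode on the b_io list, so a large file needs many full trips through the queue.

```c
/* Simplified sketch (not actual kernel code) of the requeue behaviour:
 * writeback_single_inode() services at most MAX_WRITEBACK_PAGES
 * (1024 pages, i.e. 4 MB with 4 KB pages) per visit, after which the
 * inode goes to the back of the line via requeue_io()/redirty_tail(). */

#define MAX_WRITEBACK_PAGES 1024    /* 4 MB in 4 KB pages */

/* How many full trips through the b_io list it takes to flush a single
 * inode with `dirty_pages` dirty pages, given that each trip writes at
 * most MAX_WRITEBACK_PAGES of it before requeueing. */
long passes_to_flush(long dirty_pages)
{
    long passes = 0;

    while (dirty_pages > 0) {
        dirty_pages -= MAX_WRITEBACK_PAGES; /* one writeback_single_inode() */
        passes++;                           /* requeue behind everyone else */
    }
    return passes;
}
```

A 1 GB dirty file (262144 pages) thus takes 256 separate trips through the queue before it is fully written out.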

This is well known amongst file system developers.  We've even raised
it from time to time, but apparently most people are too scared to
touch the writeback code.  I proposed upping the limit some six months
ago, but I got serious pushback.  As a result, I followed XFS's lead,
and so now both XFS and ext4 write more pages than the writeback
logic requests, to work around this bug.....

What we really want to do is to time how fast the device is.  If the
device is some Piece of Sh*t USB stick, then maybe you only want to
write 4MB at a time to avoid latency problems.  Heck, maybe you only
want to write 32k at a time, if it's really slow....  But if it's some
super-fast RAID array, maybe you want to write a lot more than 4MB at
a time.
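
The idea above might look something like this (a hypothetical sketch; the one-second latency budget and the clamp values are illustrative assumptions, not kernel policy):

```c
/* Hypothetical sketch: size the writeback chunk from a measured device
 * throughput so one chunk stays under a fixed latency budget. A slow
 * USB stick gets a tiny chunk, a fast RAID array a huge one. */

#define PAGE_SIZE_BYTES 4096L

/* Pages to write per chunk for a device sustaining `bytes_per_sec`,
 * targeting roughly one second of device time per chunk, clamped
 * between 32 KB (8 pages) and 1 GB (262144 pages). */
long writeback_chunk_pages(long bytes_per_sec)
{
    long pages = bytes_per_sec / PAGE_SIZE_BYTES;

    if (pages < 8)          /* really slow stick: ~32 KB at a time */
        pages = 8;
    if (pages > 262144)     /* fast array: cap the chunk at 1 GB */
        pages = 262144;
    return pages;
}
```

With this shape, a 4 MB/s stick gets the old 4 MB (1024-page) chunk, while anything faster scales up automatically.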

We've had this logic for a long time, and given the increases in disk
density and spindle speeds, the 4MB limit, which might have made
sense 10 years ago, probably doesn't make sense now.

> If it's bad for synchronous syncs, then it's bad for background syncing 
> too, and I'd rather get rid of the MAX_WRITEBACK_PAGES thing entirely - 
> since the whole latency argument goes away if we don't always honor it 
> ("Oh, we have good latency - _except_ if you do 'sync()' to synchronously 
> write something out" - that's just insane).

I tried arguing for this six months ago, and got the argument that it
might cause latency problems on slow USB sticks.  So I added a forced
override for ext4, which now writes 128MB at a time --- with a sysfs
tuning knob that allows the old behaviour to be restored if users
really complained.  No one did actually complain....
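
The shape of that override is roughly this (a simplified sketch, not the real ext4 code; max_writeback_mb_bump matches the name of the ext4 sysfs tunable of that era, default 128 MB):

```c
/* Rough sketch of the ext4 override: when the writeback logic asks for
 * fewer pages than the tunable allows, bump nr_to_write up to that
 * many pages. The surrounding writeback plumbing is omitted. */

#define PAGE_SHIFT 12   /* 4 KB pages */

/* Possibly-enlarged nr_to_write, in pages: at least
 * max_writeback_mb_bump megabytes' worth. */
long bump_nr_to_write(long nr_to_write, long max_writeback_mb_bump)
{
    long desired = max_writeback_mb_bump << (20 - PAGE_SHIFT);

    return nr_to_write < desired ? desired : nr_to_write;
}
```

So a request for the usual 1024 pages gets bumped to 32768 pages (128 MB), while a larger request passes through untouched.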

						- Ted
