linux-kernel - Re: [PATCH] writeback: Fix broken sync writeback

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20100216230017.GJ3153@quack.suse.cz>
Date:	Wed, 17 Feb 2010 00:00:18 +0100
From:	Jan Kara <jack@...e.cz>
To:	Linus Torvalds <torvalds@...ux-foundation.org>
Cc:	Jan Kara <jack@...e.cz>, Jens Axboe <jens.axboe@...cle.com>,
	Linux Kernel <linux-kernel@...r.kernel.org>,
	jengelh@...ozas.de, stable@...nel.org, gregkh@...e.de
Subject: Re: [PATCH] writeback: Fix broken sync writeback

On Mon 15-02-10 16:05:43, Linus Torvalds wrote:
> 
> 
> On Mon, 15 Feb 2010, Jan Kara wrote:
> > 
> > Personally, I don't see why VM shouldn't generate IO from a single inode
> > larger than a few MB for data integrity syncs. There are two reasons I know
> > about for MAX_WRITEBACK_PAGES:
> 
> Two issues:
>  - it shouldn't matter for performance (if you have a broken disk that 
>    wants multi-megabyte writes to get good performance, you need to fix 
>    the driver, not the VM)
  The IO size actually does matter for performance because if you switch
after 4 MB (current value of MAX_WRITEBACK_PAGES) to writing another inode,
that means a seek to a distant place and that has significant impact (seek
time are in a 10-20 ms range for a common drives so with 50 MB/s throughput
this is like 13-25% of the IO time...). A similar reason why it matters is
if the filesystem does delayed allocation - then allocation is performed
from writeback code and if it happens in 4 MB chunks, chances for
fragmentation are higher. Actually XFS, btrfs, and ext4 *workaround* the
fact that VM sends only 4 MB chunks and write more regardless of
nr_to_write set - I agree this really sucks but that's the way it is.

>  - I generally think that _latency_ is much more important than 
>    throughput, and huge writes are simply bad for latency.
  I agree that latency matters but is VM the right place to decide where to
break writes into smaller chunks for latency reasons? It knows neither
about the placement of the file nor about possible concurrent requests for
that device. So I personally believe that IO scheduler should be the layer
that should care about scheduling IO so that we have decent latecies. As
splitting bios is probably not an option, we might want to have an
artificial upper limit on bio size for latency reasons but it's still it's
way below VM...

> But the real complaint I had about your patch was that it made no sense. 
>
> Your statement that it speeds up sync is indicative of some _other_ 
> independent problem (perhaps starting from the beginning of the file each 
> time? Who knows?) and the patch _clearly_doesn't actually address the 
> underlying problem, it just changes the logic in a way that makes no 
> obvious sense to anybody, without actually explaining why it would matter.
> 
> So if you can explain _why_ that patch makes such a big difference for 
> you, and actually write that up, I wouldn't mind it. But right now it was 
> sent as a voodoo patch with no sensible explanation for why it would 
> really make any difference, since the outer loop should already do what 
> the patch does.
  OK, fair enough :). The patch is actually Jens' but looking at the code
again the nr_to_write trimming should rather happen inside of
writeback_inodes_wb, not outside of it. That would make the logic clearer
and magically also fix the problem because write_cache_pages() ignores
nr_to_write in WB_SYNC_ALL mode. But I guess it's better to not depend
on this and explicitely handle that in writeback_inodes_wb...
So I'll post a new patch with a better changelog shortly.

								Honza
-- 
Jan Kara <jack@...e.cz>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/