lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20100216230017.GJ3153@quack.suse.cz>
Date:	Wed, 17 Feb 2010 00:00:18 +0100
From:	Jan Kara <jack@...e.cz>
To:	Linus Torvalds <torvalds@...ux-foundation.org>
Cc:	Jan Kara <jack@...e.cz>, Jens Axboe <jens.axboe@...cle.com>,
	Linux Kernel <linux-kernel@...r.kernel.org>,
	jengelh@...ozas.de, stable@...nel.org, gregkh@...e.de
Subject: Re: [PATCH] writeback: Fix broken sync writeback

On Mon 15-02-10 16:05:43, Linus Torvalds wrote:
> 
> 
> On Mon, 15 Feb 2010, Jan Kara wrote:
> > 
> > Personally, I don't see why VM shouldn't generate IO from a single inode
> > larger than a few MB for data integrity syncs. There are two reasons I know
> > about for MAX_WRITEBACK_PAGES:
> 
> Two issues:
>  - it shouldn't matter for performance (if you have a broken disk that 
>    wants multi-megabyte writes to get good performance, you need to fix 
>    the driver, not the VM)
  The IO size actually does matter for performance because if you switch
after 4 MB (current value of MAX_WRITEBACK_PAGES) to writing another inode,
that means a seek to a distant place and that has significant impact (seek
time are in a 10-20 ms range for a common drives so with 50 MB/s throughput
this is like 13-25% of the IO time...). A similar reason why it matters is
if the filesystem does delayed allocation - then allocation is performed
from writeback code and if it happens in 4 MB chunks, chances for
fragmentation are higher. Actually XFS, btrfs, and ext4 *workaround* the
fact that VM sends only 4 MB chunks and write more regardless of
nr_to_write set - I agree this really sucks but that's the way it is.

>  - I generally think that _latency_ is much more important than 
>    throughput, and huge writes are simply bad for latency.
  I agree that latency matters but is VM the right place to decide where to
break writes into smaller chunks for latency reasons? It knows neither
about the placement of the file nor about possible concurrent requests for
that device. So I personally believe that IO scheduler should be the layer
that should care about scheduling IO so that we have decent latecies. As
splitting bios is probably not an option, we might want to have an
artificial upper limit on bio size for latency reasons but it's still it's
way below VM...

> But the real complaint I had about your patch was that it made no sense. 
>
> Your statement that it speeds up sync is indicative of some _other_ 
> independent problem (perhaps starting from the beginning of the file each 
> time? Who knows?) and the patch _clearly_doesn't actually address the 
> underlying problem, it just changes the logic in a way that makes no 
> obvious sense to anybody, without actually explaining why it would matter.
> 
> So if you can explain _why_ that patch makes such a big difference for 
> you, and actually write that up, I wouldn't mind it. But right now it was 
> sent as a voodoo patch with no sensible explanation for why it would 
> really make any difference, since the outer loop should already do what 
> the patch does.
  OK, fair enough :). The patch is actually Jens' but looking at the code
again the nr_to_write trimming should rather happen inside of
writeback_inodes_wb, not outside of it. That would make the logic clearer
and magically also fix the problem because write_cache_pages() ignores
nr_to_write in WB_SYNC_ALL mode. But I guess it's better to not depend
on this and explicitely handle that in writeback_inodes_wb...
So I'll post a new patch with a better changelog shortly.

								Honza
-- 
Jan Kara <jack@...e.cz>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ