linux-kernel - Re: [PATCH] writeback: Fix broken sync writeback

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20100222172938.GA2601@quack.suse.cz>
Date:	Mon, 22 Feb 2010 18:29:38 +0100
From:	Jan Kara <jack@...e.cz>
To:	Linus Torvalds <torvalds@...ux-foundation.org>
Cc:	tytso@....edu, Jan Kara <jack@...e.cz>,
	Jens Axboe <jens.axboe@...cle.com>,
	Linux Kernel <linux-kernel@...r.kernel.org>,
	jengelh@...ozas.de, stable@...nel.org, gregkh@...e.de
Subject: Re: [PATCH] writeback: Fix broken sync writeback

On Tue 16-02-10 21:16:46, Linus Torvalds wrote:
> On Tue, 16 Feb 2010, tytso@....edu wrote:
> >
> > We've had this logic for a long time, and given the increase in disk
> > density, and spindle speeds, the 4MB limit, which might have made
> > sense 10 years ago, probably doesn't make sense now.
> 
> I still don't think that 4MB is enough on its own to suck quite that 
> much. Even a fast device should be perfectly happy with 4MB IOs, or it 
> must be sucking really badly. 
> 
> In order to see the kinds of problems that got quoted in the original 
> thread, there must be something else going on too, methinks (disk light 
> was "blinking").
  <snip>
> Are we perhaps ending up in a situation where we essentially wait 
> synchronously on just the inode itself being written out? That would 
> explain the "40kB/s" kind of behavior.
> 
> If we were actually doing real 4MB chunks, that would _not_ explain 40kB/s 
> throughput.
  Yes, it is actually 400kB/s but still you're right that that seems too
low if the only problem were seeks. I was looking at the code and it's even
bigger mess than what I thought :(.

a) ext4_da_writepages returns after writing 32 MB even in WB_SYNC_ALL mode
(when 1024 is passed in nr_to_write). Writeback code kind of expects that
in WB_SYNC_ALL mode all dirty pages in the given range are written (the
same way as write_cache_pages does that).

b) because of delayed allocation, inode is redirtied during ->writepages
call and thus writeback_single_inode calls redirty_tail at it. Thus each
inode will be written at least twice (synchronously, which means a
transaction commit and a disk cache flush for each such write).

c) the livelock avoidace in writeback_inodes_wb does not work because the
function exists as soon as nr_to_write (set to 1024) gets to 0 and thus
the 'start' timestamp gets always renewed.

d) ext4_writepage never succeeds to write a page with delayed-allocated
data. So pageout() function never succeeds in cleaning a page on ext4.
I think that when other problems in writeback code make writeout slow (like
in Jan Engelhardt's case), this can bite us and I assume this might be the
reason why Jan saw kswapd active doing some work during his problems.

> But if we do a 4MB chunk (for the one file that had real dirty data in 
> it), and then do a few hundred trivial "write out the inode data 
> _synchronously_" (due to access time changes etc) in between until we hit 
> the file that has real dirty data again - now _that_ would explain 40kB/s 
> throughput. It's not just seeking around - it's not even trying to push 
> multiple IO's to get any elevator going or anything like that.
> 
> And then the patch that started this discussion makes sense: it improves 
> performance because in between those synchronous inode updates it now 
> writes big chunks. But again, it's mostly hiding us just doing insane 
> things.
  I'm not quite sure whether some of the above problems is really causing the
sync writeback to be so slow. At least problems a) and c) would be
worked-around by the patch setting nr_to_write to LONG_MAX for sync
writeback and the effect of b) would be mitigated to just two synchronous
inode writes instead of (dirty_pages + 8191) / 8192 + 1 sync writes.
  For c) I think the original patch is actually the right solution but I
agree that it just hides the other problems...

									Honza
-- 
Jan Kara <jack@...e.cz>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/