linux-kernel - Re: ext2 write performance regression from 2.6.32

Date:	Wed, 16 Feb 2011 16:40:58 +0100
From:	Jan Kara <jack@...e.cz>
To:	Feng Tang <feng.tang@...el.com>
Cc:	Jan Kara <jack@...e.cz>, "op.q.liu@...il.com" <op.q.liu@...il.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"Wu, Fengguang" <fengguang.wu@...el.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	"axboe@...nel.dk" <axboe@...nel.dk>
Subject: Re: ext2 write performance regression from 2.6.32

  Hello,

On Wed 16-02-11 10:20:55, Feng Tang wrote:
> On Tue, 15 Feb 2011 19:11:26 +0800
> Jan Kara <jack@...e.cz> wrote:
> > On Tue 15-02-11 14:46:41, Feng Tang wrote:
> > > After some debugging, here is one possible root cause for the dd
> > > performance drop between 2.6.30 and 2.6.32 (33/34/35 as well):
> > > in .30 the dd is a pure sequential operation while in .32 it isn't,
> > > and the change is related to the introduction of per-bdi flush.
> > > 
> > > I used a laptop with an SDHC controller and ran a simple dd of a
> > > double RAM size _file_ to a 1G SDHC card; the drop from .30 to .32
> > > is about 30%, from roughly 10 MB/s to 7 MB/s.
> > > 
> > > I'm not very familiar with .30/.32 code, and here is a simple
> > > analysis:
> > > 
> > > When dd writes to a big ext2 file, two types of metadata are
> > > updated besides the file data:
> > > 1. The ext2 global info like group descriptors and block bitmaps,
> > > whose buffer_head will be marked dirty in ext2_new_blocks()
> > > 2. The inode of the file being written, marked dirty in
> > > ext2_write/update_inode(), which is called by write_inode() and in
> > > the writeback path.
> > > 
> > > In 2.6.30, with the old pdflush interface, during the dd, the
> > > writeback of the 2 types of metadata is triggered from wb_timer_fn()
> > > and balance_dirty_pages(), but it is always delayed in
> > > pdflush_operation() as the pdflush_list is empty. So only the
> > > file data gets written back, in a very smooth sequential mode.
> > > 
> > > In 2.6.32, writeback is a per-bdi operation: every time the bdi
> > > for the sd card is called for flush, it will check and try to write
> > > back all the dirty pages, including both the metadata and data
> > > pages, so the previously sequential sd block access is periodically
> > > interrupted by metadata blocks, which causes the performance drop.
> > > And if I delay the metadata writeback with an ugly hack, the
> > > performance is restored to the same level as .30.
> >   Umm, interesting. 7 vs 10 MB/s is a rather big difference. For
> > non-rotating media like your SD card, I'd expect much less impact
> > of IO randomness, especially if we write in those 4 MB chunks. But we
> > are probably hit by the erase block size being big, so the FTL has
> > to do a lot of work.
> Yes, the impact is rather big; the original report from Kyle was a drop
> from 18 MB/s to 3 MB/s, and even a 35% drop on a SATA disk.
> 
> > 
> > What might happen is that flusher thread competes with the process
> > doing writeback from balance_dirty_pages(). There are basically two
> > dirty inodes in the bdi in your test case - the file you write and
> > the device inode. So while one task flushes the file data pages, the
> > other task has no other choice but flush the device inode. But I'd
> > expect this to happen with pdflush as well. Can you send me raw block
> > traces from both kernels so that I can have a look? Thanks.
> 
> The logs are big, so I put the log for .30 and .32 as attachments.
  Thanks for the logs. So indeed what happens is that with 2.6.32, the
flusher thread competes with dd doing writeout. So one of the processes is
writing out the file's data and the other gets the device inode with
metadata. Thus the resulting IO is a mix of data and metadata and is
unnecessarily seeky.
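
A toy model of that effect, purely as an illustration: it is not derived
from the attached traces, and the sector numbers, the 4 MB chunk size in
512-byte sectors, and the metadata interval are all made-up assumptions.
It counts how many writes do not start where the previous write ended:

```python
def count_seeks(writes):
    """Count writes that do not start where the previous one ended.

    Each entry in `writes` is (start_sector, length_in_sectors).
    """
    seeks = 0
    expected = None
    for start, length in writes:
        if expected is not None and start != expected:
            seeks += 1
        expected = start + length
    return seeks


def data_stream(first_sector, chunk, n):
    """n back-to-back data writes of `chunk` sectors each (hypothetical layout)."""
    return [(first_sector + i * chunk, chunk) for i in range(n)]


def interleave(data, meta_sector, every):
    """Insert an 8-sector metadata write after every `every` data writes."""
    out = []
    for i, write in enumerate(data, 1):
        out.append(write)
        if i % every == 0:
            out.append((meta_sector, 8))
    return out


# 16 chunks of 4 MB each (8192 sectors of 512 bytes), assumed contiguous.
data = data_stream(first_sector=100_000, chunk=8192, n=16)
sequential = count_seeks(data)
mixed = count_seeks(interleave(data, meta_sector=2048, every=4))
print("seeks, data only:", sequential)
print("seeks, data + metadata:", mixed)
```

Every metadata write that lands in the middle of the stream costs two
seeks (one away from the data, one back), which is the pattern a single
sequential writer would avoid.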

In 2.6.30, pdflush seemed to stay away from the bdi for most of the time
and dd did all the writeback. I'm not sure why that happened because the
code was not designed that way (and I have seen several loads where what
happened above with flusher thread happened with pdflush as well). It is
probably something specific to that kind of load and machine. Anyway, not
too important now since pdflush is dead ;).

To solve exactly this kind of problem, we decided to leave as much IO as
possible to the flusher thread (in particular, to avoid doing IO from
balance_dirty_pages()). I have experimental patches to do that, so if you'd
be willing to try them out, you are welcome. The patches are attached.

								Honza
-- 
Jan Kara <jack@...e.cz>
SUSE Labs, CR

View attachment "0001-writeback-account-per-bdi-accumulated-written-pages.patch" of type "text/x-patch" (2716 bytes)

View attachment "0002-mm-Properly-reflect-task-dirty-limits-in-dirty_excee.patch" of type "text/x-patch" (3652 bytes)

View attachment "0003-mm-Implement-IO-less-balance_dirty_pages.patch" of type "text/x-patch" (19730 bytes)

View attachment "0004-mm-Remove-low-limit-from-sync_writeback_pages.patch" of type "text/x-patch" (1621 bytes)

View attachment "0005-mm-Autotune-interval-between-distribution-of-page-co.patch" of type "text/x-patch" (9086 bytes)