Message-ID: <BANLkTinndb4oHi6kNQrTsL86UrKDB=2rK6-gA8FM1Nj8CKYZ+w@mail.gmail.com>
Date: Fri, 1 Jul 2011 15:55:33 -0700
From: Curt Wohlgemuth <curtw@...gle.com>
To: Jan Kara <jack@...e.cz>
Cc: Christoph Hellwig <hch@...radead.org>,
Al Viro <viro@...iv.linux.org.uk>,
linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
fengguang.wu@...el.com
Subject: Re: [PATCH] writeback: Don't wait for completion in writeback_inodes_sb_nr
Hi Jan:
On Wed, Jun 29, 2011 at 12:15 PM, Jan Kara <jack@...e.cz> wrote:
> On Wed 29-06-11 13:55:34, Christoph Hellwig wrote:
>> On Wed, Jun 29, 2011 at 06:57:14PM +0200, Jan Kara wrote:
>> > > For sys_sync I'm pretty sure we could simply remove the
>> > > writeback_inodes_sb call and get just as good if not better performance,
>> > Actually, it won't with current code. Because WB_SYNC_ALL writeback
>> > currently has the peculiarity that it looks like:
>> > for all inodes {
>> >     write all inode data
>> >     wait for inode data
>> > }
>> > while to achieve good performance we actually need something like
>> > for all inodes
>> >     write all inode data
>> > for all inodes
>> >     wait for inode data
>> > It can make an order-of-magnitude difference when there are lots of
>> > smallish files - SLES had a bug like this, so I know from user reports ;)
>>
>> I don't think that's true. The WB_SYNC_ALL writeback is done using
>> sync_inodes_sb, which operates as:
>>
>> for all dirty inodes in bdi:
>>     if inode belongs to sb
>>         write all inode data
>>
>> for all inodes in sb:
>>     wait for inode data
>>
>> We still do that in a big for-each-sb loop, though.
> True but writeback_single_inode() has in it:
>         if (wbc->sync_mode == WB_SYNC_ALL) {
>                 int err = filemap_fdatawait(mapping);
>                 if (ret == 0)
>                         ret = err;
>         }
> So we end up waiting much earlier. We should probably remove this wait,
> but that will need some auditing, I guess.
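To make the ordering concrete, here's a rough userspace analogue of the
two patterns. The file list is made up and error handling is omitted;
sync_file_range() is the Linux-only call that lets you separate "start
writeback" from "wait for writeback":

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    #define NFILES 3

    /* Hypothetical paths, for illustration only. */
    static const char *paths[NFILES] = { "/tmp/a", "/tmp/b", "/tmp/c" };

    int main(void)
    {
            int fds[NFILES];
            int i;

            for (i = 0; i < NFILES; i++)
                    fds[i] = open(paths[i], O_RDWR);

            /* Slow pattern: write and wait one inode at a time; each
             * fdatasync() stalls until that file's I/O completes. */
            for (i = 0; i < NFILES; i++)
                    fdatasync(fds[i]);

            /* Fast pattern: kick off writeback on every file first... */
            for (i = 0; i < NFILES; i++)
                    sync_file_range(fds[i], 0, 0, SYNC_FILE_RANGE_WRITE);

            /* ...then wait for all of it, so the I/O can overlap. */
            for (i = 0; i < NFILES; i++)
                    sync_file_range(fds[i], 0, 0, SYNC_FILE_RANGE_WAIT_AFTER);

            for (i = 0; i < NFILES; i++)
                    close(fds[i]);
            return 0;
    }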
So today for WB_SYNC_ALL from sync_inodes_sb(), we do:
 - queue a work item; this will
     - write all dirty inodes in the sb; for each inode, that means
         - write that inode's pages
         - wait on that inode's pages
 - wait for the work item
 - wait on all inodes in the sb (wait_sb_inodes())
I guess the point of wait_sb_inodes() is to wait on all inodes that
were written in a previous writeback pass.
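For reference, sync_inodes_sb() is structured roughly like this (I'm
paraphrasing from memory of fs/fs-writeback.c, so treat the field names
as approximate):

    void sync_inodes_sb(struct super_block *sb)
    {
            DECLARE_COMPLETION_ONSTACK(done);
            struct wb_writeback_work work = {
                    .sb        = sb,
                    .sync_mode = WB_SYNC_ALL,
                    .nr_pages  = LONG_MAX,
                    .done      = &done,
            };

            /* Hand the sb to the per-bdi flusher thread... */
            bdi_queue_work(sb->s_bdi, &work);
            /* ...block until it has pushed out all dirty inodes... */
            wait_for_completion(&done);
            /* ...then wait again on every inode under writeback. */
            wait_sb_inodes(sb);
    }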
One other issue I have with sync as it's structured is that we don't
do a WB_SYNC_ALL pass on any inode that's only associated with a block
device, and not with a mounted filesystem. Blockdev mounts are
pseudo-mounts, and are explicitly skipped in __sync_filesystem(). So
if you've written directly to a block device and then do a sync, the
only pass over that inode's pages is via wakeup_flusher_threads() --
which operates on a BDI, regardless of the superblock, and uses
WB_SYNC_NONE.
All the sync_filesystem() calls are per-sb, not per-BDI, and they'll
exclude pseudo-superblocks.
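Until that changes, the only reliable approach I know of from userspace
is to fsync() the block device directly: fsync() on the bdev inode does
a write-and-wait pass over its pages, which sync(2) currently doesn't
guarantee. A minimal sketch (the device path is just an example):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
            /* Substitute whatever device you actually wrote to. */
            int fd = open("/dev/sda", O_WRONLY);

            if (fd < 0) {
                    perror("open");
                    return 1;
            }
            if (fsync(fd) < 0) {    /* write and wait on the bdev inode */
                    perror("fsync");
                    return 1;
            }
            close(fd);
            return 0;
    }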
I've seen cases in our modified kernels here at Google in which
lilo/shutdown failed because of a lack of WB_SYNC_ALL writeback for
/dev/sda (though I haven't been able to come up with a consistent test
case, nor reproduce this on an upstream kernel).
Thanks,
Curt
>
>> > You mean that sync(1) would actually write the data itself? It would
>> > certainly make some things simpler, but it has its problems as well - for
>> > example, sync racing with the flusher thread writing back inodes can
>> > create a rather bad IO pattern...
>>
>> Only the second pass. The idea is that we first try to use the flusher
>> threads for good I/O patterns, but if we can't get that to work, only
>> block the caller and not everyone. But that's just an idea so far;
>> it would need serious benchmarking. And despite what I claimed before,
>> we actually do the wait in the caller context already anyway, which
>> gives you the easy part of the above effect.
> Modulo the writeback_single_inode() wait. But if that is dealt with I
> agree.
>
> Honza
> --
> Jan Kara <jack@...e.cz>
> SUSE Labs, CR
>