linux-kernel - Re: [PATCH] writeback: Don't wait for completion in writeback_inodes_sb

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20110711170042.GE5482@quack.suse.cz>
Date:	Mon, 11 Jul 2011 19:00:42 +0200
From:	Jan Kara <jack@...e.cz>
To:	Curt Wohlgemuth <curtw@...gle.com>
Cc:	Jan Kara <jack@...e.cz>, Christoph Hellwig <hch@...radead.org>,
	Al Viro <viro@...iv.linux.org.uk>,
	linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
	fengguang.wu@...el.com
Subject: Re: [PATCH] writeback: Don't wait for completion in
 writeback_inodes_sb_nr

  Hi Curt,

On Fri 01-07-11 15:55:33, Curt Wohlgemuth wrote:
> On Wed, Jun 29, 2011 at 12:15 PM, Jan Kara <jack@...e.cz> wrote:
> > On Wed 29-06-11 13:55:34, Christoph Hellwig wrote:
> >> On Wed, Jun 29, 2011 at 06:57:14PM +0200, Jan Kara wrote:
> >> > > For sys_sync I'm pretty sure we could simply remove the
> >> > > writeback_inodes_sb call and get just as good if not better performance,
> >> >   Actually, it won't with current code. Because WB_SYNC_ALL writeback
> >> > currently has the peculiarity that it looks like:
> >> >   for all inodes {
> >> >     write all inode data
> >> >     wait for inode data
> >> >   }
> >> > while to achieve good performance we actually need something like
> >> >   for all inodes
> >> >     write all inode data
> >> >   for all inodes
> >> >     wait for inode data
> >> > It makes a difference in an order of magnitude when there are lots of
> >> > smallish files - SLES had a bug like this so I know from user reports ;)
> >>
> >> I don't think that's true.  The WB_SYNC_ALL writeback is done using
> >> sync_inodes_sb, which operates as:
> >>
> >>   for all dirty inodes in bdi:
> >>      if inode belongs to sb
> >>         write all inode data
> >>
> >>   for all inodes in sb:
> >>      wait for inode data
> >>
> >> we still do that in a big for each sb loop, though.
> >  True but writeback_single_inode() has in it:
> >        if (wbc->sync_mode == WB_SYNC_ALL) {
> >                int err = filemap_fdatawait(mapping);
> >                if (ret == 0)
> >                        ret = err;
> >        }
> >  So we end up waiting much earlier. Probably we should remove this wait
> > but that will need some audit I guess.
> 
> So today for WB_SYNC_ALL from sync_inodes_sb(), we do:
>   - queue a work item; this will
>     - write all dirty inodes in a sb
>       - write one inode's pages
>       - wait on all inode's pages
>   - wait for the work item
>   - wait on all inodes in the sb (wait_sb_inodes)
> 
> I guess the point of wait_sb_inodes() is to wait on all inodes that
> were written in a previous writeback pass.
  Yes. As we discussed with Christoph, waiting for each inode could be
avoided in principle but we have to be careful not to break data
consistency guarantees in some other path calling
writeback_single_inode()...

> One other issue I have with sync as it's structured is that we don't
> do a WB_SYNC_ALL pass on any inode that's only associated with a block
> device, and not on a mounted filesystem.  Blockdev mounts are
> pseudo-mounts, and are explicitly skipped in __sync_filesystem().  So
> if you've written directly to a block device and do a sync, the only
> pass over the pages for this inode are via the
> wakeup_flusher_threads() -- which operates on a BDI, regardless of the
> superblock, and uses WB_SYNC_NONE.
> 
> All the sync_filesystem() calls are per-sb, not per-BDI, and they'll
> exclude pseudo-superblocks.
> 
> I've seen cases in our modified kernels here at Google in which
> lilo/shutdown failed because of a lack of WB_SYNC_ALL writeback for
> /dev/sda (though I haven't been able to come up with a consistent test
> case, nor reproduce this on an upstream kernel).
  Ho hum, I wonder what would be the cleanest way to fix this. I guess we
could make an exception for 'bdev' superblock in the code but it's ugly.

								Honza
-- 
Jan Kara <jack@...e.cz>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/