Message-ID: <20101119004552.GF5004@quack.suse.cz>
Date:	Fri, 19 Nov 2010 01:45:52 +0100
From:	Jan Kara <jack@...e.cz>
To:	Andrew Morton <akpm@...ux-foundation.org>
Cc:	Nick Piggin <npiggin@...nel.dk>, Ted Ts'o <tytso@....edu>,
	Eric Sandeen <sandeen@...hat.com>, Jan Kara <jack@...e.cz>,
	linux-fsdevel@...r.kernel.org, linux-ext4@...r.kernel.org,
	linux-btrfs@...r.kernel.org
Subject: Re: [patch] fix up lock order reversal in writeback

On Wed 17-11-10 22:28:34, Andrew Morton wrote:
> I'm not sure that s_umount versus i_mutex has come up before.
> 
> Logically I'd expect i_mutex to nest inside s_umount.  Because s_umount
> is a per-superblock thing, and i_mutex is a per-file thing, and files
> live under superblocks.  Nesting s_umount outside i_mutex creates
> complex deadlock graphs between the various i_mutexes, I think.
> 
> Someone tell me if btrfs has the same bug, via its call to
> writeback_inodes_sb_nr_if_idle()?
> 
> I don't see why these functions need s_umount at all, if they're called
> from within ->write_begin against an inode on that superblock.  If the
> superblock can get itself disappeared while we're running ->write_begin
> on it, we have problems, no?
  As I wrote to Chris, the function just needs exclusion from umount /
remount (and wants to stop umount from returning EBUSY while the
writeback thread is writing something out). When the function is called
from ->write_begin this is not an issue, as you correctly noted, so s_umount
is not needed in that particular case.

> In which case I'd suggest just removing the down_read(s_umount) and
> specifying that the caller must pin the superblock via some means.
  Possibly, but the current advantage is that we can have a WARN_ON in the
writeback code that complains if someone starts writeback without a properly
pinned superblock, and we cannot easily do that with your general rule. I'm
not saying that should stop us from changing the rule, but it was kind of
nice.

> Only we can't do that because we need to hold s_umount until the
> bdi_queue_work() worker has done its work.
> 
> The fact that a call to ->write_begin can randomly return with s_umount
> held, to be randomly released at some random time in the future is a
> bit ugly, isn't it?  write_begin is a pretty low-level, per-inode
> thing.
  I guess you missed that writeback_inodes_sb_nr() (called from _if_idle
variants) does:
        bdi_queue_work(sb->s_bdi, &work);
        wait_for_completion(&done);
  So we return only after all the IO has been submitted, and we unlock
s_umount in writeback_inodes_sb_if_idle(). And we cannot really submit the IO
ourselves because we are holding i_mutex and we need to get and put references
to other inodes while doing writeback (those would be really horrible lock
dependencies - the writeback thread can put the last reference to an unlinked
inode...).
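
To make the queue-and-wait shape above concrete, here is a minimal userspace
sketch. It is only an analogue: pthreads stand in for the flusher thread and
the kernel completion, and every name in it (struct wb_work,
completion_signal(), writeback_and_wait(), ...) is hypothetical rather than
the actual fs-writeback code.

```c
/* Userspace analogue only; not kernel code. All names are hypothetical. */
#include <pthread.h>
#include <stdbool.h>

/* Stand-in for a kernel completion: a flag guarded by a mutex/condvar. */
struct completion {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	bool done;
};

struct wb_work {
	struct completion done;
	int pages_to_write;
	int pages_submitted;	/* filled in by the worker */
};

void completion_init(struct completion *c)
{
	pthread_mutex_init(&c->lock, NULL);
	pthread_cond_init(&c->cond, NULL);
	c->done = false;
}

void completion_signal(struct completion *c)
{
	pthread_mutex_lock(&c->lock);
	c->done = true;
	pthread_cond_broadcast(&c->cond);
	pthread_mutex_unlock(&c->lock);
}

void completion_wait(struct completion *c)
{
	pthread_mutex_lock(&c->lock);
	while (!c->done)	/* loop guards against spurious wakeups */
		pthread_cond_wait(&c->cond, &c->lock);
	pthread_mutex_unlock(&c->lock);
}

/* Worker: pretends to submit the IO, then signals the waiter. */
void *flusher_thread(void *arg)
{
	struct wb_work *work = arg;

	work->pages_submitted = work->pages_to_write;
	completion_signal(&work->done);
	return NULL;
}

/*
 * Hand the work to the worker and block until it reports completion.
 * Any lock the caller holds stays held for the whole wait, which is
 * why the lock ordering problem does not go away.
 * Returns the number of pages submitted, or -1 on thread creation failure.
 */
int writeback_and_wait(int nr_pages)
{
	struct wb_work work = { .pages_to_write = nr_pages };
	pthread_t tid;

	completion_init(&work.done);
	if (pthread_create(&tid, NULL, flusher_thread, &work) != 0)
		return -1;
	completion_wait(&work.done);
	pthread_join(tid, NULL);
	return work.pages_submitted;
}
```

The point of the sketch is only that the caller does not return until the
worker has signalled, so whatever the caller locked before queueing is still
locked while the worker runs. Build with -pthread.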

In fact, as I'm speaking about it, pushing things to the writeback thread and
waiting on the work does not help a bit with the locking issues (we didn't
wait for the work previously but that had other issues). Bug, sigh.

What might be a better interface for use cases like the above is to allow
the filesystem to kick the flusher thread to start doing background writeback
(regardless of dirty limits). Then the caller can wait for some delayed
allocation reservations to get freed (easy enough to check in
->writepage() and wake the waiters) - possibly with a reasonable timeout
so that we don't stall forever.
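
A rough userspace sketch of that kick-then-wait-with-timeout idea follows.
Again this is only an analogue under stated assumptions: a pthread plays the
flusher, a counter plays the outstanding delayed allocation reservations, and
all names (struct resv_pool, resv_release(), resv_wait(),
background_flusher()) are made up for illustration, not a proposed kernel API.

```c
/* Userspace analogue only; not kernel code. All names are hypothetical. */
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <time.h>

struct resv_pool {
	pthread_mutex_t lock;
	pthread_cond_t freed;	/* signalled whenever reservations drop */
	long reserved;		/* outstanding reservations */
};

void resv_pool_init(struct resv_pool *p, long reserved)
{
	pthread_mutex_init(&p->lock, NULL);
	pthread_cond_init(&p->freed, NULL);
	p->reserved = reserved;
}

/* Writeout side (the ->writepage() analogue): free and wake waiters. */
void resv_release(struct resv_pool *p, long nr)
{
	pthread_mutex_lock(&p->lock);
	p->reserved -= nr;
	pthread_cond_broadcast(&p->freed);
	pthread_mutex_unlock(&p->lock);
}

/*
 * Wait until at most 'target' reservations remain, or until timeout_ms
 * has elapsed. Returns 0 on success, ETIMEDOUT otherwise.
 */
int resv_wait(struct resv_pool *p, long target, long timeout_ms)
{
	struct timespec deadline;
	int err = 0;

	/* pthread_cond_timedwait() takes an absolute CLOCK_REALTIME time. */
	clock_gettime(CLOCK_REALTIME, &deadline);
	deadline.tv_sec += timeout_ms / 1000;
	deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
	if (deadline.tv_nsec >= 1000000000L) {
		deadline.tv_sec++;
		deadline.tv_nsec -= 1000000000L;
	}

	pthread_mutex_lock(&p->lock);
	while (p->reserved > target && err == 0)
		err = pthread_cond_timedwait(&p->freed, &p->lock, &deadline);
	if (p->reserved <= target)	/* freed just as the timeout hit */
		err = 0;
	pthread_mutex_unlock(&p->lock);
	return err;
}

/* Flusher analogue: keeps writing out until nothing is reserved. */
void *background_flusher(void *arg)
{
	struct resv_pool *p = arg;

	for (;;) {
		long left;

		pthread_mutex_lock(&p->lock);
		left = p->reserved;
		pthread_mutex_unlock(&p->lock);
		if (left <= 0)
			break;
		resv_release(p, 1);
	}
	return NULL;
}
```

The waiter here never holds anything the flusher needs, and the timeout
bounds the stall if the flusher makes no progress. Build with -pthread.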

								Honza
-- 
Jan Kara <jack@...e.cz>
SUSE Labs, CR