Message-ID: <20160219215811.GA17997@ZenIV.linux.org.uk>
Date: Fri, 19 Feb 2016 21:58:11 +0000
From: Al Viro <viro@...IV.linux.org.uk>
To: Tejun Heo <tj@...nel.org>
Cc: Jan Kara <jack@...e.cz>, Tahsin Erdogan <tahsin@...gle.com>,
Jens Axboe <axboe@...nel.dk>, cgroups@...r.kernel.org,
Theodore Ts'o <tytso@....edu>,
Nauman Rafique <nauman@...gle.com>,
linux-kernel@...r.kernel.org, Jan Kara <jack@...e.com>
Subject: Re: [PATCH block/for-4.5-fixes] writeback: keep superblock pinned
during cgroup writeback association switches

On Fri, Feb 19, 2016 at 03:51:47PM -0500, Tejun Heo wrote:
> I see, I suppose that's what distinguishes s_active and s_umount
> usages - whether pinning should block umounting?

???

->s_active is a plain and simple count of "I hold a long-term reference to
this superblock, don't you shut it down until I drop that".

->s_umount is held across some of the transitions in the struct super_block
life cycle, including the actual process of shutdown.
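
To illustrate the difference (a simplified sketch of the two idioms, not
the exact fs/super.c code; locking subtleties and error handling omitted):

/* long-term pin: the fs won't be shut down until we drop this */
if (atomic_inc_not_zero(&sb->s_active)) {
	/* ... use the filesystem for as long as we like ... */
	deactivate_super(sb);	/* may end up doing the shutdown itself */
}

/* short-term exclusion: umount/remount can't make progress meanwhile */
down_read(&sb->s_umount);
if (sb->s_root) {		/* still alive? */
	/* ... do work that must not race with shutdown ... */
}
up_read(&sb->s_umount);

The first keeps the fs around until the reference is dropped; the second
only blocks the shutdown path for as long as the rwsem is held.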
> > If you need details on s_active/s_umount/etc., I can give you a braindump,
> > but I suspect your real question is a lot more specific. Details, please...
>
> So, the problem is that the cgroup writeback path sometimes schedules a
> work item to change the cgroup an inode is associated with. Currently,
> only the inode is pinned and the underlying sb may go away while the
> work item is still pending. The work item performs iput() at the end
> and that explodes if the underlying sb is already gone.
>
> As the writeback path relies on s_umount for synchronization anyway, I
> think that'd be the most natural way to hold onto the sb, but
> unfortunately there's no way to pass on the down_read to the async
> execution context, so I made it grab s_active, which worked fine but
> it made the sb hang around until such work items are finished. It's
> an unlikely race to hit but still broken.
>
> The last option would be canceling / flushing these work items from the
> sb shutdown path, which is likely more involved.
>
> What should it be doing?

Um... What ordering requirements do you have? You obviously shouldn't
let it continue past the shutdown - as a matter of fact, you *can't* let
it continue past generic_shutdown_super(), since any inode references
still held at evict_inodes() time will make it very unhappy. Attempts to do
any IO after that will make things a lot worse than unhappy - the data
structures needed to do it might be gone (and if you hold on a bit longer,
the filesystem driver itself might very well be gone, along with the
functions you were going to call).
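
For reference, the shutdown path goes roughly like this (heavily condensed
from generic_shutdown_super() in fs/super.c; a lot is omitted):

void generic_shutdown_super(struct super_block *sb)
{
	if (sb->s_root) {
		shrink_dcache_for_umount(sb);
		sync_filesystem(sb);	/* last chance to do IO sanely */
		sb->s_flags &= ~MS_ACTIVE;
		evict_inodes(sb);	/* any inode reference still held
					 * past this point means trouble */
		if (sb->s_op->put_super)
			sb->s_op->put_super(sb); /* fs-private data torn down */
		/* leftovers here => "Busy inodes after unmount" whining */
	}
	/* ... and eventually the fs module itself may get unloaded */
}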

Grabbing ->s_active is a seriously bad idea for another reason - in
a situation where there's only one mount of the given fs, a plain umount()
should _not_ return 0 before fs shutdown is over. Sure, it is possible that
there's a binding somewhere, or that it's a lazy umount, etc., but those are
"you've asked for it" situations; having a plain umount of e.g. ext3 on a
USB stick return success before it is safe to pull that stick out is a
Bloody Bad Idea, for obvious usability reasons.

IOW, while fs shutdown may be async, making it *always* async would be a bad
bug. And bumping ->s_active does just that.
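
That follows directly from the way the last reference gets dropped - roughly
(condensed from deactivate_locked_super() in fs/super.c):

void deactivate_locked_super(struct super_block *s)
{
	if (atomic_dec_and_test(&s->s_active)) {
		/* last reference gone: the real shutdown runs right here,
		 * in whatever context happened to drop it */
		s->s_type->kill_sb(s);
		/* ... module reference dropped, super freed, etc. */
	} else {
		/* somebody (e.g. your work item) still holds ->s_active;
		 * umount(2) returns now and the shutdown happens later,
		 * whenever that last reference finally gets dropped */
		up_write(&s->s_umount);
	}
}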

I'd go for a trylock inside that work + making generic_shutdown_super()
kill all such works. I assume that such a work *can* be abandoned in a
situation where we know that sync_filesystem() is about to be called and
that said sync_filesystem() won't, in turn, schedule any such works, of
course...
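
Something like this, that is (a sketch only - the context struct, the work
function shape and the flush helper name are made up for illustration, not
what's in the tree):

struct switch_ctx {			/* hypothetical context struct */
	struct inode		*inode;	/* pinned with ihold() at queue time */
	struct work_struct	work;
};

static void switch_work_fn(struct work_struct *work)
{
	struct switch_ctx *ctx = container_of(work, struct switch_ctx, work);
	struct inode *inode = ctx->inode;
	struct super_block *sb = inode->i_sb;

	if (!down_read_trylock(&sb->s_umount))
		goto out;	/* shutdown in progress - abandon the switch */
	if (sb->s_root) {
		/* ... actually move the inode to the new wb/cgroup ... */
	}
	up_read(&sb->s_umount);
out:
	iput(inode);
	kfree(ctx);
}

/* plus, in generic_shutdown_super(), before evict_inodes(): */
	flush_pending_wb_switches(sb);	/* hypothetical helper - wait for (or
					 * cancel) every switch work queued
					 * against this sb */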