[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20190327172254.GC6742@quack2.suse.cz>
Date: Wed, 27 Mar 2019 18:22:54 +0100
From: Jan Kara <jack@...e.cz>
To: Al Viro <viro@...iv.linux.org.uk>
Cc: Dave Chinner <david@...morbit.com>,
Linus Torvalds <torvalds@...ux-foundation.org>,
syzbot <syzbot+7a8ba368b47fdefca61e@...kaller.appspotmail.com>,
Alexei Starovoitov <ast@...nel.org>,
Daniel Borkmann <daniel@...earbox.net>,
linux-fsdevel <linux-fsdevel@...r.kernel.org>,
Linux List Kernel Mailing <linux-kernel@...r.kernel.org>,
syzkaller-bugs <syzkaller-bugs@...glegroups.com>,
Jan Kara <jack@...e.cz>, Jaegeuk Kim <jaegeuk@...nel.org>,
Joel Becker <jlbec@...lplan.org>, Mark Fasheh <mark@...heh.com>
Subject: Re: KASAN: use-after-free Read in path_lookupat
On Mon 25-03-19 23:02:11, Al Viro wrote:
> On Tue, Mar 26, 2019 at 09:48:23AM +1100, Dave Chinner wrote:
>
> > And when it comes to VFS inode reclaim, XFS does not implement
> > ->evict_inode because there is nothing at the VFS level to do.
> > And ->destroy_inode ends up doing cleanup work (e.g. freeing on-disk
> > inodes) which is non-trivial, blocking work, but then still requires
> > the struct xfs_inode to be written back to disk before it can bei
> > freed. So it just gets marked "reclaimable" and background reclaim
> > then takes care of it from there so we avoid synchronous IO in inode
> > reclaim...
> >
> > This works because don't track dirty inode metadata in the VFS
> > writeback code (it's tracked with much more precision in the XFS log
> > infrastructure) and we don't write back inodes from the VFS
> > infrastructure, either. It's all done based on internal state
> > outside the VFS.
> >
> > And, because of this, the VFS cannot assume that it can free
> > the struct inode after calling ->destroy_inode or even use
> > call_rcu() to run a filesystem destructor because the filesystem
> > may need to do work that needs to block and that's not allowed in an
> > RCU callback...
>
> In Linus' patch that's what you get with non-NULL ->destroy_inode
> + NULL ->destroy_inode_rcu, so XFS won't be screwed by that.
> Said that, yes, XFS adds another fun twist there (AFAICS, it's
> the only in-tree filesystem that pulls that off).
>
> I would really like some comments from f2fs and ocfs2 folks, as well
> as Jan - he's had much more recent contact with writeback code than
> I have... Could somebody explain what's going on in f2fs and ocfs2
> ->drop_inode()? It _should_ be just a predicate; looks like both
> are playing very odd games to work around writeback problems and
> I wonder if there's a cleaner solution for that. I can try and dig
> through maillist(s) archives, but I would really appreciate it
> if somebody could give a braindump on the issues dealt with in there...
OCFS2 is discussed elsewhere and should be relatively easy to deal with.
F2FS seems harder. The problem is that AFAICS they get inode references
from their garbage collection code which can get called during page
writeback. And then they need to drop these references and they can be the
last ones to hold the inode reference for an unlinked inode forcing flush
worker into inode cleanup. Which generally causes problems and was the
reason why writeback code does not take inode references but relies on
I_SYNC to protect it from inode reclaim instead (see commit 169ebd90131b
"writeback: Avoid iput() from flusher thread"). And they noticed the
problem as well and hacked around it... Now I don't know enough about F2FS
and its garbage collection to tell how they can avoid dropping inode
references from flush worker context. But that's the right solution for
avoiding deadlocks.
Honza
--
Jan Kara <jack@...e.com>
SUSE Labs, CR
Powered by blists - more mailing lists