[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20231127-lahmlegen-redeverbot-43de4b46063c@brauner>
Date: Mon, 27 Nov 2023 17:33:36 +0100
From: Christian Brauner <brauner@...nel.org>
To: Al Viro <viro@...iv.linux.org.uk>
Cc: Gabriel Krisman Bertazi <gabriel@...sman.be>,
Linus Torvalds <torvalds@...ux-foundation.org>, tytso@....edu,
linux-f2fs-devel@...ts.sourceforge.net, ebiggers@...nel.org,
linux-fsdevel@...r.kernel.org, jaegeuk@...nel.org,
linux-ext4@...r.kernel.org,
"Eric W. Biederman" <ebiederm@...ssion.com>,
Miklos Szeredi <miklos@...redi.hu>
Subject: Re: fun with d_invalidate() vs. d_splice_alias() was Re: [f2fs-dev]
[PATCH v6 0/9] Support negative dentries on case-insensitive ext4 and f2fs
On Mon, Nov 27, 2023 at 06:38:42AM +0000, Al Viro wrote:
> On Sun, Nov 26, 2023 at 06:41:41PM +0000, Al Viro wrote:
>
> > d_invalidate() situation is more subtle - we need to sort out its interplay
> > with d_splice_alias().
> >
> > More concise variant of the scenario in question:
> > * we have /mnt/foo/bar and a lot of its descendents in dcache on client
> > * server does a rename, after which what used to be /mnt/foo/bar is /mnt/foo/baz
> > * somebody on the client does a lookup of /mnt/foo/bar and gets told by
> > the server that there's no directory with that name anymore.
> > * that somebody hits d_invalidate(), unhashes /mnt/foo/bar and starts
> > evicting its descendents
> > * We try to mount something on /mnt/foo/baz/blah. We look up baz, get
> > an fhandle and notice that there's a directory inode for it (/mnt/foo/bar).
> > d_splice_alias() picks the bugger and moves it to /mnt/foo/baz, rehashing
> > it in process, as it ought to. Then we find /mnt/foo/baz/blah in dcache and
> > mount on top of it.
> > * d_invalidate() finishes shrink_dcache_parent() and starts hunting for
> > submounts to dissolve. And finds the mount we'd done. Which mount quietly
> > disappears.
> >
> > Note that from the server POV the thing had been moved quite a while ago.
> > No server-side races involved - all it seeem is a couple of LOOKUP in the
> > same directory, one for the old name, one for the new.
> >
> > On the client on the mounter side we have an uneventful mount on /mnt/foo/baz,
> > which had been there on server at the time we started and which remains in
> > place after mount we'd created suddenly disappears.
> >
> > For the thread that ended up calling d_invalidate(), they'd been doing e.g.
> > stat on a pathname that used to be there a while ago, but currently isn't.
> > They get -ENOENT and no indication that something odd might have happened.
> >
> > >From ->d_revalidate() point of view there's also nothing odd happening -
> > dentry is not a mountpoint, it stays in place until we return and there's
> > no directory entry with that name on in its parent. It's as clear-cut
> > as it gets - dentry is stale.
> >
> > The only overlap happening there is d_splice_alias() hitting in the middle
> > of already started d_invalidate().
> >
> > For a while I thought that ff17fa561a04 "d_invalidate(): unhash immediately"
> > and 3a8e3611e0ba "d_walk(): kill 'finish' callback" might have something
> > to do with it, but the same problem existed prior to that.
> >
> > FWIW, I suspect that the right answer would be along the lines of
> > * if d_splice_alias() does move an exsiting (attached) alias in
> > place, it ought to dissolve all mountpoints in subtree being moved.
> > There might be subtleties, but in case when that __d_unalias() happens
> > due to rename on server this is definitely the right thing to do.
> > * d_invalidate() should *NOT* do anything with dentry that
> > got moved (including moved by d_splice_alias()) from the place we'd
> > found it in dcache. At least d_invalidate() done due to having
> > ->d_revalidate() return 0.
> > * d_invalidate() should dissolve all mountpoints in the
> > subtree that existed when it got started (and found the victim
> > still unmoved, that is). It should (as it does) prevent any
> > new mountpoints added in that subtree, unless the mountpoint
> > to be had been moved (spliced) out. What it really shouldn't
> > do is touch the mountpoints that are currently outside of it
> > due to moves.
> >
> > I'm going to look around and see if we have any weird cases where
> > d_splice_alias() is used for things like "correct the case of
> > dentry name on a case-mangled filesystem" - that would presumably
> > not want to dissolve any submounts. I seem to recall seeing
> > some shite of that sort, but that was a long time ago.
> >
> > Eric, Miklos - it might be a good idea if you at least took a
> > look at whatever comes out of that (sub)thread; I'm trying to
> > reconstruct the picture, but the last round of serious reworking
> > of that area had been almost 10 years ago and your recollections
> > of the considerations back then might help. I realize that they
> > are probably rather fragmentary (mine definitely are) and any
> > analysis will need to be redone on the current tree, but...
>
> TBH, I wonder if we ought to have d_invalidate() variant that would
> unhash the dentry in question, do a variant of shrink_dcache_parent()
> that would report if there had been any mountpoints and if there
> had been any, do namespace_lock() and go hunting for mounts in that
> subtree, moving corresponding struct mountpoint to a private list
> as we go (removing them from mountpoint hash chains, that it). Then
> have them all evicted after we'd finished walking the subtree...
That sounds reasonable.
>
> The tricky part will be lock ordering - right now we have the
> mountpoint hash protected by mount_lock (same as mount hash, probably
> worth splitting anyway) and that nests outside of ->d_lock.
>
> Note that we don't do mountpoint hash lookups on mountpoint crossing
> - it's nowhere near the really hot paths. What we have is
> lookup_mountpoint() - plain hash lookup. Always
> under namespace_lock() and mount_lock.
> get_mountpoint() - there's an insertion into hash chain,
> with dentry passed through the d_set_mounted(), which would
> fail if we have d_invalidate() on the subtree.
> Also always under namespace_lock() and mount_lock.
> __put_mountpoint() - removal from the hash chain.
> We remove from hash chain after having cleared DCACHE_MOUNTED.
> _That_ can happen under mount_lock alone (taking out the stuck
> submounts on final mntput()).
>
> So convert the mountpoint hash chains to hlist_bl, bitlocks nesting under
> ->d_lock. Introduce a new dentry flag (DCHACE_MOUNT_INVALIDATION?)
> In d_walk() callback we would
> * do nothing if DCACHE_MOUNT is not set or DCACHE_MOUNT_INVALIDATION
> is.
> * otherwise set DCACHE_MOUNT_INVALIDATION, grab the bitlock on the
> mountpoint hash chain matching that dentry, find struct mountpoint in it,
> remove it from the chain and insert into a separate "collector" chain, all
> without messing with refcount.
Ok.
> In lookup_mountpoint() and get_mountpoint() take the bitlock on chain.
> In __put_mountpoint(), once it has grabbed ->d_lock
> * check if it has DCACHE_MOUNT_INVALIDATION, use that to
> decide which chain we are locking - the normal one or the collector
> * clear both DCACHE_MOUNT and DCACHE_MOUNT_INVALIDATION
> * remove from chain
> * unlock the chain
> * drop ->d_lock.
>
> Once we are finished walking the tree, go over the collector list
> and do what __detach_mount() guts do. We are no longer under
> any ->d_lock, so locking is not a problem. namespace_unlock() will
> flush them all, same as it does for __detach_mount().
Ok.
Powered by blists - more mailing lists