Message-ID: <20250421004419.GU2023217@ZenIV>
Date: Mon, 21 Apr 2025 01:44:19 +0100
From: Al Viro <viro@...iv.linux.org.uk>
To: Ian Kent <ikent@...hat.com>
Cc: Christian Brauner <brauner@...nel.org>, Ian Kent <raven@...maw.net>,
Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
Mark Brown <broonie@...nel.org>,
Eric Chanudet <echanude@...hat.com>, Jan Kara <jack@...e.cz>,
Clark Williams <clrkwllms@...nel.org>,
Steven Rostedt <rostedt@...dmis.org>, linux-fsdevel@...r.kernel.org,
linux-kernel@...r.kernel.org, linux-rt-devel@...ts.linux.dev,
Alexander Larsson <alexl@...hat.com>,
Lucas Karpinski <lkarpins@...hat.com>, Aishwarya.TCV@....com
Subject: Re: [PATCH v4] fs/namespace: defer RCU sync for MNT_DETACH umount

On Mon, Apr 21, 2025 at 08:12:25AM +0800, Ian Kent wrote:
> > One issue is of course that we need to guarantee that someone will
> > always put the last reference.
> >
> > Another is that any scheme that elides the grace period in
> > namespace_unlock() will also mess up the fastpath in mntput_no_expire()
> > such that we either have to take lock_mount_hash() on each
> > mntput_no_expire(), which is definitely a no-no, or we have to come up
> > with an elaborate scheme where we do a read_seqbegin() and
> > read_seqcount_retry() dance based on mount_lock. Then we still need to
> > fall back on lock_mount_hash() if the sequence count has changed. It's
> > ugly and it will likely be noticeable during RCU lookup.
>
> But the mounts are unhashed before we get to the rcu sync, doesn't that
> buy us an opportunity for the seqlock dance to be simpler?

What does being unhashed have to do with anything?  Both unhashing and
(a lot more relevant) clearing ->mnt_ns happen before the delay - the
whole point of the mntput_no_expire() fastpath is that the absolute
majority of calls happens when the damn thing is still mounted and we are
definitely not dropping the last reference.

We use "has non-NULL ->mnt_ns" as a cheap way to check that the reference
that pins the mounted stuff is still there.  The race we need to cope with
is
A: mntput_no_expire():
        sees ->mnt_ns != NULL, decides that we can go for the fast path
B: umount(2):
        detaches the sucker from the mount tree, clears ->mnt_ns, drops the
        reference that used to be there for the sucker being mounted.
        The only remaining reference is the one being dropped by A.
A: OK, so we are on the fast path; a plain decrement is all we need
... and now the mount has zero refcount and no one who would destroy it.
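
To make the window concrete, here is a minimal sketch - not the actual
fs/namespace.c code; mnt_add_count() and ->mnt_ns are the real names,
while mntput_no_expire_broken() is a made-up illustration - of what the
fast path would look like if umount(2) left no grace period between
clearing ->mnt_ns and dropping the "mounted" reference:

/* hypothetical, broken variant - for illustration only */
static void mntput_no_expire_broken(struct mount *mnt)
{
        if (READ_ONCE(mnt->mnt_ns)) {           /* A: looks mounted */
                /*
                 * B can run here: unhash the mount, clear ->mnt_ns and
                 * drop the "mounted" reference.  The decrement below
                 * then takes the refcount to zero with nobody left to
                 * free the thing.
                 */
                mnt_add_count(mnt, -1);         /* A: plain decrement */
                return;
        }
        /* slow path elided: lock_mount_hash(), check for last reference */
}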

An RCU delay between the removal from the mount tree (no matter which test
you are using to detect that) and dropping the reference that used to be
due to being mounted avoids that - we have rcu_read_lock() held over the
entire fast path of mntput_no_expire(), both the test (whichever test we
want to use) and the decrement.  We get

A: rcu_read_lock()
A: see that the victim is still in the tree (again, in whatever way you
   want - doesn't matter)
A: OK, we know that the grace period between the removal from the tree and
   dropping the reference couldn't have started before our rcu_read_lock().
   Therefore, even if we are seeing stale data and it's being unmounted,
   we are still guaranteed that the refcount is at least 2 - the one we are
   dropping and the one umount() couldn't have gotten around to dropping
   yet.  So the last reference will not be gone until we drop rcu_read_lock()
   and we can safely decrement the refcount without checking for zero.
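
In code, that guarantee boils down to something like the sketch below -
simplified, with the slow path elided, so treat it as an illustration of
the shape of mntput_no_expire() rather than a verbatim copy:

static void mntput_no_expire_sketch(struct mount *mnt)
{
        rcu_read_lock();
        if (likely(READ_ONCE(mnt->mnt_ns))) {
                /*
                 * umount(2) inserts an RCU grace period between clearing
                 * ->mnt_ns and dropping the "mounted" reference.  That
                 * grace period cannot complete while we sit inside
                 * rcu_read_lock(), so even if the value we just read is
                 * stale, the refcount is still at least 2 and this
                 * decrement cannot be the last one.
                 */
                mnt_add_count(mnt, -1);
                rcu_read_unlock();
                return;
        }
        rcu_read_unlock();
        /* slow path elided: lock_mount_hash(), possible teardown */
}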
And yes, that case of mntput() is _MUCH_ hotter than any umount-related
load might ever be. Many orders of magnitude hotter...