[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250322-vfs-namespace-09ebc48e2c4c@brauner>
Date: Sat, 22 Mar 2025 11:16:21 +0100
From: Christian Brauner <brauner@...nel.org>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Christian Brauner <brauner@...nel.org>,
linux-fsdevel@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: [GIT PULL] vfs namespace
Hey Linus,
/* Summary */
This expands the ability of anonymous mount namespaces:
- Creating detached mounts from detached mounts
Currently, detached mounts can only be created from attached mounts.
This limitaton prevents various use-cases. For example, the ability to
mount a subdirectory without ever having to make the whole filesystem
visible first.
The current permission modelis:
(1) Check that the caller is privileged over the owning user namespace
of it's current mount namespace.
(2) Check that the caller is located in the mount namespace of the mount
it wants to create a detached copy of.
While it is not strictly necessary to do it this way it is consistently
applied in the new mount api. This model will also be used when allowing
the creation of detached mount from another detached mount.
The (1) requirement can simply be met by performing the same check as
for the non-detached case, i.e., verify that the caller is privileged
over its current mount namespace.
To meet the (2) requirement it must be possible to infer the origin
mount namespace that the anonymous mount namespace of the detached mount
was created from.
The origin mount namespace of an anonymous mount is the mount namespace
that the mounts that were copied into the anonymous mount namespace
originate from.
In order to check the origin mount namespace of an anonymous mount
namespace the sequence number of the original mount namespace is
recorded in the anonymous mount namespace.
With this in place it is possible to perform an equivalent check (2') to
(2). The origin mount namespace of the anonymous mount namespace must be
the same as the caller's mount namespace. To establish this the sequence
number of the caller's mount namespace and the origin sequence number of
the anonymous mount namespace are compared.
The caller is always located in a non-anonymous mount namespace since
anonymous mount namespaces cannot be setns()ed into. The caller's mount
namespace will thus always have a valid sequence number.
The owning namespace of any mount namespace, anonymous or non-anonymous,
can never change. A mount attached to a non-anonymous mount namespace
can never change mount namespace.
If the sequence number of the non-anonymous mount namespace and the
origin sequence number of the anonymous mount namespace match, the
owning namespaces must match as well.
Hence, the capability check on the owning namespace of the caller's
mount namespace ensures that the caller has the ability to copy the
mount tree.
- Allow mount detached mounts on detached mounts
Currently, detached mounts can only be mounted onto attached mounts.
This limitation makes it impossible to assemble a new private rootfs
and move it into place. Instead, a detached tree must be created,
attached, then mounted open and then either moved or detached again.
Lift this restriction.
In order to allow mounting detached mounts onto other detached mounts
the same permission model used for creating detached mounts from
detached mounts can be used (cf. above).
Allowing to mount detached mounts onto detached mounts leaves three
cases to consider:
(1) The source mount is an attached mount and the target mount is a
detached mount. This would be equivalent to moving a mount between
different mount namespaces. A caller could move an attached mount to
a detached mount. The detached mount can now be freely attached to
any mount namespace. This changes the current delegatioh model
significantly for no good reason. So this will fail.
(2) Anonymous mount namespaces are always attached fully, i.e., it is
not possible to only attach a subtree of an anoymous mount
namespace. This simplifies the implementation and reasoning.
Consequently, if the anonymous mount namespace of the source
detached mount and the target detached mount are the identical the
mount request will fail.
(3) The source mount's anonymous mount namespace is different from the
target mount's anonymous mount namespace.
In this case the source anonymous mount namespace of the source
mount tree must be freed after its mounts have been moved to the
target anonymous mount namespace. The source anonymous mount
namespace must be empty afterwards.
By allowing to mount detached mounts onto detached mounts a caller may
do the following:
fd_tree1 = open_tree(-EBADF, "/mnt", OPEN_TREE_CLONE)
fd_tree2 = open_tree(-EBADF, "/tmp", OPEN_TREE_CLONE)
fd_tree1 and fd_tree2 refer to two different detached mount trees that
belong to two different anonymous mount namespace.
It is important to note that fd_tree1 and fd_tree2 both refer to the
root of their respective anonymous mount namespaces.
By allowing to mount detached mounts onto detached mounts the caller
may now do:
move_mount(fd_tree1, "", fd_tree2, "",
MOVE_MOUNT_F_EMPTY_PATH | MOVE_MOUNT_T_EMPTY_PATH)
This will cause the detached mount referred to by fd_tree1 to be
mounted on top of the detached mount referred to by fd_tree2.
Thus, the detached mount fd_tree1 is moved from its separate anonymous
mount namespace into fd_tree2's anonymous mount namespace.
It also means that while fd_tree2 continues to refer to the root of
its respective anonymous mount namespace fd_tree1 doesn't anymore.
This has the consequence that only fd_tree2 can be moved to another
anonymous or non-anonymous mount namespace. Moving fd_tree1 will now
fail as fd_tree1 doesn't refer to the root of an anoymous mount
namespace anymore.
Now fd_tree1 and fd_tree2 refer to separate detached mount trees
referring to the same anonymous mount namespace.
This is conceptually fine. The new mount api does allow for this to
happen already via:
mount -t tmpfs tmpfs /mnt
mkdir -p /mnt/A
mount -t tmpfs tmpfs /mnt/A
fd_tree3 = open_tree(-EBADF, "/mnt", OPEN_TREE_CLONE | AT_RECURSIVE)
fd_tree4 = open_tree(-EBADF, "/mnt/A", 0)
Both fd_tree3 and fd_tree4 refer to two different detached mount trees
but both detached mount trees refer to the same anonymous mount
namespace. An as with fd_tree1 and fd_tree2, only fd_tree3 may be
moved another mount namespace as fd_tree3 refers to the root of the
anonymous mount namespace just while fd_tree4 doesn't.
However, there's an important difference between the fd_tree3/fd_tree4
and the fd_tree1/fd_tree2 example.
Closing fd_tree4 and releasing the respective struct file will have no
further effect on fd_tree3's detached mount tree.
However, closing fd_tree3 will cause the mount tree and the respective
anonymous mount namespace to be destroyed causing the detached mount
tree of fd_tree4 to be invalid for further mounting.
By allowing to mount detached mounts on detached mounts as in the
fd_tree1/fd_tree2 example both struct files will affect each other.
Both fd_tree1 and fd_tree2 refer to struct files that have
FMODE_NEED_UNMOUNT set.
To handle this we use the fact that @fd_tree1 will have a parent mount
once it has been attached to @fd_tree2.
When dissolve_on_fput() is called the mount that has been passed in
will refer to the root of the anonymous mount namespace. If it doesn't
it would mean that mounts are leaked. So before allowing to mount
detached mounts onto detached mounts this would be a bug.
Now that detached mounts can be mounted onto detached mounts it just
means that the mount has been attached to another anonymous mount
namespace and thus dissolve_on_fput() must not unmount the mount tree
or free the anonymous mount namespace as the file referring to the
root of the namespace hasn't been closed yet.
If it had been closed yet it would be obvious because the mount
namespace would be NULL, i.e., the @fd_tree1 would have already been
unmounted. If @fd_tree1 hasn't been unmounted yet and has a parent
mount it is safe to skip any cleanup as closing @fd_tree2 will take
care of all cleanup operations.
- Allow mount propagation for detached mount trees
In commit ee2e3f50629f ("mount: fix mounting of detached mounts onto
targets that reside on shared mounts") I fixed a bug where propagating
the source mount tree of an anonymous mount namespace into a target
mount tree of a non-anonymous mount namespace could be used to trigger
an integer overflow in the non-anonymous mount namespace causing any new
mounts to fail.
The cause of this was that the propagation algorithm was unable to
recognize mounts from the source mount tree that were already propagated
into the target mount tree and then reappeared as propagation targets
when walking the destination propagation mount tree.
When fixing this I disabled mount propagation into anonymous mount
namespaces. Make it possible for anonymous mount namespace to receive
mount propagation events correctly. This is no also a correctness issue
now that we allow mounting detached mount trees onto detached mount
trees.
Mark the source anonymous mount namespace with MNTNS_PROPAGATING
indicating that all mounts belonging to this mount namespace are
currently in the process of being propagated and make the propagation
algorithm discard those if they appear as propagation targets.
/* Testing */
gcc version 14.2.0 (Debian 14.2.0-6)
Debian clang version 16.0.6 (27+b1)
No build failures or warnings were observed.
/* Conflicts */
Merge conflicts with mainline
=============================
No known conflicts.
Merge conflicts with other trees
================================
This contains a merge conflict with the vfs-6.15.mount pull request:
diff --cc fs/mount.h
index 946dc8b792d7,96862eba2246..000000000000
--- a/fs/mount.h
+++ b/fs/mount.h
@@@ -22,11 -26,8 +26,12 @@@ struct mnt_namespace
wait_queue_head_t poll;
struct rcu_head mnt_ns_rcu;
};
+ u64 seq_origin; /* Sequence number of origin mount namespace */
u64 event;
+#ifdef CONFIG_FSNOTIFY
+ __u32 n_fsnotify_mask;
+ struct fsnotify_mark_connector __rcu *n_fsnotify_marks;
+#endif
unsigned int nr_mounts; /* # of mounts in the namespace */
unsigned int pending_mounts;
struct rb_node mnt_ns_tree_node; /* node in the mnt_ns_tree */
The following changes since commit 2014c95afecee3e76ca4a56956a936e23283f05b:
Linux 6.14-rc1 (2025-02-02 15:39:26 -0800)
are available in the Git repository at:
git@...olite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs tags/vfs-6.15-rc1.mount.namespace
for you to fetch changes up to 06b1ce966e3f8bfef261c111feb3d4b33ede0cd8:
Merge patch series "mount: handle mount propagation for detached mount trees" (2025-03-04 09:29:55 +0100)
Please consider pulling these changes from the signed vfs-6.15-rc1.mount.namespace tag.
Thanks!
Christian
----------------------------------------------------------------
vfs-6.15-rc1.mount.namespace
----------------------------------------------------------------
Arnd Bergmann (1):
fs: namespace: fix uninitialized variable use
Christian Brauner (23):
Merge patch series "CONFIG_DEBUG_VFS at last"
fs: record sequence number of origin mount namespace
fs: add mnt_ns_empty() helper
fs: add assert for move_mount()
fs: add fastpath for dissolve_on_fput()
fs: add may_copy_tree()
fs: create detached mounts from detached mounts
selftests: create detached mounts from detached mounts
fs: support getname_maybe_null() in move_mount()
fs: mount detached mounts onto detached mounts
selftests: first test for mounting detached mounts onto detached mounts
selftests: second test for mounting detached mounts onto detached mounts
selftests: third test for mounting detached mounts onto detached mounts
selftests: fourth test for mounting detached mounts onto detached mounts
selftests: fifth test for mounting detached mounts onto detached mounts
selftests: sixth test for mounting detached mounts onto detached mounts
selftests: seventh test for mounting detached mounts onto detached mounts
Merge patch series "fs: expand abilities of anonymous mount namespaces"
fs: allow creating detached mounts from fsmount() file descriptors
mount: handle mount propagation for detached mount trees
selftests: add test for detached mount tree propagation
selftests: test subdirectory mounting
Merge patch series "mount: handle mount propagation for detached mount trees"
Mateusz Guzik (3):
vfs: add initial support for CONFIG_DEBUG_VFS
vfs: catch invalid modes in may_open()
vfs: use the new debug macros in inode_set_cached_link()
fs/inode.c | 15 +
fs/mount.h | 13 +
fs/namei.c | 2 +
fs/namespace.c | 367 ++++++++++--
fs/pnode.c | 10 +-
fs/pnode.h | 2 +-
include/linux/fs.h | 4 +
include/linux/vfsdebug.h | 45 ++
lib/Kconfig.debug | 9 +
.../selftests/mount_setattr/mount_setattr_test.c | 652 +++++++++++++++++++++
10 files changed, 1053 insertions(+), 66 deletions(-)
create mode 100644 include/linux/vfsdebug.h
Powered by blists - more mailing lists