[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <5ab02765a247dbaebc7d1224ee20a3bc01adc330.camel@kernel.org>
Date: Mon, 27 Oct 2025 10:10:15 -0400
From: Jeff Layton <jlayton@...nel.org>
To: Christian Brauner <brauner@...nel.org>, linux-fsdevel@...r.kernel.org,
Josef Bacik <josef@...icpanda.com>
Cc: Jann Horn <jannh@...gle.com>, Mike Yuan <me@...dnzj.com>, Zbigniew
Jędrzejewski-Szmek <zbyszek@...waw.pl>, Lennart
Poettering <mzxreary@...inter.de>, Daan De Meyer
<daan.j.demeyer@...il.com>, Aleksa Sarai <cyphar@...har.com>, Amir
Goldstein <amir73il@...il.com>, Tejun Heo <tj@...nel.org>, Johannes Weiner
<hannes@...xchg.org>, Thomas Gleixner <tglx@...utronix.de>, Alexander Viro
<viro@...iv.linux.org.uk>, Jan Kara <jack@...e.cz>,
linux-kernel@...r.kernel.org, cgroups@...r.kernel.org, bpf@...r.kernel.org,
Eric Dumazet <edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>,
netdev@...r.kernel.org, Arnd Bergmann <arnd@...db.de>
Subject: Re: [PATCH v3 00/70] nstree: listns()
On Fri, 2025-10-24 at 12:52 +0200, Christian Brauner wrote:
> Hey,
>
> As announced a while ago this is the next step building on the nstree
> work from prior cycles. There's a bunch of fixes and semantic cleanups
> in here and a ton of tests.
>
> Currently listns() is relying on active namespace reference counts which
> are introduced alongside this series.
>
> While a namespace is on the namespace trees with a valid reference count
> it is possible to reopen it through a namespace file handle. This is all
> fine but has some issues that should be addressed.
>
> On current kernels a namespace is visible to userspace in the
> following cases:
>
> (1) The namespace is in use by a task.
> (2) The namespace is persisted through a VFS object (namespace file
> descriptor or bind-mount).
> Note that (2) only cares about direct persistence of the namespace
> itself not indirectly via e.g., file->f_cred file references or
> similar.
> (3) The namespace is a hierarchical namespace type and is the parent of
> a single or multiple child namespaces.
>
> Case (3) is interesting because it is possible that a parent namespace
> might not fulfill any of (1) or (2), i.e., is invisible to userspace but
> it may still be resurrected through the NS_GET_PARENT ioctl().
>
> Currently namespace file handles allow much broader access to namespaces
> than what is currently possible via (1)-(3). The reason is that
> namespaces may remain pinned for completely internal reasons yet are
> inaccessible to userspace.
>
> For example, a user namespace my remain pinned by get_cred() calls to
> stash the opener's credentials into file->f_cred. As it stands file
> handles allow to resurrect such a users namespace even though this
> should not be possible via (1)-(3). This is a fundamental uapi change
> that we shouldn't do if we don't have to.
>
> Consider the following insane case: Various architectures support the
> CONFIG_MMU_LAZY_TLB_REFCOUNT option which uses lazy TLB destruction.
> When this option is set a userspace task's struct mm_struct may be used
> for kernel threads such as the idle task and will only be destroyed once
> the cpu's runqueue switches back to another task. But because of ptrace()
> permission checks struct mm_struct stashes the user namespace of the
> task that struct mm_struct originally belonged to. The kernel thread
> will take a reference on the struct mm_struct and thus pin it.
>
> So on an idle system user namespaces can be persisted for arbitrary
> amounts of time which also means that they can be resurrected using
> namespace file handles. That makes no sense whatsoever. The problem is
> of course excarabted on large systems with a huge number of cpus.
>
> To handle this nicely we introduce an active reference count which
> tracks (1)-(3). This is easy to do as all of these things are already
> managed centrally. Only (1)-(3) will count towards the active reference
> count and only namespaces which are active may be opened via namespace
> file handles.
>
> The problem is that namespaces may be resurrected. Which means that they
> can become temporarily inactive and will be reactived some time later.
> Currently the only example of this is the SIOGCSKNS socket ioctl. The
> SIOCGSKNS ioctl allows to open a network namespace file descriptor based
> on a socket file descriptor.
>
> If a socket is tied to a network namespace that subsequently becomes
> inactive but that socket is persisted by another process in another
> network namespace (e.g., via SCM_RIGHTS of pidfd_getfd()) then the
> SIOCGSKNS ioctl will resurrect this network namespace.
>
> So calls to open_related_ns() and open_namespace() will end up
> resurrecting the corresponding namespace tree.
>
> Note that the active reference count does not regulate the lifetime of
> the namespace itself. This is still done by the normal reference count.
> The active reference count can only be elevated if the regular reference
> count is elevated.
>
> The active reference count also doesn't regulate the presence of a
> namespace on the namespace trees. It only regulates its visiblity to
> namespace file handles (and in later patches to listns()).
>
> A namespace remains on the namespace trees from creation until its
> actual destruction. This will allow the kernel to always reach any
> namespace trivially and it will also enable subsystems like bpf to walk
> the namespace lists on the system for tracing or general introspection
> purposes.
>
> Note that different namespaces have different visibility lifetimes on
> current kernels. While most namespace are immediately released when the
> last task using them exits, the user- and pid namespace are persisted
> and thus both remain accessible via /proc/<pid>/ns/<ns_type>.
>
> The user namespace lifetime is aliged with struct cred and is only
> released through exit_creds(). However, it becomes inaccessible to
> userspace once the last task using it is reaped, i.e., when
> release_task() is called and all proc entries are flushed. Similarly,
> the pid namespace is also visible until the last task using it has been
> reaped and the associated pid numbers are freed.
>
> The active reference counts of the user- and pid namespace are
> decremented once the task is reaped.
>
> Based on the namespace trees and the active reference count, a new
> listns() system call that allows userspace to iterate through namespaces
> in the system. This provides a programmatic interface to discover and
> inspect namespaces, enhancing existing namespace apis.
>
> Currently, there is no direct way for userspace to enumerate namespaces
> in the system. Applications must resort to scanning /proc/<pid>/ns/
> across all processes, which is:
>
> 1. Inefficient - requires iterating over all processes
> 2. Incomplete - misses inactive namespaces that aren't attached to any
> running process but are kept alive by file descriptors, bind mounts,
> or parent namespace references
> 3. Permission-heavy - requires access to /proc for many processes
> 4. No ordering or ownership.
> 5. No filtering per namespace type: Must always iterate and check all
> namespaces.
>
> The list goes on. The listns() system call solves these problems by
> providing direct kernel-level enumeration of namespaces. It is similar
> to listmount() but obviously tailored to namespaces.
>
> /*
> * @req: Pointer to struct ns_id_req specifying search parameters
> * @ns_ids: User buffer to receive namespace IDs
> * @nr_ns_ids: Size of ns_ids buffer (maximum number of IDs to return)
> * @flags: Reserved for future use (must be 0)
> */
> ssize_t listns(const struct ns_id_req *req, u64 *ns_ids,
> size_t nr_ns_ids, unsigned int flags);
>
> Returns:
> - On success: Number of namespace IDs written to ns_ids
> - On error: Negative error code
>
> /*
> * @size: Structure size
> * @ns_id: Starting point for iteration; use 0 for first call, then
> * use the last returned ID for subsequent calls to paginate
> * @ns_type: Bitmask of namespace types to include (from enum ns_type):
> * 0: Return all namespace types
> * MNT_NS: Mount namespaces
> * NET_NS: Network namespaces
> * USER_NS: User namespaces
> * etc. Can be OR'd together
> * @user_ns_id: Filter results to namespaces owned by this user namespace:
> * 0: Return all namespaces (subject to permission checks)
> * LISTNS_CURRENT_USER: Namespaces owned by caller's user namespace
> * Other value: Namespaces owned by the specified user namespace ID
> */
> struct ns_id_req {
> __u32 size; /* sizeof(struct ns_id_req) */
> __u32 spare; /* Reserved, must be 0 */
> __u64 ns_id; /* Last seen namespace ID (for pagination) */
> __u32 ns_type; /* Filter by namespace type(s) */
> __u32 spare2; /* Reserved, must be 0 */
> __u64 user_ns_id; /* Filter by owning user namespace */
> };
>
> Example 1: List all namespaces
>
> void list_all_namespaces(void)
> {
> struct ns_id_req req = {
> .size = sizeof(req),
> .ns_id = 0, /* Start from beginning */
> .ns_type = 0, /* All types */
> .user_ns_id = 0, /* All user namespaces */
> };
> uint64_t ids[100];
> ssize_t ret;
>
> printf("All namespaces in the system:\n");
> do {
> ret = listns(&req, ids, 100, 0);
> if (ret < 0) {
> perror("listns");
> break;
> }
>
> for (ssize_t i = 0; i < ret; i++)
> printf(" Namespace ID: %llu\n", (unsigned long long)ids[i]);
>
> /* Continue from last seen ID */
> if (ret > 0)
> req.ns_id = ids[ret - 1];
> } while (ret == 100); /* Buffer was full, more may exist */
> }
>
> Example 2 : List network namespaces only
>
> void list_network_namespaces(void)
> {
> struct ns_id_req req = {
> .size = sizeof(req),
> .ns_id = 0,
> .ns_type = NET_NS, /* Only network namespaces */
> .user_ns_id = 0,
> };
> uint64_t ids[100];
> ssize_t ret;
>
> ret = listns(&req, ids, 100, 0);
> if (ret < 0) {
> perror("listns");
> return;
> }
>
> printf("Network namespaces: %zd found\n", ret);
> for (ssize_t i = 0; i < ret; i++)
> printf(" netns ID: %llu\n", (unsigned long long)ids[i]);
> }
>
> Example 3 : List namespaces owned by current user namespace
>
> void list_owned_namespaces(void)
> {
> struct ns_id_req req = {
> .size = sizeof(req),
> .ns_id = 0,
> .ns_type = 0, /* All types */
> .user_ns_id = LISTNS_CURRENT_USER, /* Current userns */
> };
> uint64_t ids[100];
> ssize_t ret;
>
> ret = listns(&req, ids, 100, 0);
> if (ret < 0) {
> perror("listns");
> return;
> }
>
> printf("Namespaces owned by my user namespace: %zd\n", ret);
> for (ssize_t i = 0; i < ret; i++)
> printf(" ns ID: %llu\n", (unsigned long long)ids[i]);
> }
>
> Example 4 : List multiple namespace types
>
> void list_network_and_mount_namespaces(void)
> {
> struct ns_id_req req = {
> .size = sizeof(req),
> .ns_id = 0,
> .ns_type = NET_NS | MNT_NS, /* Network and mount */
> .user_ns_id = 0,
> };
> uint64_t ids[100];
> ssize_t ret;
>
> ret = listns(&req, ids, 100, 0);
> printf("Network and mount namespaces: %zd found\n", ret);
> }
>
> Example 5 : Pagination through large namespace sets
>
> void list_all_with_pagination(void)
> {
> struct ns_id_req req = {
> .size = sizeof(req),
> .ns_id = 0,
> .ns_type = 0,
> .user_ns_id = 0,
> };
> uint64_t ids[50];
> size_t total = 0;
> ssize_t ret;
>
> printf("Enumerating all namespaces with pagination:\n");
>
> while (1) {
> ret = listns(&req, ids, 50, 0);
> if (ret < 0) {
> perror("listns");
> break;
> }
> if (ret == 0)
> break; /* No more namespaces */
>
> total += ret;
> printf(" Batch: %zd namespaces\n", ret);
>
> /* Last ID in this batch becomes start of next batch */
> req.ns_id = ids[ret - 1];
>
> if (ret < 50)
> break; /* Partial batch = end of results */
> }
>
> printf("Total: %zu namespaces\n", total);
> }
>
> listns() respects namespace isolation and capabilities:
>
> (1) Global listing (user_ns_id = 0):
> - Requires CAP_SYS_ADMIN in the namespace's owning user namespace
> - OR the namespace must be in the caller's namespace context (e.g.,
> a namespace the caller is currently using)
> - User namespaces additionally allow listing if the caller has
> CAP_SYS_ADMIN in that user namespace itself
> (2) Owner-filtered listing (user_ns_id != 0):
> - Requires CAP_SYS_ADMIN in the specified owner user namespace
> - OR the namespace must be in the caller's namespace context
> - This allows unprivileged processes to enumerate namespaces they own
> (3) Visibility:
> - Only "active" namespaces are listed
> - A namespace is active if it has a non-zero __ns_ref_active count
> - This includes namespaces used by running processes, held by open
> file descriptors, or kept active by bind mounts
> - Inactive namespaces (kept alive only by internal kernel
> references) are not visible via listns()
>
> Signed-off-by: Christian Brauner <brauner@...nel.org>
> ---
> Changes in v3:
> - Expanded test-suite.
> - Moved active reference count tracking for task-attached namespaces to
> dedicated helpers.
> - Fixed active reference count leaks when creating a new process fails.
> - Allow to be rescheduled when walking a a long namespace list.
> - Grab reference count when accessing a namespace when walking the list.
> - Link to v2: https://patch.msgid.link/20251022-work-namespace-nstree-listns-v2-0-71a588572371@kernel.org
>
> Changes in v2:
> - Fully implement the active reference count.
> - Fix various minor issues.
> - Expand the testsuite to test complex resurrection scenarios due to SIOCGSKNS.
> - Currently each task takes an active reference on the user namespace as
> credentials can be persisted for a very long time and completely
> arbitrary reasons but we don't want to tie the lifetime of a user
> namespace being visible to userspace to the existence of some
> credentials being stashed somewhere. We want to tie it to it being
> in-use by actual tasks or vfs objects and then go away. There might be
> more clever ways of doing this but for now this is good enough.
> - TODO: Add detailed tests for multi-threaded namespace sharing.
> - Link to v1: https://patch.msgid.link/20251021-work-namespace-nstree-listns-v1-0-ad44261a8a5b@kernel.org
>
> ---
> Christian Brauner (70):
> libfs: allow to specify s_d_flags
> nsfs: use inode_just_drop()
> nsfs: raise DCACHE_DONTCACHE explicitly
> pidfs: raise DCACHE_DONTCACHE explicitly
> nsfs: raise SB_I_NODEV and SB_I_NOEXEC
> cgroup: add cgroup namespace to tree after owner is set
> nstree: simplify return
> ns: initialize ns_list_node for initial namespaces
> ns: add __ns_ref_read()
> ns: rename to exit_nsproxy_namespaces()
> ns: add active reference count
> ns: use anonymous struct to group list member
> nstree: introduce a unified tree
> nstree: allow lookup solely based on inode
> nstree: assign fixed ids to the initial namespaces
> ns: maintain list of owned namespaces
> nstree: add listns()
> arch: hookup listns() system call
> nsfs: update tools header
> selftests/filesystems: remove CLONE_NEWPIDNS from setup_userns() helper
> selftests/namespaces: first active reference count tests
> selftests/namespaces: second active reference count tests
> selftests/namespaces: third active reference count tests
> selftests/namespaces: fourth active reference count tests
> selftests/namespaces: fifth active reference count tests
> selftests/namespaces: sixth active reference count tests
> selftests/namespaces: seventh active reference count tests
> selftests/namespaces: eigth active reference count tests
> selftests/namespaces: ninth active reference count tests
> selftests/namespaces: tenth active reference count tests
> selftests/namespaces: eleventh active reference count tests
> selftests/namespaces: twelth active reference count tests
> selftests/namespaces: thirteenth active reference count tests
> selftests/namespaces: fourteenth active reference count tests
> selftests/namespaces: fifteenth active reference count tests
> selftests/namespaces: add listns() wrapper
> selftests/namespaces: first listns() test
> selftests/namespaces: second listns() test
> selftests/namespaces: third listns() test
> selftests/namespaces: fourth listns() test
> selftests/namespaces: fifth listns() test
> selftests/namespaces: sixth listns() test
> selftests/namespaces: seventh listns() test
> selftests/namespaces: eigth listns() test
> selftests/namespaces: ninth listns() test
> selftests/namespaces: first listns() permission test
> selftests/namespaces: second listns() permission test
> selftests/namespaces: third listns() permission test
> selftests/namespaces: fourth listns() permission test
> selftests/namespaces: fifth listns() permission test
> selftests/namespaces: sixth listns() permission test
> selftests/namespaces: seventh listns() permission test
> selftests/namespaces: first inactive namespace resurrection test
> selftests/namespaces: second inactive namespace resurrection test
> selftests/namespaces: third inactive namespace resurrection test
> selftests/namespaces: fourth inactive namespace resurrection test
> selftests/namespaces: fifth inactive namespace resurrection test
> selftests/namespaces: sixth inactive namespace resurrection test
> selftests/namespaces: seventh inactive namespace resurrection test
> selftests/namespaces: eigth inactive namespace resurrection test
> selftests/namespaces: ninth inactive namespace resurrection test
> selftests/namespaces: tenth inactive namespace resurrection test
> selftests/namespaces: eleventh inactive namespace resurrection test
> selftests/namespaces: twelth inactive namespace resurrection test
> selftests/namespace: first threaded active reference count test
> selftests/namespace: second threaded active reference count test
> selftests/namespace: third threaded active reference count test
> selftests/namespace: commit_creds() active reference tests
> selftests/namespace: add stress test
> selftests/namespace: test listns() pagination
>
> arch/alpha/kernel/syscalls/syscall.tbl | 1 +
> arch/arm/tools/syscall.tbl | 1 +
> arch/arm64/tools/syscall_32.tbl | 1 +
> arch/m68k/kernel/syscalls/syscall.tbl | 1 +
> arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
> arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
> arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
> arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
> arch/parisc/kernel/syscalls/syscall.tbl | 1 +
> arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
> arch/s390/kernel/syscalls/syscall.tbl | 1 +
> arch/sh/kernel/syscalls/syscall.tbl | 1 +
> arch/sparc/kernel/syscalls/syscall.tbl | 1 +
> arch/x86/entry/syscalls/syscall_32.tbl | 1 +
> arch/x86/entry/syscalls/syscall_64.tbl | 1 +
> arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
> fs/libfs.c | 1 +
> fs/namespace.c | 8 +-
> fs/nsfs.c | 95 +-
> fs/pidfs.c | 1 +
> include/linux/ns_common.h | 166 +-
> include/linux/nsfs.h | 3 +
> include/linux/nsproxy.h | 5 +-
> include/linux/nstree.h | 26 +-
> include/linux/pseudo_fs.h | 1 +
> include/linux/syscalls.h | 4 +
> include/linux/user_namespace.h | 4 +-
> include/uapi/asm-generic/unistd.h | 4 +-
> include/uapi/linux/nsfs.h | 58 +
> init/version-timestamp.c | 5 +
> ipc/msgutil.c | 5 +
> ipc/namespace.c | 1 +
> kernel/cgroup/cgroup.c | 11 +-
> kernel/cgroup/namespace.c | 3 +-
> kernel/cred.c | 6 +
> kernel/exit.c | 3 +-
> kernel/fork.c | 3 +-
> kernel/nscommon.c | 227 +-
> kernel/nsproxy.c | 25 +-
> kernel/nstree.c | 540 +++-
> kernel/pid.c | 10 +
> kernel/pid_namespace.c | 1 +
> kernel/time/namespace.c | 6 +
> kernel/user.c | 5 +
> kernel/user_namespace.c | 1 +
> kernel/utsname.c | 1 +
> net/core/net_namespace.c | 3 +-
> scripts/syscall.tbl | 1 +
> tools/include/uapi/linux/nsfs.h | 70 +
> tools/testing/selftests/filesystems/utils.c | 2 +-
> tools/testing/selftests/namespaces/.gitignore | 7 +
> tools/testing/selftests/namespaces/Makefile | 20 +-
> .../selftests/namespaces/cred_change_test.c | 814 ++++++
> .../selftests/namespaces/listns_pagination_bug.c | 138 +
> .../selftests/namespaces/listns_permissions_test.c | 759 ++++++
> tools/testing/selftests/namespaces/listns_test.c | 679 +++++
> .../selftests/namespaces/ns_active_ref_test.c | 2672 ++++++++++++++++++++
> .../testing/selftests/namespaces/siocgskns_test.c | 1824 +++++++++++++
> tools/testing/selftests/namespaces/stress_test.c | 626 +++++
> tools/testing/selftests/namespaces/wrappers.h | 35 +
> 60 files changed, 8835 insertions(+), 60 deletions(-)
> ---
> base-commit: 3a8660878839faadb4f1a6dd72c3179c1df56787
> change-id: 20251020-work-namespace-nstree-listns-9fd71518515c
This looks pretty great overall, Christian. Nice work!
I hate the fact that we have to deal with resurrection here since it
makes things much messier, but I don't see a great alternative. I found
the nsfs filehandle format, btw, so that seems fine.
You can add this to patches 1-19, though I'd still prefer that you
split the ns_owner_tree handling out of patch #17 and into a separate
patch.
Reviewed-by: Jeff Layton <jlayton@...nel.org>
Powered by blists - more mailing lists