Message-ID: <f708a1119b2ad8cf2514b1df128a4ef7cf21c636.camel@fejes.dev>
Date: Wed, 22 Oct 2025 13:00:01 +0200
From: Ferenc Fejes <ferenc@...es.dev>
To: Christian Brauner <brauner@...nel.org>, linux-fsdevel@...r.kernel.org,
Josef Bacik <josef@...icpanda.com>, Jeff Layton <jlayton@...nel.org>
Cc: Jann Horn <jannh@...gle.com>, Mike Yuan <me@...dnzj.com>, Zbigniew
Jędrzejewski-Szmek <zbyszek@...waw.pl>, Lennart
Poettering <mzxreary@...inter.de>, Daan De Meyer
<daan.j.demeyer@...il.com>, Aleksa Sarai <cyphar@...har.com>, Amir
Goldstein <amir73il@...il.com>, Tejun Heo <tj@...nel.org>, Johannes Weiner
<hannes@...xchg.org>, Thomas Gleixner <tglx@...utronix.de>, Alexander Viro
<viro@...iv.linux.org.uk>, Jan Kara <jack@...e.cz>,
linux-kernel@...r.kernel.org, cgroups@...r.kernel.org, bpf@...r.kernel.org,
Eric Dumazet <edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>,
netdev@...r.kernel.org, Arnd Bergmann <arnd@...db.de>
Subject: Re: [PATCH RFC DRAFT 00/50] nstree: listns()
On Tue, 2025-10-21 at 13:43 +0200, Christian Brauner wrote:
> Hey,
>
> As announced a while ago this is the next step building on the nstree
> work from prior cycles. There's a bunch of fixes and semantic cleanups
> in here and a ton of tests.
>
> I need help here! Consider the following current design:
>
> Currently listns() is relying on active namespace reference counts which
> are introduced alongside this series.
>
> The active reference count of a namespace consists of the live tasks
> that make use of this namespace and any namespace file descriptors that
> explicitly pin the namespace.
>
> Once all tasks making use of this namespace have exited or been reaped,
> all namespace file descriptors for that namespace have been closed, and
> all bind-mounts for that namespace have been unmounted, it ceases to
> appear in the listns() output.
>
> My reason for introducing the active reference count was that namespaces
> might obviously still be pinned internally for various reasons. For
> example the user namespace might still be pinned because there are still
> open files that have stashed the opener's credentials in file->f_cred, or
> the last reference might be put with an rcu delay keeping that namespace
> active on the namespace lists.
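
The split described above between "still pinned internally" and "actively in use" can be sketched as a toy model with two counters. This is only an illustration of the idea, not code from the series: struct toy_ns and every toy_* helper are hypothetical names, and the choice that an active reference also holds a passive one is an assumption made for the sketch.

```c
#include <stdbool.h>

/* HYPOTHETICAL toy model, not kernel code. */
struct toy_ns {
	unsigned int passive; /* any reference that keeps the object allocated */
	unsigned int active;  /* live tasks, namespace fds, bind mounts */
};

/* Internal pin, e.g. file->f_cred or an mm_struct stashing a user ns. */
static void toy_get_passive(struct toy_ns *ns) { ns->passive++; }
static void toy_put_passive(struct toy_ns *ns) { ns->passive--; }

/* Assumption: an active reference implies a passive one as well. */
static void toy_get_active(struct toy_ns *ns)
{
	toy_get_passive(ns);
	ns->active++;
}

static void toy_put_active(struct toy_ns *ns)
{
	ns->active--;
	toy_put_passive(ns);
}

/* Visible to a listns()-style enumeration only while actively used. */
static bool toy_listed(const struct toy_ns *ns) { return ns->active > 0; }

/* Still allocated as long as anything at all pins it. */
static bool toy_alive(const struct toy_ns *ns) { return ns->passive > 0; }
```

In this model a namespace whose last task has exited but which is still referenced through f_cred stays alive while dropping out of the listing.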
>
> But one particularly strange example is CONFIG_MMU_LAZY_TLB_REFCOUNT=y.
> Various architectures support the CONFIG_MMU_LAZY_TLB_REFCOUNT option
> which uses lazy TLB destruction.
>
> When this option is set a userspace task's struct mm_struct may be used
> for kernel threads such as the idle task and will only be destroyed once
> the cpu's runqueue switches back to another task. So the kernel thread
> will take a reference on the struct mm_struct pinning it.
>
> And for ptrace() based access checks struct mm_struct stashes the user
> namespace of the task that struct mm_struct belonged to originally and
> thus takes a reference to the user namespace and pins it.
>
> So on an idle system such user namespaces can be persisted for pretty
> arbitrary amounts of time via struct mm_struct.
>
> Now, without the active reference count regulating visibility all
> namespaces that are still pinned in some way on the system will appear in
> the listns() output and can be reopened using namespace file handles.
>
> Of course that requires suitable privileges and it's not really a
> concern per se because a task could also have persisted the namespace
> recorded in struct mm_struct explicitly and then the idle task would
> still reuse that struct mm_struct and another task could still happily
> setns() to it afaict and reuse it for something else.
>
> The active reference count though has drawbacks itself. Namely that
> socket files break the assumption that namespaces can only be opened if
> there's either live processes pinning the namespace or there are file
> descriptors open that pin the namespace itself as the socket SIOCGSKNS
> ioctl() can be used to open a network namespace based on a socket which
> only indirectly pins a network namespace.
>
> So that punches a hole in the active reference count tracking. So this
> will have to be handled as right now socket file descriptors that pin a
> network namespace that doesn't have an active reference anymore (no live
> processes, no explicit persistence via namespace fds) can't be used to
> issue a SIOCGSKNS ioctl() to open the associated network namespace.
>
> So two options I see if the api is based on ids:
>
> (1) We use the active reference count and somehow also make it work with
> sockets.
> (2) The active reference count is not needed and we say that listns() is
> an introspection system call anyway so we just always list
> namespaces regardless of why they are still pinned: files,
> mm_struct, network devices, everything is fair game.
> (3) Throw hands up in the air and just not do it.
>
> =====================================================================
>
> Add a new listns() system call that allows userspace to iterate through
> namespaces in the system. This provides a programmatic interface to
> discover and inspect namespaces, enhancing existing namespace apis.
>
> Currently, there is no direct way for userspace to enumerate namespaces
> in the system. Applications must resort to scanning /proc/<pid>/ns/
> across all processes, which is:
>
> 1. Inefficient - requires iterating over all processes
> 2. Incomplete - misses inactive namespaces that aren't attached to any
> running process but are kept alive by file descriptors, bind mounts,
> or parent namespace references
> 3. Permission-heavy - requires access to /proc for many processes
> 4. No ordering or ownership.
> 5. No filtering per namespace type: Must always iterate and check all
> namespaces.
>
> The list goes on. The listns() system call solves these problems by
> providing direct kernel-level enumeration of namespaces. It is similar
> to listmount() but obviously tailored to namespaces.
I've been waiting for such an API for years; thanks for working on it. I mostly
deal with network namespaces, where points 2 and 3 are especially painful.
Recently, I've used this eBPF snippet to discover (at most 1024, because of the
verifier's halt checking) network namespaces, even if no process is attached.
But I can't do anything with it in userspace since it's not possible to pass the
inode number or netns cookie value to setns()...

extern const void net_namespace_list __ksym;

static void list_all_netns()
{
    struct list_head *nslist =
        bpf_core_cast(&net_namespace_list, struct list_head);
    struct list_head *iter = nslist->next;

    bpf_repeat(1024) {
        const struct net *net =
            bpf_core_cast(container_of(iter, struct net, list),
                          struct net);

        // bpf_printk("net: %p inode: %u cookie: %lu",
        //            net, net->ns.inum, net->net_cookie);
        if (iter->next == nslist)
            break;
        iter = iter->next;
    }
}

>
> /*
> * @req: Pointer to struct ns_id_req specifying search parameters
> * @ns_ids: User buffer to receive namespace IDs
> * @nr_ns_ids: Size of ns_ids buffer (maximum number of IDs to return)
> * @flags: Reserved for future use (must be 0)
> */
> ssize_t listns(const struct ns_id_req *req, u64 *ns_ids,
> size_t nr_ns_ids, unsigned int flags);
>
> Returns:
> - On success: Number of namespace IDs written to ns_ids
> - On error: Negative error code
>
> /*
> * @size: Structure size
> * @ns_id: Starting point for iteration; use 0 for first call, then
> * use the last returned ID for subsequent calls to paginate
> * @ns_type: Bitmask of namespace types to include (from enum ns_type):
> * 0: Return all namespace types
> * MNT_NS: Mount namespaces
> * NET_NS: Network namespaces
> * USER_NS: User namespaces
> * etc. Can be OR'd together
> * @user_ns_id: Filter results to namespaces owned by this user namespace:
> * 0: Return all namespaces (subject to permission checks)
> * LISTNS_CURRENT_USER: Namespaces owned by caller's user
> * namespace
> * Other value: Namespaces owned by the specified user namespace
> * ID
> */
> struct ns_id_req {
> __u32 size; /* sizeof(struct ns_id_req) */
> __u32 spare; /* Reserved, must be 0 */
> __u64 ns_id; /* Last seen namespace ID (for pagination) */
> __u32 ns_type; /* Filter by namespace type(s) */
> __u32 spare2; /* Reserved, must be 0 */
> __u64 user_ns_id; /* Filter by owning user namespace */
> };
>
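
A minimal sketch of the pagination loop that the @ns_id description implies, to make the calling convention concrete. Since the syscall is not merged, mock_listns() below is a purely hypothetical userspace stand-in, and fake_ns_ids, list_all_ns() and the batch size of 3 are made up for illustration. It also assumes that IDs come back in ascending order and that a call returns only IDs greater than req->ns_id, which the cover letter suggests but does not spell out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>

/* Same layout as the struct above, using fixed-width userspace types. */
struct ns_id_req {
	uint32_t size;
	uint32_t spare;
	uint64_t ns_id;
	uint32_t ns_type;
	uint32_t spare2;
	uint64_t user_ns_id;
};

/* HYPOTHETICAL data standing in for the namespaces on a system. */
static const uint64_t fake_ns_ids[] = { 3, 7, 12, 40, 41, 99, 100 };
#define NR_FAKE (sizeof(fake_ns_ids) / sizeof(fake_ns_ids[0]))

/* HYPOTHETICAL mock with the proposed signature; ignores the filters. */
static ssize_t mock_listns(const struct ns_id_req *req, uint64_t *ns_ids,
			   size_t nr_ns_ids, unsigned int flags)
{
	size_t n = 0;

	if (flags || req->size != sizeof(*req) || req->spare || req->spare2)
		return -EINVAL;
	/* Assumed semantics: return IDs strictly after the cursor. */
	for (size_t i = 0; i < NR_FAKE && n < nr_ns_ids; i++)
		if (fake_ns_ids[i] > req->ns_id)
			ns_ids[n++] = fake_ns_ids[i];
	return (ssize_t)n;
}

/* Collect up to cap IDs by calling the mock in batches of 3. */
static ssize_t list_all_ns(uint64_t *out, size_t cap)
{
	struct ns_id_req req;
	uint64_t batch[3];
	size_t total = 0;

	memset(&req, 0, sizeof(req));
	req.size = sizeof(req);		/* ns_id == 0 starts the iteration */

	for (;;) {
		ssize_t n = mock_listns(&req, batch, 3, 0);

		if (n < 0)
			return n;	/* negative error code */
		if (n == 0)
			break;		/* nothing after the cursor */
		assert((size_t)n <= 3);
		for (ssize_t i = 0; i < n; i++) {
			if (total == cap)
				return (ssize_t)total;
			out[total++] = batch[i];
		}
		req.ns_id = batch[n - 1];	/* resume after the last ID seen */
	}
	return (ssize_t)total;
}
```

With a real syscall the loop body would stay the same; only mock_listns() would be replaced by the actual call once a syscall number exists.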
After this is merged, do you see any chance for backports? Does it rely on
recent changes that would be hard or impossible to backport? I'm not aware of
any backported syscalls, but this would be really nice to see in older kernels.
Ferenc