netdev - Re: [PATCH RFC DRAFT 00/50] nstree: listns()

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <f708a1119b2ad8cf2514b1df128a4ef7cf21c636.camel@fejes.dev>
Date: Wed, 22 Oct 2025 13:00:01 +0200
From: Ferenc Fejes <ferenc@...es.dev>
To: Christian Brauner <brauner@...nel.org>, linux-fsdevel@...r.kernel.org, 
 Josef Bacik <josef@...icpanda.com>, Jeff Layton <jlayton@...nel.org>
Cc: Jann Horn <jannh@...gle.com>, Mike Yuan <me@...dnzj.com>, Zbigniew
 Jędrzejewski-Szmek	 <zbyszek@...waw.pl>, Lennart
 Poettering <mzxreary@...inter.de>, Daan De Meyer	
 <daan.j.demeyer@...il.com>, Aleksa Sarai <cyphar@...har.com>, Amir
 Goldstein	 <amir73il@...il.com>, Tejun Heo <tj@...nel.org>, Johannes Weiner
	 <hannes@...xchg.org>, Thomas Gleixner <tglx@...utronix.de>, Alexander Viro
	 <viro@...iv.linux.org.uk>, Jan Kara <jack@...e.cz>, 
	linux-kernel@...r.kernel.org, cgroups@...r.kernel.org, bpf@...r.kernel.org,
  Eric Dumazet <edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>,
 netdev@...r.kernel.org, Arnd Bergmann	 <arnd@...db.de>
Subject: Re: [PATCH RFC DRAFT 00/50] nstree: listns()

On Tue, 2025-10-21 at 13:43 +0200, Christian Brauner wrote:
> Hey,
> 
> As announced a while ago this is the next step building on the nstree
> work from prior cycles. There's a bunch of fixes and semantic cleanups
> in here and a ton of tests.
> 
> I need helper here!: Consider the following current design:
> 
> Currently listns() is relying on active namespace reference counts which
> are introduced alongside this series.
> 
> The active reference count of a namespace consists of the live tasks
> that make use of this namespace and any namespace file descriptors that
> explicitly pin the namespace.
> 
> Once all tasks making use of this namespace have exited or reaped, all
> namespace file descriptors for that namespace have been closed and all
> bind-mounts for that namespace unmounted it ceases to appear in the
> listns() output.
> 
> My reason for introducing the active reference count was that namespaces
> might obviously still be pinned internally for various reasons. For
> example the user namespace might still be pinned because there are still
> open files that have stashed the openers credentials in file->f_cred, or
> the last reference might be put with an rcu delay keeping that namespace
> active on the namespace lists.
> 
> But one particularly strange example is CONFIG_MMU_LAZY_TLB_REFCOUNT=y.
> Various architectures support the CONFIG_MMU_LAZY_TLB_REFCOUNT option
> which uses lazy TLB destruction.
> 
> When this option is set a userspace task's struct mm_struct may be used
> for kernel threads such as the idle task and will only be destroyed once
> the cpu's runqueue switches back to another task. So the kernel thread
> will take a reference on the struct mm_struct pinning it.
> 
> And for ptrace() based access checks struct mm_struct stashes the user
> namespace of the task that struct mm_struct belonged to originally and
> thus takes a reference to the users namespace and pins it.
> 
> So on an idle system such user namespaces can be persisted for pretty
> arbitrary amounts of time via struct mm_struct.
> 
> Now, without the active reference count regulating visibility all
> namespace that still are pinned in some way on the system will appear in
> the listns() output and can be reopened using namespace file handles.
> 
> Of course that requires suitable privileges and it's not really a
> concern per se because a task could've also persist the namespace
> recorded in struct mm_struct explicitly and then the idle task would
> still reuse that struct mm_struct and another task could still happily
> setns() to it afaict and reuse it for something else.
> 
> The active reference count though has drawbacks itself. Namely that
> socket files break the assumption that namespaces can only be opened if
> there's either live processes pinning the namespace or there are file
> descriptors open that pin the namespace itself as the socket SIOCGSKNS
> ioctl() can be used to open a network namespace based on a socket which
> only indirectly pins a network namespace.
> 
> So that punches a whole in the active reference count tracking. So this
> will have to be handled as right now socket file descriptors that pin a
> network namespace that don't have an active reference anymore (no live
> processes, not explicit persistence via namespace fds) can't be used to
> issue a SIOCGSKNS ioctl() to open the associated network namespace.
> 
> So two options I see if the api is based on ids:
> 
> (1) We use the active reference count and somehow also make it work with
>     sockets.
> (2) The active reference count is not needed and we say that listns() is
>     an introspection system call anyway so we just always list
>     namespaces regardless of why they are still pinned: files,
>     mm_struct, network devices, everything is fair game.
> (3) Throw hands up in the air and just not do it.
> 
> =====================================================================
> 
> Add a new listns() system call that allows userspace to iterate through
> namespaces in the system. This provides a programmatic interface to
> discover and inspect namespaces, enhancing existing namespace apis.
> 
> Currently, there is no direct way for userspace to enumerate namespaces
> in the system. Applications must resort to scanning /proc/<pid>/ns/
> across all processes, which is:
> 
> 1. Inefficient - requires iterating over all processes
> 2. Incomplete - misses inactive namespaces that aren't attached to any
>    running process but are kept alive by file descriptors, bind mounts,
>    or parent namespace references
> 3. Permission-heavy - requires access to /proc for many processes
> 4. No ordering or ownership.
> 5. No filtering per namespace type: Must always iterate and check all
>    namespaces.
> 
> The list goes on. The listns() system call solves these problems by
> providing direct kernel-level enumeration of namespaces. It is similar
> to listmount() but obviously tailored to namespaces.

I've been waiting for such an API for years; thanks for working on it. I mostly
deal with network namespaces, where points 2 and 3 are especially painful.

Recently, I've used this eBPF snippet to discover (at most 1024, because of the
verifier's halt checking) network namespaces, even if no process is attached.
But I can't do anything with it in userspace since it's not possible to pass the
inode number or netns cookie value to setns()...

extern const void net_namespace_list __ksym;
static void list_all_netns()
{
    struct list_head *nslist = 
	bpf_core_cast(&net_namespace_list, struct list_head);

    struct list_head *iter = nslist->next;

    bpf_repeat(1024) {
        const struct net *net = 
		bpf_core_cast(container_of(iter, struct net, list), struct
net);

        // bpf_printk("net: %p inode: %u cookie: %lu", 
	//	net, net->ns.inum, net->net_cookie);

        if (iter->next == nslist)
            break;
        iter = iter->next;
    }
}

> 
> /*
>  * @req: Pointer to struct ns_id_req specifying search parameters
>  * @ns_ids: User buffer to receive namespace IDs
>  * @nr_ns_ids: Size of ns_ids buffer (maximum number of IDs to return)
>  * @flags: Reserved for future use (must be 0)
>  */
> ssize_t listns(const struct ns_id_req *req, u64 *ns_ids,
>                size_t nr_ns_ids, unsigned int flags);
> 
> Returns:
> - On success: Number of namespace IDs written to ns_ids
> - On error: Negative error code
> 
> /*
>  * @size: Structure size
>  * @ns_id: Starting point for iteration; use 0 for first call, then
>  *         use the last returned ID for subsequent calls to paginate
>  * @ns_type: Bitmask of namespace types to include (from enum ns_type):
>  *           0: Return all namespace types
>  *           MNT_NS: Mount namespaces
>  *           NET_NS: Network namespaces
>  *           USER_NS: User namespaces
>  *           etc. Can be OR'd together
>  * @user_ns_id: Filter results to namespaces owned by this user namespace:
>  *              0: Return all namespaces (subject to permission checks)
>  *              LISTNS_CURRENT_USER: Namespaces owned by caller's user
> namespace
>  *              Other value: Namespaces owned by the specified user namespace
> ID
>  */
> struct ns_id_req {
>         __u32 size;         /* sizeof(struct ns_id_req) */
>         __u32 spare;        /* Reserved, must be 0 */
>         __u64 ns_id;        /* Last seen namespace ID (for pagination) */
>         __u32 ns_type;      /* Filter by namespace type(s) */
>         __u32 spare2;       /* Reserved, must be 0 */
>         __u64 user_ns_id;   /* Filter by owning user namespace */
> };
> 

After this merged, do you see any chance for backports? Does it rely on recent
bits which is hard/impossible to backport? I'm not aware of backported syscalls
but this would be really nice to see in older kernels.

Ferenc