Message-ID: <a06ceeb57ba62aeb6df00bd49faad1bb5073321c.camel@fejes.dev>
Date: Mon, 27 Oct 2025 11:49:20 +0100
From: Ferenc Fejes <ferenc@...es.dev>
To: Christian Brauner <brauner@...nel.org>
Cc: linux-fsdevel@...r.kernel.org, Josef Bacik <josef@...icpanda.com>, Jeff
Layton <jlayton@...nel.org>, Jann Horn <jannh@...gle.com>, Mike Yuan
<me@...dnzj.com>, Zbigniew Jędrzejewski-Szmek
<zbyszek@...waw.pl>, Lennart Poettering <mzxreary@...inter.de>, Daan De
Meyer <daan.j.demeyer@...il.com>, Aleksa Sarai <cyphar@...har.com>, Amir
Goldstein <amir73il@...il.com>, Tejun Heo <tj@...nel.org>, Johannes Weiner
<hannes@...xchg.org>, Thomas Gleixner <tglx@...utronix.de>, Alexander Viro
<viro@...iv.linux.org.uk>, Jan Kara <jack@...e.cz>,
linux-kernel@...r.kernel.org, cgroups@...r.kernel.org, bpf@...r.kernel.org,
Eric Dumazet <edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>,
netdev@...r.kernel.org, Arnd Bergmann <arnd@...db.de>
Subject: Re: [PATCH RFC DRAFT 00/50] nstree: listns()
On Fri, 2025-10-24 at 16:50 +0200, Christian Brauner wrote:
> > > Add a new listns() system call that allows userspace to iterate through
> > > namespaces in the system. This provides a programmatic interface to
> > > discover and inspect namespaces, enhancing existing namespace APIs.
> > >
> > > Currently, there is no direct way for userspace to enumerate namespaces
> > > in the system. Applications must resort to scanning /proc/<pid>/ns/
> > > across all processes, which is:
> > >
> > > 1. Inefficient - requires iterating over all processes
> > > 2. Incomplete - misses inactive namespaces that aren't attached to any
> > > running process but are kept alive by file descriptors, bind mounts,
> > > or parent namespace references
> > > 3. Permission-heavy - requires access to /proc for many processes
> > > 4. No ordering or ownership.
> > > 5. No filtering per namespace type: Must always iterate and check all
> > > namespaces.
> > >
> > > The list goes on. The listns() system call solves these problems by
> > > providing direct kernel-level enumeration of namespaces. It is similar
> > > to listmount() but obviously tailored to namespaces.
> >
> > I've been waiting for such an API for years; thanks for working on it.
> > I mostly deal with network namespaces, where points 2 and 3 are
> > especially painful.
> >
> > Recently, I've used this eBPF snippet to discover network namespaces
> > (at most 1024, because of the verifier's halt checking), even if no
> > process is attached. But I can't do anything with it in userspace since
> > it's not possible to pass the inode number or netns cookie value to
> > setns()...
>
> I've mentioned it in the cover letter and in my earlier reply to Josef:
>
> On v6.18+ kernels it is possible to generate and open file handles to
> namespaces. This is probably an API that people outside of fs/ proper
> aren't all that familiar with.
>
> In essence it allows you to refer to files - or, more generally, kernel
> objects that may be referenced via files - via opaque handles instead of
> paths.
>
> For regular filesystems that are multi-instance (IOW, you can have
> multiple btrfs or ext4 filesystems mounted), such file handles cannot be
> used without providing a file descriptor to another object in the
> filesystem that is used to resolve the file handle...
>
> However, for single-instance filesystems like pidfs and nsfs that's not
> required which is why I added:
>
> FD_PIDFS_ROOT
> FD_NSFS_ROOT
>
> which means that you can open both pidfds and namespaces via
> open_by_handle_at() purely based on the file handle. I call such file
> handles "exhaustive file handles" because they fully describe the object
> so it is resolvable without any further information.
>
> They are also not subject to the capable(CAP_DAC_READ_SEARCH) permission
> check that regular file handles are, and so can be used even by
> unprivileged code as long as the caller is sufficiently privileged over
> the relevant object (for pidfds, the pid must be resolvable in the
> caller's pid namespace; for nsfs, the caller must be located in the
> namespace or be privileged over its owning user namespace).
>
> File handles for namespaces have the following uapi:
>
> struct nsfs_file_handle {
> __u64 ns_id;
> __u32 ns_type;
> __u32 ns_inum;
> };
>
> #define NSFS_FILE_HANDLE_SIZE_VER0 16 /* sizeof first published struct */
> #define NSFS_FILE_HANDLE_SIZE_LATEST sizeof(struct nsfs_file_handle) /* sizeof latest published struct */
>
> and it is explicitly allowed to generate such file handles manually in
> userspace. When the kernel generates a namespace file handle via
> name_to_handle_at() it will return ns_id, ns_type, and ns_inum, but
> userspace is allowed to provide the kernel with a laxer file handle
> where only the ns_id is filled in and ns_type and ns_inum are zero - at
> least after this patch series.
>
> So for your case, where you even know the inode number, ns type, and ns
> id, you can fill in a struct nsfs_file_handle and either look at my
> reply to Josef or at the (ugly) tests.
>
> fd = open_by_handle_at(FD_NSFS_ROOT, file_handle, O_RDONLY);
>
> and you can open the namespace (provided it is still active).
>
> >
> > extern const void net_namespace_list __ksym;
> > static void list_all_netns()
> > {
> > struct list_head *nslist =
> > bpf_core_cast(&net_namespace_list, struct list_head);
> >
> > struct list_head *iter = nslist->next;
> >
> > bpf_repeat(1024) {
>
> This isn't needed anymore. I've implemented it in a bpf-friendly way so
> it's possible to add kfuncs that would allow you to iterate through the
> various namespace trees (locklessly).
>
> If this is merged then I'll likely design that bpf part myself.
Excellent, thanks for the detailed explanation, noted! Well, I guess I have to
keep a closer eye on recent ns changes; I was aware of pidfs but not the
helpers you just mentioned.
>
> > After this is merged, do you see any chance for backports? Does it
> > rely on recent bits which are hard/impossible to backport? I'm not
> > aware of backported syscalls, but this would be really nice to see in
> > older kernels.
>
> Uhm, what downstream entities managing kernels do is not my concern, but
> for upstream it's certainly not an option. There's a lot of preparatory
> work that would have to be backported.
I was curious about the upstream option, but I see this isn't feasible. Anyway,
it's great that we will have this in the future, thanks for doing it!
Ferenc