Message-ID: <a06ceeb57ba62aeb6df00bd49faad1bb5073321c.camel@fejes.dev>
Date: Mon, 27 Oct 2025 11:49:20 +0100
From: Ferenc Fejes <ferenc@...es.dev>
To: Christian Brauner <brauner@...nel.org>
Cc: linux-fsdevel@...r.kernel.org, Josef Bacik <josef@...icpanda.com>, Jeff
Layton <jlayton@...nel.org>, Jann Horn <jannh@...gle.com>, Mike Yuan
<me@...dnzj.com>, Zbigniew Jędrzejewski-Szmek
<zbyszek@...waw.pl>, Lennart Poettering <mzxreary@...inter.de>, Daan De
Meyer <daan.j.demeyer@...il.com>, Aleksa Sarai <cyphar@...har.com>, Amir
Goldstein <amir73il@...il.com>, Tejun Heo <tj@...nel.org>, Johannes Weiner
<hannes@...xchg.org>, Thomas Gleixner <tglx@...utronix.de>, Alexander Viro
<viro@...iv.linux.org.uk>, Jan Kara <jack@...e.cz>,
linux-kernel@...r.kernel.org, cgroups@...r.kernel.org, bpf@...r.kernel.org,
Eric Dumazet <edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>,
netdev@...r.kernel.org, Arnd Bergmann <arnd@...db.de>
Subject: Re: [PATCH RFC DRAFT 00/50] nstree: listns()
On Fri, 2025-10-24 at 16:50 +0200, Christian Brauner wrote:
> > > Add a new listns() system call that allows userspace to iterate through
> > > namespaces in the system. This provides a programmatic interface to
> > > discover and inspect namespaces, enhancing existing namespace APIs.
> > >
> > > Currently, there is no direct way for userspace to enumerate namespaces
> > > in the system. Applications must resort to scanning /proc/<pid>/ns/
> > > across all processes, which is:
> > >
> > > 1. Inefficient - requires iterating over all processes
> > > 2. Incomplete - misses inactive namespaces that aren't attached to any
> > > running process but are kept alive by file descriptors, bind mounts,
> > > or parent namespace references
> > > 3. Permission-heavy - requires access to /proc for many processes
> > > 4. No ordering or ownership.
> > > 5. No filtering per namespace type: Must always iterate and check all
> > > namespaces.
> > >
> > > The list goes on. The listns() system call solves these problems by
> > > providing direct kernel-level enumeration of namespaces. It is similar
> > > to listmount() but obviously tailored to namespaces.
> >
> > I've been waiting for such an API for years; thanks for working on it.
> > I mostly deal with network namespaces, where points 2 and 3 are
> > especially painful.
> >
> > Recently, I've used this eBPF snippet to discover network namespaces
> > (at most 1024, because of the verifier's halt checking), even if no
> > process is attached. But I can't do anything with it in userspace since
> > it's not possible to pass the inode number or netns cookie value to
> > setns()...
>
> I've mentioned it in the cover letter and in my earlier reply to Josef:
>
> On v6.18+ kernels it is possible to generate and open file handles to
> namespaces. This is probably an API that people outside of fs/ proper
> aren't all that familiar with.
>
> In essence it allows you to refer to files - or, more generally, kernel
> objects that may be referenced via files - via opaque handles instead of
> paths.
>
> For regular filesystems that are multi-instance (IOW, you can have
> multiple btrfs or ext4 filesystems mounted), such file handles cannot be
> used without providing a file descriptor to another object in the
> filesystem that is used to resolve the file handle...
>
> However, for single-instance filesystems like pidfs and nsfs that's not
> required which is why I added:
>
> FD_PIDFS_ROOT
> FD_NSFS_ROOT
>
> which means that you can open both pidfds and namespaces via
> open_by_handle_at() purely based on the file handle. I call such file
> handles "exhaustive file handles" because they fully describe the object
> so it is resolvable without any further information.
>
> They are also not subject to the capable(CAP_DAC_READ_SEARCH) permission
> check that regular file handles are, and so can be used even by
> unprivileged code as long as the caller is sufficiently privileged over
> the relevant object (for pidfds, the pid must be resolvable in the
> caller's pid namespace; for nsfs, the caller must be located in the
> namespace or be privileged over its owning user namespace).
>
> File handles for namespaces have the following uapi:
>
> struct nsfs_file_handle {
> __u64 ns_id;
> __u32 ns_type;
> __u32 ns_inum;
> };
>
> #define NSFS_FILE_HANDLE_SIZE_VER0 16 /* sizeof first published struct */
> #define NSFS_FILE_HANDLE_SIZE_LATEST sizeof(struct nsfs_file_handle) /* sizeof latest published struct */
>
> and it is explicitly allowed to generate such file handles manually in
> userspace. When the kernel generates a namespace file handle via
> name_to_handle_at() it will return ns_id, ns_type, and ns_inum, but
> userspace is allowed to provide the kernel with a laxer file handle
> where only the ns_id is filled in and ns_type and ns_inum are zero - at
> least after this patch series.
>
> So for your case, where you even know the inode number, ns type, and ns
> id, you can fill in a struct nsfs_file_handle and either look at my
> reply to Josef or at the (ugly) tests.
>
> fd = open_by_handle_at(FD_NSFS_ROOT, file_handle, O_RDONLY);
>
> and you can open the namespace (provided it is still active).
>
> >
> > extern const void net_namespace_list __ksym;
> > static void list_all_netns()
> > {
> > struct list_head *nslist =
> > bpf_core_cast(&net_namespace_list, struct list_head);
> >
> > struct list_head *iter = nslist->next;
> >
> > bpf_repeat(1024) {
>
> This isn't needed anymore. I've implemented it in a bpf-friendly way so
> it's possible to add kfuncs that would allow you to iterate through the
> various namespace trees (locklessly).
>
> If this is merged then I'll likely design that bpf part myself.
Excellent, thanks for the detailed explanation, noted! Well, I guess I have to
keep a closer eye on recent ns changes; I was aware of pidfs but not the
helpers you just mentioned.
>
> > After this is merged, do you see any chance for backports? Does it
> > rely on recent bits which are hard/impossible to backport? I'm not
> > aware of backported syscalls, but this would be really nice to see in
> > older kernels.
>
> Uhm, what downstream entities managing kernels do is not my concern, but
> for upstream it's certainly not an option. There's a lot of preparatory
> work that would have to be backported.
I was curious about the upstream option, but I see this isn't feasible. Anyway,
it's great that we will have this in the future, thanks for doing it!
Ferenc