Message-ID: <20251215-zuzug-unklug-2b0ac36d882b@brauner>
Date: Mon, 15 Dec 2025 12:30:39 +0100
From: Christian Brauner <brauner@...nel.org>
To: Dan Klishch <danilklishch@...il.com>
Cc: legion@...nel.org, containers@...ts.linux-foundation.org,
ebiederm@...ssion.com, keescook@...omium.org, linux-fsdevel@...r.kernel.org,
linux-kernel@...r.kernel.org, viro@...iv.linux.org.uk
Subject: Re: [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility

On Sun, Dec 14, 2025 at 01:02:54PM -0500, Dan Klishch wrote:
> On 12/14/25 11:40 AM, Alexey Gladkov wrote:
> > But then, if I understand you correctly, this patch will not be enough
> > for you. procfs with subset=pid will not allow you to have /proc/meminfo,
> > /proc/cpuinfo, etc.
>
> Hmm, I didn't think of this. sunwalker-box only exposes cpuinfo and PID
> tree to the sandboxed programs (empirically, this is enough for most of
> programs you want sandboxing for). With that in mind, this patch and a
> FUSE providing an overlay with cpuinfo / seccomp intercepting opens of
> /proc/cpuinfo / a small kernel patch with a new mount option for procfs
> to expose more static files still look like a clean solution to me.

The standard way to allow mounting procfs inside a container whose
mount namespace only contains a procfs with overmounted entries is to
ensure that a fully visible procfs instance is also present. This is,
for example, what Incus does when container nesting is enabled. I
implemented the same logic in systemd years ago:

commit b71a0192c040f585397cfc6fc2ca025bf839733d
Author:     Christian Brauner <brauner@...nel.org>
AuthorDate: Mon Nov 28 12:36:47 2022 +0100
Commit:     Christian Brauner (Microsoft) <brauner@...nel.org>
CommitDate: Mon Dec 5 18:34:25 2022 +0100

    nspawn: mount temporary visible procfs and sysfs instance

    In order to mount procfs and sysfs in an unprivileged container the
    kernel requires that a fully visible instance is already present in the
    target mount namespace. Mount one here so the inner child can mount its
    own instances. Later we umount the temporary instances created here
    before we actually exec the payload. Since the rootfs is shared the
    umount will propagate into the container. Note, the inner child wouldn't
    be able to unmount the instances on its own since it doesn't own the
    originating mount namespace. IOW, the outer child needs to do this.

    So far nspawn didn't run into this issue because it used MS_MOVE which
    meant that the shadow mount tree pinned a procfs and sysfs instance
    which the kernel would find. The shadow mount tree is gone with proper
    pivot_root() semantics.

    Signed-off-by: Christian Brauner (Microsoft) <brauner@...nel.org>

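As an illustration of the "fully visible instance" requirement described in the commit message above, the following is a minimal Python sketch that inspects mountinfo-formatted text and reports whether some procfs mount looks fully visible. It is an approximation under stated assumptions, not the kernel's actual check: it treats any child mount on top of a procfs instance as hiding part of it, whereas the kernel's rules are more fine-grained. The function names `parse_mountinfo` and `has_fully_visible_procfs` are hypothetical and exist only for this example.

```python
def parse_mountinfo(text):
    """Parse text in /proc/<pid>/mountinfo format into a list of dicts.

    Only the fields needed for the check below are kept.
    """
    mounts = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # A lone "-" field separates the per-mount fields (left) from
        # the filesystem type, source, and super options (right).
        left, _, right = line.partition(" - ")
        lf = left.split()
        rf = right.split()
        mounts.append({
            "id": int(lf[0]),        # mount ID
            "parent": int(lf[1]),    # ID of the parent mount
            "root": lf[3],           # root of this mount within its filesystem
            "mount_point": lf[4],    # mount point in this namespace
            "fstype": rf[0],         # filesystem type, e.g. "proc"
        })
    return mounts


def has_fully_visible_procfs(mounts):
    """Return True if some procfs mount exposes the filesystem root
    and has no child mounts on top of it.

    Assumption: any child mount counts as masking. This is stricter
    than the kernel, so a False result here does not prove that the
    kernel would refuse a new procfs mount.
    """
    for m in mounts:
        if m["fstype"] != "proc" or m["root"] != "/":
            # Not procfs, or only a bind mount of a procfs subtree.
            continue
        children = [c for c in mounts if c["parent"] == m["id"]]
        if not children:
            return True
    return False
```

On a Linux system it could be used as `has_fully_visible_procfs(parse_mountinfo(open("/proc/self/mountinfo").read()))` from inside the container's mount namespace. A False result suggests that only masked or partial procfs instances are present, which is the situation the temporary mount in the commit above is meant to avoid.
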
>
> >> Also, correct me if I am wrong, installing ebpf controller requires
> >> CAP_BPF in initial userns, so rootless podman will not be able to mask
> >> /proc "properly" even if someone sends a patch switching it to ebpf.

The container needs to inherit a fully visible instance somehow if you
want nesting. Using an unprivileged LSM such as Landlock to prevent any
access to the fully visible procfs instance is usually the better way.
My hope is that once signed bpf is more widely adopted, distros will
start enabling blessed bpf programs that take over the access protection
from the clumsy bind-mount protection mechanism.