linux-kernel - Re: [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility


Message-ID: <20251215-zuzug-unklug-2b0ac36d882b@brauner>
Date: Mon, 15 Dec 2025 12:30:39 +0100
From: Christian Brauner <brauner@...nel.org>
To: Dan Klishch <danilklishch@...il.com>
Cc: legion@...nel.org, containers@...ts.linux-foundation.org, 
	ebiederm@...ssion.com, keescook@...omium.org, linux-fsdevel@...r.kernel.org, 
	linux-kernel@...r.kernel.org, viro@...iv.linux.org.uk
Subject: Re: [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount
 visibility

On Sun, Dec 14, 2025 at 01:02:54PM -0500, Dan Klishch wrote:
> On 12/14/25 11:40 AM, Alexey Gladkov wrote:
> > But then, if I understand you correctly, this patch will not be enough
> > for you. procfs with subset=pid will not allow you to have /proc/meminfo,
> > /proc/cpuinfo, etc.
> 
> Hmm, I didn't think of this. sunwalker-box only exposes cpuinfo and PID
> tree to the sandboxed programs (empirically, this is enough for most of
> the programs you want to sandbox). With that in mind, this patch and a
> FUSE providing an overlay with cpuinfo / seccomp intercepting opens of
> /proc/cpuinfo / a small kernel patch with a new mount option for procfs
> to expose more static files still look like a clean solution to me.

The standard way of making it possible to mount procfs inside of a
container with a separate mount namespace that has a procfs inside it
with overmounted entries is to ensure that a fully-visible procfs
instance is present. This is for example what Incus does when nesting
containers is enabled. In systemd I implemented the same logic years
ago:

commit b71a0192c040f585397cfc6fc2ca025bf839733d
Author:     Christian Brauner <brauner@...nel.org>
AuthorDate: Mon Nov 28 12:36:47 2022 +0100
Commit:     Christian Brauner (Microsoft) <brauner@...nel.org>
CommitDate: Mon Dec 5 18:34:25 2022 +0100

    nspawn: mount temporary visible procfs and sysfs instance

    In order to mount procfs and sysfs in an unprivileged container the
    kernel requires that a fully visible instance is already present in the
    target mount namespace. Mount one here so the inner child can mount its
    own instances. Later we umount the temporary instances created here
    before we actually exec the payload. Since the rootfs is shared the
    umount will propagate into the container. Note, the inner child wouldn't
    be able to unmount the instances on its own since it doesn't own the
    originating mount namespace. IOW, the outer child needs to do this.

    So far nspawn didn't run into this issue because it used MS_MOVE which
    meant that the shadow mount tree pinned a procfs and sysfs instance
    which the kernel would find. The shadow mount tree is gone with proper
    pivot_root() semantics.

    Signed-off-by: Christian Brauner (Microsoft) <brauner@...nel.org>
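
As a rough userspace illustration of the requirement described in the
commit message above, the following Python sketch scans mountinfo text
for procfs mounts that expose the filesystem root and have no mounts
stacked on top of them. It is only an approximation under stated
assumptions: the kernel's real check is more involved (for example, it
also looks at mount flags), and the function name fully_visible_procfs
and the sample mount tables are made up for illustration.

```python
def fully_visible_procfs(mountinfo_text):
    """Return mount points of procfs instances that look fully visible.

    Simplified heuristic (an assumption, not the kernel's algorithm):
    a procfs mount counts as fully visible if it exposes the filesystem
    root ("/") and no other mount has it as its parent.
    """
    mounts = []
    for line in mountinfo_text.splitlines():
        if not line.strip():
            continue
        # mountinfo layout: "<id> <parent> <maj:min> <root> <mountpoint>
        # <options> [optional fields] - <fstype> <source> <super options>"
        before, _, after = line.partition(" - ")
        head = before.split()
        tail = after.split()
        mounts.append((int(head[0]), int(head[1]), head[3], head[4], tail[0]))

    # Any mount ID that appears as a parent has something mounted on it.
    parents = {parent for _, parent, _, _, _ in mounts}
    return [
        mountpoint
        for mount_id, _, root, mountpoint, fstype in mounts
        if fstype == "proc" and root == "/" and mount_id not in parents
    ]


# Hypothetical mount table: /proc has an entry overmounted, so it is masked.
MASKED = (
    "1 0 8:1 / / rw - ext4 /dev/sda1 rw\n"
    "22 1 0:21 / /proc rw,nosuid,nodev,noexec shared:12 - proc proc rw\n"
    "23 22 0:5 /null /proc/kcore rw,nosuid - devtmpfs devtmpfs rw\n"
)

# Same table plus a second, untouched procfs instance (hypothetical path).
NESTABLE = MASKED + "30 1 0:40 / /run/visible-proc rw - proc proc rw\n"

if __name__ == "__main__":
    with open("/proc/self/mountinfo") as f:
        print(fully_visible_procfs(f.read()))
```

With the MASKED table the function finds no candidate, while the
NESTABLE table yields the extra instance, which mirrors the role of the
temporary instance that the outer child mounts before the inner child
mounts its own.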

> 
> >> Also, correct me if I am wrong, installing ebpf controller requires
> >> CAP_BPF in initial userns, so rootless podman will not be able to mask
> >> /proc "properly" even if someone sends a patch switching it to ebpf.

The container needs to inherit a fully-visible instance somehow if you
want nesting. Using an unprivileged LSM such as landlock to prevent any
access to the fully visible procfs instance is usually the better way.

My hope is that once signed bpf is more widely adopted, distros will
start enabling blessed bpf programs that take over access protection
from the clumsy bind-mount protection mechanism.