Message-ID: <20251215144600.911100-1-danilklishch@gmail.com>
Date: Mon, 15 Dec 2025 09:46:00 -0500
From: Dan Klishch <danilklishch@...il.com>
To: legion@...nel.org,
brauner@...nel.org
Cc: containers@...ts.linux-foundation.org,
ebiederm@...ssion.com,
keescook@...omium.org,
linux-fsdevel@...r.kernel.org,
linux-kernel@...r.kernel.org,
viro@...iv.linux.org.uk
Subject: Re: [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility
On 12/15/25 5:10 AM, Alexey Gladkov wrote:
> On Sun, Dec 14, 2025 at 01:02:54PM -0500, Dan Klishch wrote:
>> On 12/14/25 11:40 AM, Alexey Gladkov wrote:
>>> But then, if I understand you correctly, this patch will not be enough
>>> for you. procfs with subset=pid will not allow you to have /proc/meminfo,
>>> /proc/cpuinfo, etc.
>>
>> Hmm, I didn't think of this. sunwalker-box only exposes cpuinfo and PID
>> tree to the sandboxed programs (empirically, this is enough for most of
>> programs you want sandboxing for). With that in mind, this patch and a
>> FUSE providing an overlay with cpuinfo / seccomp intercepting opens of
>> /proc/cpuinfo / a small kernel patch with a new mount option for procfs
>> to expose more static files still look like a clean solution to me.
>
> I don't think you'll be able to do that. procfs doesn't allow itself to
> be overlayed [1]. What should block mounting overlayfs and fuse on top
> of procfs.
>
> [1] https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/proc/root.c#n274
This is why I have been careful not to say overlayfs. With [2] (warning:
zero-shot ChatGPT output), I can do:
$ ./fuse-overlay target --source=/proc
$ ls target
1 88 194 1374 889840 908552
2 90 195 1375 889987 908619
3 91 196 1379 890031 908658
4 92 203 1412 890063 908756
5 93 205 1590 890085 908804
6 94 233 1644 890139 908951
7 96 237 1802 890246 909848
8 97 239 1850 890271 909914
10 98 240 1852 894665 909924
13 99 243 1865 895854 909926
15 100 244 1888 895864 910005
16 102 246 1889 896030 acpi
17 103 262 1891 896205 asound
18 104 263 1895 896508 bus
19 105 264 1896 896544 driver
20 106 265 1899 896706 dynamic_debug
<...>
[2] https://gist.github.com/DanShaders/547eeb74a90315356b98472feae47474
This requires much more careful thought about magic symlinks and
permission checks. Since I am highly unlikely to reimplement the checks
and special behavior of procfs 100% correctly, I do not want to proceed
with the FUSE route.
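To make concrete what any overlay would have to supply on top of
subset=pid, here is a minimal sketch that splits a top-level procfs
listing into entries assumed to stay visible under subset=pid and
entries that would be hidden. The function name, the sample list, and
the rule itself (numeric PID directories plus self and thread-self) are
assumptions for illustration, not code taken from the kernel or from [2].

```python
# Assumption: under subset=pid only task-related entries remain visible at the
# top level of procfs, i.e. numeric PID directories plus "self"/"thread-self".
TASK_SYMLINKS = {"self", "thread-self"}


def split_proc_entries(names):
    """Split top-level /proc names into (kept under subset=pid, hidden)."""
    kept, hidden = [], []
    for name in names:
        is_pid_dir = name.isascii() and name.isdigit()
        if is_pid_dir or name in TASK_SYMLINKS:
            kept.append(name)
        else:
            hidden.append(name)
    return kept, hidden


# Illustrative sample only; a real listing comes from the mounted procfs.
sample = ["1", "2", "908552", "acpi", "asound", "bus", "cpuinfo", "meminfo", "self"]
kept, hidden = split_proc_entries(sample)
print("kept:", kept)
print("hidden:", hidden)
```

Under that assumption, cpuinfo lands in the hidden list, which is the
gap a FUSE layer, a seccomp hook, or a new mount option would have to fill.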
On 12/15/25 6:30 AM, Christian Brauner wrote:
> The standard way of making it possible to mount procfs inside of a
> container with a separate mount namespace that has a procfs inside it
> with overmounted entries is to ensure that a fully-visible procfs
> instance is present.
Yes, this is a solution. However, it is only marginally better than
passing --privileged to the outer container, in the sense that we require
the outer sandbox to remove some protections for the inner sandbox to work.
> The container needs to inherit a fully-visible instance somehow if you
> want nesting. Using an unprivileged LSM such as landlock to prevent any
> access to the fully visible procfs instance is usually the better way.
>
> My hope is that once signed bpf is more widely adopted that distros will
> just start enabling blessed bpf programs that will just take on the
> access protecting instead of the clumsy bind-mount protection mechanism.
These are big changes to container runtimes that are unlikely to happen
soon. In contrast, the patch we are discussing would be available to me
on Arch Linux 2 months after the merge, and on Ubuntu a couple of months
after that.
So, is there any way forward with the patch, or should I continue trying
to find a userspace solution?

Thanks,
Dan Klishch