linux-kernel - Re: PID namespace init releases its file locks before its children die


Message-ID: <d2fa498d-1acf-4c92-ae8c-2d91be1449df@gmail.com>
Date: Fri, 3 Oct 2025 13:09:27 -0400
From: Demi Marie Obenour <demiobenour@...il.com>
To: Oleg Nesterov <oleg@...hat.com>, Christian Brauner <brauner@...nel.org>,
 Mateusz Guzik <mjguzik@...il.com>
Cc: Linux kernel mailing list <linux-kernel@...r.kernel.org>,
 Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: PID namespace init releases its file locks before its children
 die

On 10/3/25 08:38, Oleg Nesterov wrote:
> Add CCs.
> 
> I can't really help, just my 2 cents...
> 
> I don't think we can change do_exit() to call exit_files() after
> exit_notify().

Not surprised.

> At first glance, technically it is possible to change do_exit() so
> that the exiting reaper does zap_pid_ns_processes() earlier... But
> even if this is possible, I think that this complication needs more
> justification.

I have a service that must not be run more than once concurrently.
I'm using s6 [1] as the service manager.  s6 doesn't support cgroups,
but it does support running the child in a PID namespace.
I was hoping that if the init process in the PID namespace took an
exclusive file lock, all the children in the PID namespace would be
guaranteed to have stopped running before the lock was released.
Unfortunately, with the current implementation that is not the case.

Right now, I'm leaking the file descriptor into the child processes and
relying on them to not close it.  This is somewhat fragile, though.
For instance, anything using GSubprocess breaks this assumption.
GSubprocess closes all file descriptors not explicitly passed into
the child.
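
To illustrate why leaking the descriptor works at all, here is a minimal
sketch, assuming flock()-style locks (the kind of lock in use is not
stated above). Such a lock belongs to the open file description, so it
stays held until every inherited copy of the descriptor is closed. The
names lock_is_free() and demo() are made up for this illustration.

```python
import fcntl
import os
import tempfile


def lock_is_free(path):
    """Return True if an exclusive flock on path can be taken right now."""
    # A fresh open() creates a new open file description, so this
    # attempt conflicts with a lock held through any other description.
    fd = os.open(path, os.O_RDWR)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        return False
    else:
        fcntl.flock(fd, fcntl.LOCK_UN)
        return True
    finally:
        os.close(fd)


def demo():
    """Show that the lock outlives the locker's own fd while a child holds a copy."""
    tmp_fd, path = tempfile.mkstemp()
    os.close(tmp_fd)

    lock_fd = os.open(path, os.O_RDWR)
    fcntl.flock(lock_fd, fcntl.LOCK_EX)

    # Pipe used only to keep the child alive until the parent is done checking.
    release_r, release_w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: inherited lock_fd across fork() and never closes it.
        os.close(release_w)
        os.read(release_r, 1)  # returns at EOF when the parent closes its end
        os._exit(0)

    os.close(release_r)
    os.close(lock_fd)  # the locker drops its own copy of the descriptor
    held_while_child_alive = not lock_is_free(path)

    os.close(release_w)  # let the child exit
    os.waitpid(pid, 0)
    free_after_child_exit = lock_is_free(path)

    os.unlink(path)
    return held_while_child_alive, free_after_child_exit


if __name__ == "__main__":
    print(demo())
```

The same mechanism is what makes this fragile: a child that closes the
inherited descriptor (or a spawner that closes everything not explicitly
passed) silently stops contributing to the lock. Note also that Python
marks descriptors from os.open() as non-inheritable across exec, so a
real exec'd child would need os.set_inheritable(lock_fd, True).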

It is definitely possible to implement this with cgroups: wait for
the cgroup to become empty before spawning the child.  It is also
possible for the supervisor to ensure that the child is dead before
spawning a new one, though s6's architecture makes this non-trivial.
The parent of the child is not PID 1, so it would need to inform
PID 1 to kill the child (and wait for it) if the actual supervisor
dies.
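
As a rough sketch of the cgroup variant, assuming cgroup v2 and a
dedicated cgroup for the service: the cgroup.events file in a cgroup
directory contains a "populated" key that is 1 while the cgroup or its
descendants contain live processes and 0 otherwise. The helper names and
the polling loop below are illustrative only. A real supervisor could
instead wait for change notifications on cgroup.events.

```python
import time
from pathlib import Path


def cgroup_is_populated(cgroup_dir):
    """Return True if cgroup.events reports live processes in the cgroup."""
    events = Path(cgroup_dir, "cgroup.events").read_text()
    for line in events.splitlines():
        key, _, value = line.partition(" ")
        if key == "populated":
            return value.strip() == "1"
    raise ValueError("no 'populated' key in cgroup.events")


def wait_until_empty(cgroup_dir, poll_interval=0.1, timeout=None):
    """Poll until the cgroup is empty; return False if the timeout expires first."""
    deadline = None if timeout is None else time.monotonic() + timeout
    while cgroup_is_populated(cgroup_dir):
        if deadline is not None and time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
    return True
```

The supervisor would call wait_until_empty() on the service's cgroup
directory (a path of its own choosing) before spawning the next instance.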

> Oleg.
> 
> On 10/02, Demi Marie Obenour wrote:
>>
>> I noticed that PID 1 in a PID namespace can release file locks (due
>> to exiting) while its children are still running for a bit.  If the
>> locks held by PID 1 were relied upon to serialize the execution of its
>> child processes, this could result in data corruption.
>>
>> Specifically, the child processes are killed via exit_notify() ->
>> forget_original_parent() -> find_child_reaper() ->
>> zap_pid_ns_processes().  That comes *after* exit_files(), which
>> releases the file locks.
>>
>> While it is possible to implement this with cgroups, cgroups
>> are quite a bit more complicated to use, at least compared to
>> a single call to unshare() before fork().
>>
>> Is this intentional?  Changing the behavior would make supervision
>> trees significantly easier to properly implement.
>> --
>> Sincerely,
>> Demi Marie Obenour (she/her/hers)
> 

-- 
Sincerely,
Demi Marie Obenour (she/her/hers)