
Message-ID: <CAEivzxdPDC+sgRDYuv+RG57_RX0+RAdRDJTy8L4Bi=MffHmCuA@mail.gmail.com>
Date: Tue, 25 Feb 2025 19:01:21 +0100
From: Aleksandr Mikhalitsyn <aleksandr.mikhalitsyn@...onical.com>
To: Michal Koutný <mkoutny@...e.com>
Cc: brauner@...nel.org, stgraber@...raber.org, tycho@...ho.pizza, 
	cyphar@...har.com, yun.zhou@...driver.com, joel.granados@...nel.org, 
	rostedt@...dmis.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2 0/2] pid_namespace: namespacify sysctl kernel.pid_max

On Thu, Jan 30, 2025 at 6:45 PM Michal Koutný <mkoutny@...e.com> wrote:
>
> Hello.

Dear Michal,

(responding in this thread too, as part of the discussion around the revert [1])

[1] https://lore.kernel.org/all/20250221170249.890014-1-mkoutny@suse.com/

>
> On Fri, Nov 22, 2024 at 02:24:57PM +0100, Alexander Mikhalitsyn <aleksandr.mikhalitsyn@...onical.com> wrote:
> >
> (Sorry for responding only now as I missed this until I read v6.14 news.)
>
> > The pid_max sysctl is a global value. For a long time the default value
> > has been 65535 and during the pidfd discussions Linus proposed to bump
> > pid_max by default (cf. [1]). Based on this discussion systemd started
> > bumping pid_max to 2^22. So all new systems now run with a very high
> > pid_max limit with some distros having also backported that change.
>
> Yes, multiple [1] people [2] proposed even lifting the legacy limit in
> kernel directly.
>
> > Of course, giving containers the ability to restrict the number of
> > processes in their respective pid namespace independent of the global limit
> > through pid_max is something desirable in itself and comes in handy in
> > general.
>
> Yes, this is what pids.max of a cgroup (already) does.

Not precisely: pids.max only limits the number of tasks in the cgroup,
while we are talking about a limit on the PID *value*.
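
To illustrate the difference, here is a toy model (not kernel code) of
cyclic PID allocation with a cap on the number of live tasks. The
function name and parameters are made up for this sketch, and it wraps
back to PID 1 for simplicity, which is an assumption of the model. It
shows that a small task count does not keep PID values small:

```python
def simulate_cyclic(pid_max, task_cap, spawns):
    """Toy model: hand out PIDs cyclically below pid_max while never
    keeping more than task_cap tasks alive (oldest task exits first).

    Returns (highest PID value seen, number of live tasks at the end).
    """
    if task_cap >= pid_max - 1:
        raise ValueError("task_cap must leave free PIDs below pid_max")

    live = []         # PIDs of live tasks, oldest first
    live_set = set()  # same PIDs, for fast membership checks
    next_pid = 1      # where the next search starts (hypothetical wrap point: 1)
    highest = 0

    for _ in range(spawns):
        if len(live) >= task_cap:
            # Count limit reached: the oldest task exits to make room.
            live_set.discard(live.pop(0))

        # Take the next free PID at or after next_pid, wrapping below pid_max.
        pid = next_pid
        while pid in live_set:
            pid = pid + 1 if pid + 1 < pid_max else 1

        live.append(pid)
        live_set.add(pid)
        highest = max(highest, pid)
        next_pid = pid + 1 if pid + 1 < pid_max else 1

    return highest, len(live)


if __name__ == "__main__":
    # Only 10 tasks are ever alive, yet PID values keep growing.
    print(simulate_cyclic(pid_max=4194304, task_cap=10, spawns=100000))
    # With a low pid_max, the values themselves are bounded.
    print(simulate_cyclic(pid_max=100, task_cap=10, spawns=1000))
```

In this model the count cap plays the role of pids.max and pid_max
bounds the values; only the latter keeps the numbers themselves low.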

>
> (It is already difficult for users to troubleshoot which of multiple pid
> limits restricts their workload. I'm afraid making pid_max
> per-(hierarchical-)NS will contribute to confusion.)
> Also, the implementation copies the limit from the parent upon
> creation. This pattern proved cumbersome with some attributes in legacy
> cgroup controllers, e.g. it's subject to a race condition between the
> parent's limit modification and child creation.

Yeah, but that was intentional, to keep this kernel change from becoming
too big and complex (and probably from slowing things down too).
Let's be honest: pid_max is the kind of setting that is rarely changed,
and people use cgroups for that kind of thing nowadays (and that is good!).

>
> > Independent of motivating use-cases the existence of pid namespaces
> > makes this also a good semantical extension and there have been prior
> > proposals pushing in a similar direction.
> > The trick here is to minimize the risk of regressions which I think is
> > doable. The fact that pid namespaces are hierarchical will help us here.
>
> I understand it is tempting to make pid_max part of a pid namespace but
> was the overlap with pids controller considered?

Of course. But as pointed out in the cover letter, this patch is not
meant to be a replacement for, or a suggested alternative to, the pids
controller. Obviously, cgroups are the best way to limit and control
resources. This is about making the existing pid_max limit namespace
aware to make user space happy. In the context of system containers
(LXC) this is a usual thing to do: we see some kernel global limit or
setting, consider whether it is safe to namespace it in some way, and
if it is safe and makes sense, we do it.

The second reason is that we have a real use case with 32-bit Android
Bionic libc, where we need to set a limit on the PID *value*. And here,
unfortunately, the pids controller does not help either.
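
As a rough sketch of what "limit on the PID value" means in practice,
the snippet below reads pid_max from /proc/sys/kernel/pid_max and checks
whether every PID that can be handed out fits into a given number of
bits. The 16-bit width used in the example is purely an illustrative
assumption (the exact bound the workload needs is not stated here), and
the helper names are made up:

```python
from pathlib import Path

# Standard procfs location of the sysctl.
PID_MAX_PATH = Path("/proc/sys/kernel/pid_max")


def parse_pid_max(text):
    """Parse the contents of the pid_max sysctl file."""
    return int(text.strip())


def read_pid_max(path=PID_MAX_PATH):
    """Read pid_max as seen by the calling process."""
    return parse_pid_max(Path(path).read_text())


def pid_values_fit(pid_max, bits):
    """PIDs of value pid_max or larger are not allocated, so the largest
    possible PID is pid_max - 1. Check that it fits into `bits` bits."""
    return pid_max - 1 < (1 << bits)


if __name__ == "__main__":
    limit = read_pid_max()
    # 16 is a hypothetical width chosen only for illustration.
    print(limit, pid_values_fit(limit, bits=16))
```

A cap on the task count cannot make such a check pass; only a lower
pid_max can.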

>
> I'd consider the alternative of relying on virtualized PID numbers in
> pid namespaces with appropriate pids.max limit and numbers allocation
> strategy that would keep PID values below the limit (i.e. taking the
> first free pid number in given NS, actually I thought it is already the
> case but it doesn't work like that (when I try now [3])).
> WDYT?
>
> TL;DR instead of getting rid of the legacy limit, it was further
> extended to pid namespaces because of legacy workloads and it (almost)
> duplicates existing mechanism. Can this be rethought please?

I hope I explained above why I believe that this does not duplicate an
existing mechanism.
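
For reference, the alternative raised in the quoted text (taking the
first free PID number in the namespace, combined with a cap on the task
count) can be modelled with the toy sketch below. This is not how
allocation behaves today, as noted in the quoted message, and the
function name and parameters are invented for illustration:

```python
def simulate_lowest_free(task_cap, spawns):
    """Toy model: always hand out the lowest free PID while never keeping
    more than task_cap tasks alive (oldest task exits first).

    Returns the highest PID value ever handed out.
    """
    live = []         # PIDs of live tasks, oldest first
    live_set = set()  # same PIDs, for fast membership checks
    highest = 0

    for _ in range(spawns):
        if len(live) >= task_cap:
            # Count limit reached: the oldest task exits to make room.
            live_set.discard(live.pop(0))

        # First-free strategy: scan upwards from 1 for an unused PID.
        pid = 1
        while pid in live_set:
            pid += 1

        live.append(pid)
        live_set.add(pid)
        highest = max(highest, pid)

    return highest


if __name__ == "__main__":
    # Under this strategy the values never exceed the task cap.
    print(simulate_lowest_free(task_cap=10, spawns=100000))
```

Under that hypothetical strategy a count limit would bound the values
as well, which is why the two limits look interchangeable in that model
but not with cyclic allocation.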

>
> Thanks,
> Michal

Kind regards,
Alex

>
> [1] https://lore.kernel.org/all/20240408145819.8787-1-mkoutny@suse.com/
> [2] https://lore.kernel.org/linux-api/CAHk-=wiZ40LVjnXSi9iHLE_-ZBsWFGCgdmNiYZUXn1-V5YBg2g@mail.gmail.com/