linux-kernel - Re: [PATCH] cgroup: fix race between fork and cgroup.kill

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <20250204-willen-aufmachen-69e8a849a5a7@brauner>
Date: Tue, 4 Feb 2025 13:44:38 +0100
From: Christian Brauner <brauner@...nel.org>
To: Shakeel Butt <shakeel.butt@...ux.dev>
Cc: Tejun Heo <tj@...nel.org>, Johannes Weiner <hannes@...xchg.org>, 
	Michal Koutný <mkoutny@...e.com>, cgroups@...r.kernel.org, linux-kernel@...r.kernel.org, 
	Meta kernel team <kernel-team@...a.com>
Subject: Re: [PATCH] cgroup: fix race between fork and cgroup.kill

On Thu, Jan 30, 2025 at 04:05:42PM -0800, Shakeel Butt wrote:
> Tejun reported the following race between fork() and cgroup.kill at [1].
> 
> Tejun:
>   I was looking at cgroup.kill implementation and wondering whether there
>   could be a race window. So, __cgroup_kill() does the following:
> 
>    k1. Set CGRP_KILL.
>    k2. Iterate tasks and deliver SIGKILL.
>    k3. Clear CGRP_KILL.
> 
>   The copy_process() does the following:
> 
>    c1. Copy a bunch of stuff.
>    c2. Grab siglock.
>    c3. Check fatal_signal_pending().
>    c4. Commit to forking.
>    c5. Release siglock.
>    c6. Call cgroup_post_fork() which puts the task on the css_set and tests
>        CGRP_KILL.
> 
>   The intention seems to be that either a forking task gets SIGKILL and
>   terminates on c3 or it sees CGRP_KILL on c6 and kills the child. However, I
>   don't see what guarantees that k3 can't happen before c6. ie. After a
>   forking task passes c5, k2 can take place and then before the forking task
>   reaches c6, k3 can happen. Then, nobody would send SIGKILL to the child.
>   What am I missing?
> 
> This is indeed a race. One way to fix this race is by taking
> cgroup_threadgroup_rwsem in write mode in __cgroup_kill() as the fork()
> side takes cgroup_threadgroup_rwsem in read mode from cgroup_can_fork()
> to cgroup_post_fork(). However that would be heavy handed as this adds
> one more potential stall scenario for cgroup.kill which is usually
> called under extreme situation like memory pressure.
> 
> To fix this race, let's maintain a sequence number per cgroup which gets
> incremented on __cgroup_kill() call. On the fork() side, the
> cgroup_can_fork() will cache the sequence number locally and recheck it
> against the cgroup's sequence number at cgroup_post_fork() site. If the
> sequence numbers mismatch, it means __cgroup_kill() can been called and
> we should send SIGKILL to the newly created task.
> 
> Reported-by: Tejun Heo <tj@...nel.org>
> Closes: https://lore.kernel.org/all/Z5QHE2Qn-QZ6M-KW@slm.duckdns.org/ [1]
> Fixes: 661ee6280931 ("cgroup: introduce cgroup.kill")
> Signed-off-by: Shakeel Butt <shakeel.butt@...ux.dev>
> ---

Acked-by: Christian Brauner <brauner@...nel.org>