Message-ID: <aJ9yBuDnUu2jIgYT@slm.duckdns.org>
Date: Fri, 15 Aug 2025 07:44:38 -1000
From: Tejun Heo <tj@...nel.org>
To: Chen Ridong <chenridong@...weicloud.com>
Cc: hannes@...xchg.org, mkoutny@...e.com, lizefan@...wei.com,
	cgroups@...r.kernel.org, linux-kernel@...r.kernel.org,
	lujialin4@...wei.com, chenridong@...wei.com, hdanton@...a.com
Subject: Re: [PATCH v3] cgroup: drain specific subsystems when
 mounting/destroying a root

Hello, Chen.

On Fri, Aug 15, 2025 at 07:05:18AM +0000, Chen Ridong wrote:
> From: Chen Ridong <chenridong@...wei.com>
> 
> A hung task can occur during LTP cgroup testing [1] when repeatedly
> mounting/unmounting perf_event and net_prio controllers with
> systemd.unified_cgroup_hierarchy=1. The hang manifests in
> cgroup_lock_and_drain_offline() during root destruction.
> 
> Related test cases:
> cgroup_fj_function_perf_event cgroup_fj_function.sh perf_event
> cgroup_fj_function_net_prio cgroup_fj_function.sh net_prio
> 
> Call Trace:
> 	cgroup_lock_and_drain_offline+0x14c/0x1e8
> 	cgroup_destroy_root+0x3c/0x2c0
> 	css_free_rwork_fn+0x248/0x338
> 	process_one_work+0x16c/0x3b8
> 	worker_thread+0x22c/0x3b0
> 	kthread+0xec/0x100
> 	ret_from_fork+0x10/0x20
> 
> Root Cause:
> 
> CPU0                            CPU1
> mount perf_event                umount net_prio
> cgroup1_get_tree                cgroup_kill_sb
> rebind_subsystems               // root destruction enqueues
> 				// cgroup_destroy_wq
> // kill all perf_event css
>                                 // one perf_event css A is dying
>                                 // css A offline enqueues cgroup_destroy_wq
>                                 // root destruction will be executed first
>                                 css_free_rwork_fn
>                                 cgroup_destroy_root
>                                 cgroup_lock_and_drain_offline
>                                 // some perf descendants are dying
>                                 // cgroup_destroy_wq max_active = 1
>                                 // waiting for css A to die
> 
> Problem scenario:
> 1. CPU0 mounts perf_event (rebind_subsystems)
> 2. CPU1 unmounts net_prio (cgroup_kill_sb), queuing root destruction work
> 3. A dying perf_event CSS gets queued for offline after root destruction
> 4. Root destruction waits for offline completion, but offline work is
>    blocked behind root destruction in cgroup_destroy_wq (max_active=1)

Thanks for the analysis. So this is caused by the css free path waiting for
css offline.
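
To spell out the generic shape of the deadlock for anyone following along:
a work item on an ordered (max_active == 1) workqueue ends up waiting for
another item queued behind it on the same workqueue.  A minimal toy module
that reproduces just that pattern - all names made up, deliberately not the
cgroup code - would look something like this:

/*
 * Self-deadlock sketch: work A waits for work B, but both are queued on
 * the same max_active == 1 workqueue, so B cannot start until A finishes.
 * Same shape as root destruction waiting for a dying css whose offline
 * work sits behind it on cgroup_destroy_wq.
 */
#include <linux/module.h>
#include <linux/workqueue.h>
#include <linux/completion.h>

static struct workqueue_struct *single_wq;
static struct work_struct work_a, work_b;
static DECLARE_COMPLETION(b_done);

static void work_b_fn(struct work_struct *work)
{
        complete(&b_done);              /* never runs while A occupies the wq */
}

static void work_a_fn(struct work_struct *work)
{
        /* A holds the only execution slot; this wait never finishes. */
        wait_for_completion(&b_done);
}

static int __init wq_deadlock_demo_init(void)
{
        single_wq = alloc_workqueue("wq_deadlock_demo", 0, 1);
        if (!single_wq)
                return -ENOMEM;

        INIT_WORK(&work_a, work_a_fn);
        INIT_WORK(&work_b, work_b_fn);

        queue_work(single_wq, &work_a); /* ~ root destruction */
        queue_work(single_wq, &work_b); /* ~ css A's offline work */
        return 0;
}
module_init(wq_deadlock_demo_init);     /* no module_exit: the wq is wedged */
MODULE_DESCRIPTION("ordered workqueue self-deadlock demo");
MODULE_LICENSE("GPL");

Loading that wedges the demo workqueue the same way the LTP sequence wedges
cgroup_destroy_wq.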

> Solution:
> Introduce an ss_mask argument for cgroup_lock_and_drain_offline() so that it
> can selectively drain specific subsystems rather than all subsystems.
> 
> There are two primary scenarios requiring offline draining:
> 1. Root Operations - Draining all subsystems in cgrp_dfl_root when mounting
>    or destroying a cgroup root
> 2. Draining a specific cgroup when modifying cgroup.subtree_control or
>    cgroup.threads
> 
> For case 1 (Root Operations), we only need to drain the specific subsystem
> being mounted/destroyed, not all subsystems. The rationale for draining
> cgrp_dfl_root is explained in [2].
> 
> For case 2, it is enough to drain the subsystems enabled in the cgroup.
> Since other subsystems cannot have descendants in this cgroup, adding
> ss_mask should not cause any harm.
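
If I'm reading the proposal right, the drain loop ends up gated on the mask,
roughly like the sketch below.  This is my own paraphrase from memory, not
code taken from the patch, so treat the exact iterators and fields as
approximate:

void cgroup_lock_and_drain_offline(struct cgroup *cgrp, u16 ss_mask)
        __acquires(&cgroup_mutex)
{
        struct cgroup *dsct;
        struct cgroup_subsys_state *d_css;
        struct cgroup_subsys *ss;
        int ssid;

restart:
        cgroup_lock();

        cgroup_for_each_live_descendant_post(dsct, d_css, cgrp) {
                /* only wait for dying css'es of the subsystems in ss_mask */
                do_each_subsys_mask(ss, ssid, ss_mask) {
                        struct cgroup_subsys_state *css = cgroup_css(dsct, ss);
                        DEFINE_WAIT(wait);

                        if (!css || !percpu_ref_is_dying(&css->refcnt))
                                continue;

                        cgroup_get_live(dsct);
                        prepare_to_wait(&dsct->offline_waitq, &wait,
                                        TASK_UNINTERRUPTIBLE);

                        cgroup_unlock();
                        schedule();
                        finish_wait(&dsct->offline_waitq, &wait);

                        cgroup_put(dsct);
                        cgroup_lock();
                        goto restart;
                } while_each_subsys_mask();
        }
}

with the root mount/destroy paths passing only the subsystems of the
hierarchy in question and the subtree_control/threads paths passing the
cgroup's enabled mask, as described above.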

Hmm... this seems a bit fragile. Would splitting cgroup_destroy_wq into two
separate workqueues - e.g. cgroup_offline_wq and cgroup_free_wq - work?
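
Something along these lines - a completely untested sketch, the init
function name and call sites are only there to illustrate the idea:

/*
 * Sketch only: keep both workqueues ordered like cgroup_destroy_wq is
 * today, but put the offline and free paths on separate queues so a
 * free-side item can never sit in front of the offline work it waits for.
 */
static struct workqueue_struct *cgroup_offline_wq;
static struct workqueue_struct *cgroup_free_wq;

static int __init cgroup_destroy_wq_split_init(void)
{
        cgroup_offline_wq = alloc_workqueue("cgroup_offline", 0, 1);
        cgroup_free_wq = alloc_workqueue("cgroup_free", 0, 1);
        BUG_ON(!cgroup_offline_wq || !cgroup_free_wq);
        return 0;
}
core_initcall(cgroup_destroy_wq_split_init);

The offline side (css_killed_ref_fn() and friends) would then queue onto
cgroup_offline_wq while the release/free side (css_free_rwork_fn() from your
trace) queues onto cgroup_free_wq, so the work item doing root destruction
can no longer block the offline work it is waiting for.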

Thanks.

-- 
tejun
