Message-ID: <33990c53-5d7a-4f15-81b4-661c7bb96937@huaweicloud.com>
Date: Sat, 16 Aug 2025 09:26:59 +0800
From: Chen Ridong <chenridong@...weicloud.com>
To: Tejun Heo <tj@...nel.org>
Cc: hannes@...xchg.org, mkoutny@...e.com, lizefan@...wei.com,
 cgroups@...r.kernel.org, linux-kernel@...r.kernel.org, lujialin4@...wei.com,
 chenridong@...wei.com, hdanton@...a.com
Subject: Re: [PATCH v3] cgroup: drain specific subsystems when
 mounting/destroying a root



On 2025/8/16 1:44, Tejun Heo wrote:
> Hello, Chen.
> 
> On Fri, Aug 15, 2025 at 07:05:18AM +0000, Chen Ridong wrote:
>> From: Chen Ridong <chenridong@...wei.com>
>>
>> A hung task can occur during [1] LTP cgroup testing when repeatedly
>> mounting/unmounting perf_event and net_prio controllers with
>> systemd.unified_cgroup_hierarchy=1. The hang manifests in
>> cgroup_lock_and_drain_offline() during root destruction.
>>
>> Related case:
>> cgroup_fj_function_perf_event cgroup_fj_function.sh perf_event
>> cgroup_fj_function_net_prio cgroup_fj_function.sh net_prio
>>
>> Call Trace:
>> 	cgroup_lock_and_drain_offline+0x14c/0x1e8
>> 	cgroup_destroy_root+0x3c/0x2c0
>> 	css_free_rwork_fn+0x248/0x338
>> 	process_one_work+0x16c/0x3b8
>> 	worker_thread+0x22c/0x3b0
>> 	kthread+0xec/0x100
>> 	ret_from_fork+0x10/0x20
>>
>> Root Cause:
>>
>> CPU0                            CPU1
>> mount perf_event                umount net_prio
>> cgroup1_get_tree                cgroup_kill_sb
>> rebind_subsystems               // root destruction enqueues
>> 				// cgroup_destroy_wq
>> // kill all perf_event css
>>                                 // one perf_event css A is dying
>>                                 // css A offline enqueues cgroup_destroy_wq
>>                                 // root destruction will be executed first
>>                                 css_free_rwork_fn
>>                                 cgroup_destroy_root
>>                                 cgroup_lock_and_drain_offline
>>                                 // some perf descendants are dying
>>                                 // cgroup_destroy_wq max_active = 1
>>                                 // waiting for css A to die
>>
>> Problem scenario:
>> 1. CPU0 mounts perf_event (rebind_subsystems)
>> 2. CPU1 unmounts net_prio (cgroup_kill_sb), queuing root destruction work
>> 3. A dying perf_event CSS gets queued for offline after root destruction
>> 4. Root destruction waits for offline completion, but offline work is
>>    blocked behind root destruction in cgroup_destroy_wq (max_active=1)
> 
> Thanks for the analysis, so this is caused by css free path waiting for css
> offline.
> 
>> Solution:
>> Introduce ss_mask for cgroup_lock_and_drain_offline() to selectively drain
>> specific subsystems rather than all subsystems.
>>
>> There are two primary scenarios requiring offline draining:
>> 1. Root Operations - Draining all subsystems in cgrp_dfl_root when mounting
>>    or destroying a cgroup root
>> 2. Draining specific cgroup when modifying cgroup.subtree_control or
>>    cgroup.threads
>>
>> For case 1 (Root Operations), it only needs to drain the specific subsystem
>> being mounted/destroyed, not all subsystems. The rationale for draining
>> cgrp_dfl_root is explained in [2].
>>
>> For case 2, it's enough to drain subsystems enabled in the cgroup. Since
>> other subsystems cannot have descendants in this cgroup, adding ss_mask
>> should not hurt.
> 
> Hmm... this seems a bit fragile. Would splitting cgroup_destroy_wq into two
> separate workqueues - e.g. cgroup_offline_wq and cgroup_free_wq - work?
> 
> Thanks.
> 

Hi Tj,

I've tested adding a dedicated cgroup_offline_wq workqueue for CSS offline
operations, and it resolves the current issue in my testing.

Going further, I propose we split cgroup_destroy_wq into three specialized
workqueues to better match the destruction lifecycle:

  cgroup_offline_wq - handles offline operations
  cgroup_release_wq - manages resource release
  cgroup_free_wq    - performs the final memory freeing

This explicit separation would clearly delineate responsibilities for each workqueue.
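
Roughly along these lines (untested sketch of the relevant hunks; the
workqueue names are tentative, and the queueing sites shown are the existing
css_killed_ref_fn(), css_release() and css_release_work_fn() paths):

static struct workqueue_struct *cgroup_offline_wq;
static struct workqueue_struct *cgroup_release_wq;
static struct workqueue_struct *cgroup_free_wq;

static int __init cgroup_wq_init(void)
{
	/* keep max_active = 1 per queue, as cgroup_destroy_wq has today */
	cgroup_offline_wq = alloc_workqueue("cgroup_offline", 0, 1);
	cgroup_release_wq = alloc_workqueue("cgroup_release", 0, 1);
	cgroup_free_wq = alloc_workqueue("cgroup_free", 0, 1);
	BUG_ON(!cgroup_offline_wq || !cgroup_release_wq || !cgroup_free_wq);
	return 0;
}
core_initcall(cgroup_wq_init);

/* in css_killed_ref_fn(): offline work gets its own queue */
	queue_work(cgroup_offline_wq, &css->destroy_work);

/* in css_release(): */
	queue_work(cgroup_release_wq, &css->destroy_work);

/* in css_release_work_fn(): */
	queue_rcu_work(cgroup_free_wq, &css->destroy_rwork);

With offline work on its own queue, the drain in cgroup_destroy_root() can no
longer end up waiting for an offline item that is queued behind the root
destruction work in the same single-threaded workqueue.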

What are your thoughts on this approach, Tejun?

-- 
Best regards,
Ridong

