Message-ID: <afc95938-0eb5-427b-a2dd-a7eccf54d891@huaweicloud.com>
Date: Fri, 15 Aug 2025 18:28:53 +0800
From: Chen Ridong <chenridong@...weicloud.com>
To: Hillf Danton <hdanton@...a.com>
Cc: Michal Koutny <mkoutny@...e.com>, tj@...nel.org, cgroups@...r.kernel.org,
linux-kernel@...r.kernel.org, lujialin4@...wei.com, chenridong@...wei.com,
gaoyingjie@...ontech.com
Subject: Re: [PATCH v2 -next] cgroup: remove offline draining in root
destruction to avoid hung_tasks
On 2025/8/15 18:02, Hillf Danton wrote:
> On Fri, 15 Aug 2025 15:29:56 +0800 Chen Ridong wrote:
>> On 2025/8/15 10:40, Hillf Danton wrote:
>>> On Fri, Jul 25, 2025 at 09:42:05AM +0800, Chen Ridong <chenridong@...weicloud.com> wrote:
>>>>> On Tue, Jul 22, 2025 at 11:27:33AM +0000, Chen Ridong <chenridong@...weicloud.com> wrote:
>>>>>> CPU0                                            CPU1
>>>>>> mount perf_event                                umount net_prio
>>>>>> cgroup1_get_tree                                cgroup_kill_sb
>>>>>> rebind_subsystems                               // root destruction enqueues
>>>>>>                                                 // cgroup_destroy_wq
>>>>>> // kill all perf_event css
>>>>>> // one perf_event css A is dying
>>>>>> // css A offline enqueues cgroup_destroy_wq
>>>>>> // root destruction will be executed first
>>>>>>                                                 css_free_rwork_fn
>>>>>>                                                 cgroup_destroy_root
>>>>>>                                                 cgroup_lock_and_drain_offline
>>>>>>                                                 // some perf descendants are dying
>>>>>>                                                 // cgroup_destroy_wq max_active = 1
>>>>>>                                                 // waiting for css A to die
>>>>>>
>>>>>> Problem scenario:
>>>>>> 1. CPU0 mounts perf_event (rebind_subsystems)
>>>>>> 2. CPU1 unmounts net_prio (cgroup_kill_sb), queuing root destruction work
>>>>>> 3. A dying perf_event CSS gets queued for offline after root destruction
>>>>>> 4. Root destruction waits for offline completion, but offline work is
>>>>>> blocked behind root destruction in cgroup_destroy_wq (max_active=1)
>>>>>
>>>>> What's concerning me is why the umount of the net_prio hierarchy waits for
>>>>> draining of the default hierarchy? (Where you then run into conflict with
>>>>> perf_event that's implicit_on_dfl.)
>>>>>
>>> /*
>>> * cgroup destruction makes heavy use of work items and there can be a lot
>>> * of concurrent destructions. Use a separate workqueue so that cgroup
>>> * destruction work items don't end up filling up max_active of system_wq
>>> * which may lead to deadlock.
>>> */
>>>
>>> If the task hang can be reliably reproduced, it is the right time to cut
>>> max_active off for cgroup_destroy_wq, according to its comment.
>>
>> Hi Danton,
>>
>> Thank you for your feedback.
>>
>> While modifying max_active could be a viable solution, I’m unsure whether it might introduce other
>> side effects. Instead, I’ve proposed an alternative approach in v3 of the patch, which I believe
>> addresses the issue more comprehensively.
>>
> Given your reproducer [1], it is simple to test with max_active cut.
>
> Frankly, I do not think v3 is a correct fix, because it leaves the root cause
> intact. Nor is the problem cgroup specific, even given the high concurrency in destruction.
>
> [1] https://lore.kernel.org/lkml/39e05402-40c7-4631-a87b-8e3747ceddc6@huaweicloud.com/
Hi Danton,
Thank you for your reply.
To clarify, when you mentioned "cut max_active off", did you mean setting max_active of
cgroup_destroy_wq to 1?
Note that cgroup_destroy_wq is already created with max_active = 1:
```cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 1);```
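
For illustration, the hang has the classic shape of a work item on a max_active = 1
workqueue waiting for another item queued behind it. Below is a minimal sketch of that
pattern as a hypothetical test module (my own illustration, not the reproducer from this
thread); loading it leaves the single worker stuck in first_fn() the same way
cgroup_destroy_root() gets stuck waiting for the css offline work queued after it on
cgroup_destroy_wq:

```
#include <linux/module.h>
#include <linux/workqueue.h>
#include <linux/completion.h>

/* Single-active workqueue, like cgroup_destroy_wq. */
static struct workqueue_struct *test_wq;
static DECLARE_COMPLETION(second_done);

static void first_fn(struct work_struct *work)
{
	/*
	 * Blocks until second_fn() runs -- but second_work can only run
	 * after first_fn() returns, because max_active is 1.
	 */
	wait_for_completion(&second_done);
}

static void second_fn(struct work_struct *work)
{
	complete(&second_done);
}

static DECLARE_WORK(first_work, first_fn);
static DECLARE_WORK(second_work, second_fn);

static int __init wq_hang_init(void)
{
	test_wq = alloc_workqueue("wq_hang_test", 0, 1);	/* max_active = 1 */
	if (!test_wq)
		return -ENOMEM;
	queue_work(test_wq, &first_work);	/* analogous to root destruction */
	queue_work(test_wq, &second_work);	/* analogous to css A offline */
	return 0;
}
module_init(wq_hang_init);

static void __exit wq_hang_exit(void)
{
	/* Unblock first_work so the workqueue can be drained and freed. */
	complete(&second_done);
	destroy_workqueue(test_wq);
}
module_exit(wq_hang_exit);

MODULE_LICENSE("GPL");
```

With that shape in mind, either raising max_active or not waiting inside the first work
item breaks the cycle; v3 takes the second route.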
The v3 changes prevent a subsystem root's destruction from being blocked by offline events of
unrelated subsystems. Since root destruction only proceeds after all of the root's descendants
have been destroyed, it should not need to wait on child offline events at all. My testing with
the reproducer confirms that this fixes the issue I encountered.
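
For reference, my mental model of that direction is roughly the following. This is a sketch of
the idea only, not the actual v3 diff, and it assumes that plain cgroup_lock() (i.e.
mutex_lock(&cgroup_mutex)) is sufficient at that point; please see the v3 thread for the real
change:

```
/*
 * Sketch only: cgroup_destroy_root() stops draining offline csses of the
 * default hierarchy and just takes cgroup_mutex, since the dying root's
 * own descendants are already gone by the time the destruction work runs.
 */
static void cgroup_destroy_root(struct cgroup_root *root)
{
	/*
	 * was: cgroup_lock_and_drain_offline(&cgrp_dfl_root.cgrp);
	 * which can end up waiting on css offline work queued behind this
	 * very work item on cgroup_destroy_wq (max_active = 1).
	 */
	cgroup_lock();

	/* ... rest of the teardown unchanged ... */

	cgroup_unlock();
}
```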
--
Best regards,
Ridong