Message-ID: <afc95938-0eb5-427b-a2dd-a7eccf54d891@huaweicloud.com>
Date: Fri, 15 Aug 2025 18:28:53 +0800
From: Chen Ridong <chenridong@...weicloud.com>
To: Hillf Danton <hdanton@...a.com>
Cc: Michal Koutny <mkoutny@...e.com>, tj@...nel.org, cgroups@...r.kernel.org,
linux-kernel@...r.kernel.org, lujialin4@...wei.com, chenridong@...wei.com,
gaoyingjie@...ontech.com
Subject: Re: [PATCH v2 -next] cgroup: remove offline draining in root
destruction to avoid hung_tasks
On 2025/8/15 18:02, Hillf Danton wrote:
> On Fri, 15 Aug 2025 15:29:56 +0800 Chen Ridong wrote:
>> On 2025/8/15 10:40, Hillf Danton wrote:
>>> On Fri, Jul 25, 2025 at 09:42:05AM +0800, Chen Ridong <chenridong@...weicloud.com> wrote:
>>>>> On Tue, Jul 22, 2025 at 11:27:33AM +0000, Chen Ridong <chenridong@...weicloud.com> wrote:
>>>>>> CPU0                                            CPU1
>>>>>> mount perf_event                                umount net_prio
>>>>>> cgroup1_get_tree                                cgroup_kill_sb
>>>>>> rebind_subsystems                               // root destruction enqueues
>>>>>>                                                 // cgroup_destroy_wq
>>>>>> // kill all perf_event css
>>>>>> // one perf_event css A is dying
>>>>>> // css A offline enqueues cgroup_destroy_wq
>>>>>> // root destruction will be executed first
>>>>>>                                                 css_free_rwork_fn
>>>>>>                                                 cgroup_destroy_root
>>>>>>                                                 cgroup_lock_and_drain_offline
>>>>>>                                                 // some perf descendants are dying
>>>>>>                                                 // cgroup_destroy_wq max_active = 1
>>>>>>                                                 // waiting for css A to die
>>>>>>
>>>>>> Problem scenario:
>>>>>> 1. CPU0 mounts perf_event (rebind_subsystems)
>>>>>> 2. CPU1 unmounts net_prio (cgroup_kill_sb), queuing root destruction work
>>>>>> 3. A dying perf_event CSS gets queued for offline after root destruction
>>>>>> 4. Root destruction waits for offline completion, but offline work is
>>>>>> blocked behind root destruction in cgroup_destroy_wq (max_active=1)
>>>>>
>>>>> What's concerning me is why the umount of the net_prio hierarchy waits for
>>>>> draining of the default hierarchy? (Where you then run into conflict with
>>>>> perf_event that's implicit_on_dfl.)
>>>>>
>>> /*
>>> * cgroup destruction makes heavy use of work items and there can be a lot
>>> * of concurrent destructions. Use a separate workqueue so that cgroup
>>> * destruction work items don't end up filling up max_active of system_wq
>>> * which may lead to deadlock.
>>> */
>>>
>>> If the task hang can be reliably reproduced, it is the right time to cut
>>> max_active off for cgroup_destroy_wq, according to its comment.
>>
>> Hi Danton,
>>
>> Thank you for your feedback.
>>
>> While modifying max_active could be a viable solution, I’m unsure whether it might introduce other
>> side effects. Instead, I’ve proposed an alternative approach in v3 of the patch, which I believe
>> addresses the issue more comprehensively.
>>
> Given your reproducer [1], it is simple to test with max_active cut.
>
> Frankly, I do not think v3 is a correct fix, because it leaves the root cause
> intact. Nor is the problem cgroup specific, even given the high concurrency in destruction.
>
> [1] https://lore.kernel.org/lkml/39e05402-40c7-4631-a87b-8e3747ceddc6@huaweicloud.com/
Hi Danton,
Thank you for your reply.
To clarify, when you mentioned "cut max_active off", did you mean setting max_active of
cgroup_destroy_wq to 1?
Note that cgroup_destroy_wq is already created with max_active = 1:
```cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 1);```
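
For illustration, the hang has the classic shape of a work item on a max_active = 1
workqueue waiting for another item queued behind it. Below is a minimal sketch of that
pattern as a hypothetical test module (my own illustration, not the reproducer from this
thread); loading it leaves the single worker stuck in first_fn() the same way
cgroup_destroy_root() gets stuck waiting for the css offline work queued after it on
cgroup_destroy_wq:

```
#include <linux/module.h>
#include <linux/workqueue.h>
#include <linux/completion.h>

/* Single-active workqueue, like cgroup_destroy_wq. */
static struct workqueue_struct *test_wq;
static DECLARE_COMPLETION(second_done);

static void first_fn(struct work_struct *work)
{
	/*
	 * Blocks until second_fn() runs -- but second_work can only run
	 * after first_fn() returns, because max_active is 1.
	 */
	wait_for_completion(&second_done);
}

static void second_fn(struct work_struct *work)
{
	complete(&second_done);
}

static DECLARE_WORK(first_work, first_fn);
static DECLARE_WORK(second_work, second_fn);

static int __init wq_hang_init(void)
{
	test_wq = alloc_workqueue("wq_hang_test", 0, 1);	/* max_active = 1 */
	if (!test_wq)
		return -ENOMEM;
	queue_work(test_wq, &first_work);	/* analogous to root destruction */
	queue_work(test_wq, &second_work);	/* analogous to css A offline */
	return 0;
}
module_init(wq_hang_init);

static void __exit wq_hang_exit(void)
{
	/* Unblock first_work so the workqueue can be drained and freed. */
	complete(&second_done);
	destroy_workqueue(test_wq);
}
module_exit(wq_hang_exit);

MODULE_LICENSE("GPL");
```

With that shape in mind, either raising max_active or not waiting inside the first work
item breaks the cycle; v3 takes the second route.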
The v3 changes prevent a subsystem root's destruction from being blocked by offline events of
unrelated subsystems. Since root destruction only proceeds after all of the root's descendants
have been destroyed, it should not need to wait on child offline events at all. My testing with
the reproducer confirms that this fixes the issue I encountered.
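
For reference, my mental model of that direction is roughly the following. This is a sketch of
the idea only, not the actual v3 diff, and it assumes that plain cgroup_lock() (i.e.
mutex_lock(&cgroup_mutex)) is sufficient at that point; please see the v3 thread for the real
change:

```
/*
 * Sketch only: cgroup_destroy_root() stops draining offline csses of the
 * default hierarchy and just takes cgroup_mutex, since the dying root's
 * own descendants are already gone by the time the destruction work runs.
 */
static void cgroup_destroy_root(struct cgroup_root *root)
{
	/*
	 * was: cgroup_lock_and_drain_offline(&cgrp_dfl_root.cgrp);
	 * which can end up waiting on css offline work queued behind this
	 * very work item on cgroup_destroy_wq (max_active = 1).
	 */
	cgroup_lock();

	/* ... rest of the teardown unchanged ... */

	cgroup_unlock();
}
```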
--
Best regards,
Ridong