Message-ID: <7f36d0c7-3476-4bc6-b66e-48496a8be514@huaweicloud.com>
Date: Fri, 25 Jul 2025 09:42:05 +0800
From: Chen Ridong <chenridong@...weicloud.com>
To: Michal Koutný <mkoutny@...e.com>
Cc: tj@...nel.org, hannes@...xchg.org, lizefan@...wei.com,
cgroups@...r.kernel.org, linux-kernel@...r.kernel.org, lujialin4@...wei.com,
chenridong@...wei.com, gaoyingjie@...ontech.com
Subject: Re: [PATCH v2 -next] cgroup: remove offline draining in root
destruction to avoid hung_tasks
On 2025/7/24 21:35, Michal Koutný wrote:
> Hi Ridong.
>
> On Tue, Jul 22, 2025 at 11:27:33AM +0000, Chen Ridong <chenridong@...weicloud.com> wrote:
>> CPU0 CPU1
>> mount perf_event umount net_prio
>> cgroup1_get_tree cgroup_kill_sb
>> rebind_subsystems // root destruction enqueues
>> // cgroup_destroy_wq
>> // kill all perf_event css
>> // one perf_event css A is dying
>> // css A offline enqueues cgroup_destroy_wq
>> // root destruction will be executed first
>> css_free_rwork_fn
>> cgroup_destroy_root
>> cgroup_lock_and_drain_offline
>> // some perf descendants are dying
>> // cgroup_destroy_wq max_active = 1
>> // waiting for css A to die
>>
>> Problem scenario:
>> 1. CPU0 mounts perf_event (rebind_subsystems)
>> 2. CPU1 unmounts net_prio (cgroup_kill_sb), queuing root destruction work
>> 3. A dying perf_event CSS gets queued for offline after root destruction
>> 4. Root destruction waits for offline completion, but offline work is
>> blocked behind root destruction in cgroup_destroy_wq (max_active=1)
>
> What's concerning me is why umount of net_prio hierarhy waits for
> draining of the default hierachy? (Where you then run into conflict with
> perf_event that's implicit_on_dfl.)
>
This was also my first response.
> IOW why not this:
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -1346,7 +1346,7 @@ static void cgroup_destroy_root(struct cgroup_root *root)
>
> trace_cgroup_destroy_root(root);
>
> - cgroup_lock_and_drain_offline(&cgrp_dfl_root.cgrp);
> + cgroup_lock_and_drain_offline(cgrp);
>
> BUG_ON(atomic_read(&root->nr_cgrps));
> BUG_ON(!list_empty(&cgrp->self.children));
>
> Does this correct the LTP scenario?
>
> Thanks,
> Michal
I've tested this approach and discovered it can lead to another issue that required significant
investigation. This helped me understand why unmounting the net_prio hierarchy needs to wait for
draining of the default hierarchy.
Consider this sequence:
mount net_prio                          umount perf_event
cgroup1_get_tree
// &cgrp_dfl_root.cgrp
cgroup_lock_and_drain_offline
// wait for all perf_event csses dead
prepare_to_wait(&dsct->offline_waitq)
schedule();
                                        cgroup_destroy_root
                                        // &root->cgrp, not cgrp_dfl_root
                                        cgroup_lock_and_drain_offline
                                        rebind_subsystems
                                        rcu_assign_pointer(dcgrp->subsys[ssid], css);
                                        dst_root->subsys_mask |= 1 << ssid;
                                        cgroup_propagate_control
                                        // enable cgrp_dfl_root perf_event css
                                        cgroup_apply_control_enable
                                        css = cgroup_css(dsct, ss);
                                        // since we drain root->cgrp not cgrp_dfl_root
                                        // css(dying) is not null on the cgrp_dfl_root
                                        // we won't create css, but the css is dying
// got the offline_waitq wake up
goto restart;
// some perf_event dying csses are online now
prepare_to_wait(&dsct->offline_waitq)
schedule();
// never get the offline_waitq wake up
I encountered two main issues:
1. Dying csses on cgrp_dfl_root may be brought back online when rebinding the subsystem to cgrp_dfl_root
2. Potential hangs during cgrp_dfl_root draining in the mounting process
I have tried calling cgroup_lock_and_drain_offline with the subtree_ss_mask, and it seems to fix
the issue I encountered. But I am not sure whether other problematic scenarios exist when
[u]mounting multiple legacy subsystems at the same time.
I believe waiting for a wake-up in cgroup_destroy_wq is inherently risky, as it requires that the
css offline work (which cgroup_destroy_root needs to drain) is never enqueued after
cgroup_destroy_root begins. How can we guarantee this ordering? Therefore, I propose moving the
draining operation outside of cgroup_destroy_wq as a more robust solution that would completely
eliminate this potential race condition. This patch implements that approach.
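The ordering hazard described above can be illustrated with a minimal user-space sketch in C.
It is not kernel code: enum work_kind, struct work, and queue_drains() are hypothetical names
invented for this illustration. The model assumes only what this thread states, namely that
cgroup_destroy_wq runs one item at a time (max_active = 1) and that root destruction blocks until
pending css offline work has finished.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
enum work_kind { WORK_CSS_OFFLINE, WORK_ROOT_DESTROY };

struct work {
    enum work_kind kind;
};

/*
 * Model a workqueue with max_active = 1: items run strictly in FIFO order,
 * one at a time. A WORK_ROOT_DESTROY item blocks until every css offline
 * item has completed.
 *
 * Returns true if the whole queue can drain, false if a root-destroy item
 * would wait forever on an offline item that is queued behind it.
 */
bool queue_drains(const struct work *q, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (q[i].kind != WORK_ROOT_DESTROY)
            continue;
        /* Root destruction is now the only running item ... */
        for (size_t j = i + 1; j < n; j++) {
            /* ... so an offline item behind it can never run. */
            if (q[j].kind == WORK_CSS_OFFLINE)
                return false;
        }
    }
    return true;
}
```

In this model, a queue of { WORK_ROOT_DESTROY, WORK_CSS_OFFLINE } cannot drain, while the reverse
order can. Draining outside the queue corresponds to the root-destroy item never blocking on
anything queued behind it.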
Best regards,
Ridong