linux-kernel - Re: [PATCH v2 -next] cgroup: remove offline draining in root destruction to avoid hung

Message-ID: <htzudoa4cgius7ncus67axelhv3qh6fgjgnvju27fuyw7gimla@uzrta5sfbh2w>
Date: Fri, 25 Jul 2025 19:17:43 +0200
From: Michal Koutný <mkoutny@...e.com>
To: Chen Ridong <chenridong@...weicloud.com>
Cc: tj@...nel.org, hannes@...xchg.org, lizefan@...wei.com, 
	cgroups@...r.kernel.org, linux-kernel@...r.kernel.org, lujialin4@...wei.com, 
	chenridong@...wei.com, gaoyingjie@...ontech.com
Subject: Re: [PATCH v2 -next] cgroup: remove offline draining in root
 destruction to avoid hung_tasks

On Fri, Jul 25, 2025 at 09:42:05AM +0800, Chen Ridong <chenridong@...weicloud.com> wrote:
> > On Tue, Jul 22, 2025 at 11:27:33AM +0000, Chen Ridong <chenridong@...weicloud.com> wrote:
> >> CPU0                            CPU1
> >> mount perf_event                umount net_prio
> >> cgroup1_get_tree                cgroup_kill_sb
> >> rebind_subsystems               // root destruction enqueues
> >> 				// cgroup_destroy_wq
> >> // kill all perf_event css
> >>                                 // one perf_event css A is dying
> >>                                 // css A offline enqueues cgroup_destroy_wq
> >>                                 // root destruction will be executed first
> >>                                 css_free_rwork_fn
> >>                                 cgroup_destroy_root
> >>                                 cgroup_lock_and_drain_offline
> >>                                 // some perf descendants are dying
> >>                                 // cgroup_destroy_wq max_active = 1
> >>                                 // waiting for css A to die
> >>
> >> Problem scenario:
> >> 1. CPU0 mounts perf_event (rebind_subsystems)
> >> 2. CPU1 unmounts net_prio (cgroup_kill_sb), queuing root destruction work
> >> 3. A dying perf_event CSS gets queued for offline after root destruction
> >> 4. Root destruction waits for offline completion, but offline work is
> >>    blocked behind root destruction in cgroup_destroy_wq (max_active=1)
> > 
> > What's concerning me is why umount of the net_prio hierarchy waits for
> > draining of the default hierarchy? (Where you then run into conflict with
> > perf_event that's implicit_on_dfl.)
> > 
> 
> This was also my first response.
> 
> > IOW why not this:
> > --- a/kernel/cgroup/cgroup.c
> > +++ b/kernel/cgroup/cgroup.c
> > @@ -1346,7 +1346,7 @@ static void cgroup_destroy_root(struct cgroup_root *root)
> > 
> >         trace_cgroup_destroy_root(root);
> > 
> > -       cgroup_lock_and_drain_offline(&cgrp_dfl_root.cgrp);
> > +       cgroup_lock_and_drain_offline(cgrp);
> > 
> >         BUG_ON(atomic_read(&root->nr_cgrps));
> >         BUG_ON(!list_empty(&cgrp->self.children));
> > 
> > Does this correct the LTP scenario?
> > 
> > Thanks,
> > Michal
> 
> I've tested this approach and discovered it can lead to another issue that required significant
> investigation. This helped me understand why unmounting the net_prio hierarchy needs to wait for
> draining of the default hierarchy.
> 
> Consider this sequence:
> 
> mount net_prio			umount perf_event
> cgroup1_get_tree
> // &cgrp_dfl_root.cgrp
> cgroup_lock_and_drain_offline
> // wait for all perf_event csses dead
> prepare_to_wait(&dsct->offline_waitq)
> schedule();
> 				cgroup_destroy_root
> 				// &root->cgrp, not cgrp_dfl_root
> 				cgroup_lock_and_drain_offline
								perf_event's css (offline but dying)

> 				rebind_subsystems
> 				rcu_assign_pointer(dcgrp->subsys[ssid], css);
> 				dst_root->subsys_mask |= 1 << ssid;
> 				cgroup_propagate_control
> 				// enable cgrp_dfl_root perf_event css
> 				cgroup_apply_control_enable
> 				css = cgroup_css(dsct, ss);
> 				// since we drain root->cgrp not cgrp_dfl_root
> 				// css(dying) is not null on the cgrp_dfl_root
> 				// we won't create css, but the css is dying

				What would prevent seeing a dying css when
				cgrp_dfl_root is drained?
				(Or nothing drained as in the patch?)

				I assume you've seen this warning from
				cgroup_apply_control_enable
				WARN_ON_ONCE(percpu_ref_is_dying(&css->refcnt)); ?


> 								
> // got the offline_waitq wake up
> goto restart;
> // some perf_event dying csses are online now
> prepare_to_wait(&dsct->offline_waitq)
> schedule();
> // never get the offline_waitq wake up
> 
> I encountered two main issues:
> 1. Dying csses on cgrp_dfl_root may be brought back online when rebinding the subsystem to cgrp_dfl_root

Is this really resolved by the patch? (The questions above.)
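
To make the question concrete, here is a minimal userspace sketch of the
check being discussed. It is not the kernel code: struct model_css, the enum
and model_apply_control_enable() are hypothetical names made up for
illustration. The only thing it assumes is what the quoted sequence says,
namely that a css is created only when the slot is NULL, so a slot still
holding a dying css is reused as-is.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a css; only the "dying" state is modelled. */
struct model_css {
	bool dying;
};

enum enable_outcome {
	CSS_CREATED,      /* slot was empty, a fresh css was installed */
	CSS_REUSED,       /* slot held a live css, nothing to do */
	CSS_REUSED_DYING, /* slot held a dying css: the case that would warn */
};

/*
 * Model of the "create only if NULL" step. *slot plays the role of the
 * per-cgroup subsystem pointer, fresh is the css that would be created.
 */
enum enable_outcome model_apply_control_enable(struct model_css **slot,
					       struct model_css *fresh)
{
	if (*slot == NULL) {
		fresh->dying = false;
		*slot = fresh;
		return CSS_CREATED;
	}
	/* A non-NULL slot is never replaced, even if its css is dying. */
	return (*slot)->dying ? CSS_REUSED_DYING : CSS_REUSED;
}
```

In this model the dying css is only avoided if the slot has been cleared
before the enable step runs, which is what the draining is meant to ensure.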

> 2. Potential hangs during cgrp_dfl_root draining in the mounting process

Fortunately, the typical use case (mounting at boot) wouldn't suffer
from this.

> I believe waiting for a wake-up in cgroup_destroy_wq is inherently risky, as it requires that
> offline css work (which cgroup_destroy_root needs to drain) is never enqueued after
> cgroup_destroy_root begins.

This is a valid point.
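
The ordering hazard can be shown with a small userspace model (a sketch under
stated assumptions, not kernel code). It assumes a FIFO queue that runs one
item at a time, as cgroup_destroy_wq does with max_active = 1, and that an
item may sleep until another item has finished. struct work_item and
first_stuck_item() are hypothetical names used only for this illustration.

```c
#include <stddef.h>

/* Hypothetical model of one queued work item. */
struct work_item {
	const char *name;
	int waits_for; /* index of the item this one sleeps on, or -1 */
};

/*
 * Walk the queue in FIFO order with a single execution slot. When item i
 * runs, every item before it has finished and every item after it is
 * still queued. If item i sleeps on itself or on a later item, that item
 * can never start, so item i blocks forever.
 *
 * Returns the index of the first item that blocks, or -1 if the queue
 * drains completely.
 */
int first_stuck_item(const struct work_item *q, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		int dep = q[i].waits_for;

		if (dep >= 0 && (size_t)dep >= i)
			return (int)i;
	}
	return -1;
}
```

With root destruction queued at index 0 and waiting for a css offline at
index 1, the function reports index 0 as stuck; with the two items swapped
the queue drains. Nothing in the queue itself enforces the safe order.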

> How can we guarantee this ordering? Therefore, I propose moving the draining operation
> outside of cgroup_destroy_wq as a more robust solution that would completely eliminate this
> potential race condition. This patch implements that approach.

I acknowledge the issue (although rare in the real world). Some entity will
always have to wait for the offlining. It may be OK in cgroup_kill_sb
(ideally, if this was bound to the process context of the umount caller, not
sure if that's how kill_sb works).
I slightly dislike the form of an empty lock/unlock -- which makes me
wonder if this is the best solution.

Let me think more about this...

Thanks,
Michal
