Message-ID: <3019dd82-5388-427f-8618-30b5d8b06fd8@redhat.com>
Date: Tue, 11 Nov 2025 21:21:14 -0500
From: Waiman Long <llong@...hat.com>
To: Chen Ridong <chenridong@...weicloud.com>, Waiman Long <llong@...hat.com>,
Michal Koutný <mkoutny@...e.com>
Cc: tj@...nel.org, hannes@...xchg.org, cgroups@...r.kernel.org,
linux-kernel@...r.kernel.org, lujialin4@...wei.com, chenridong@...wei.com
Subject: Re: [PATCH next] cpuset: Treat tasks in attaching process as
populated
On 11/11/25 8:58 PM, Chen Ridong wrote:
>
> On 2025/11/12 4:35, Waiman Long wrote:
>> On 11/11/25 2:25 PM, Michal Koutný wrote:
>>> On Tue, Nov 11, 2025 at 10:16:33AM -0500, Waiman Long <llong@...hat.com> wrote:
>>>> For an internal helper like this one, we may not really need that, as
>>>> almost all the code in cpuset.c is within either a cpuset_mutex or a
>>>> callback_lock critical section. So I am fine with or without it.
>>> OK, cpuset_mutex and callback_lock are close, but cgroup_is_populated(),
>>> which caught my eye, would also need cgroup_mutex, otherwise "the result
>>> can only be used as a hint" (quote from cgroup.h).
>>>
>>> Or is it safe to assume that cpuset_mutex inside cpuset_attach() is
>>> sufficient to always (incl. exits) ensure stability of
>>> cgroup_is_populated() result?
>>>
>>> Anyway, I'd find some clarifications in the commit message or the
>>> surrounding code about this helpful. (Judgment call, whether with a
>>> lockdep macro. My opinion is -- why not.)
>> For attach_in_progress, it is protected by the cpuset_mutex. So it may make sense to add a
>> lockdep_assert_held() for that.
>>
> Will add.
>
>> You are right that there are problems WRT the stability of the cgroup_is_populated() value.
>>
>> I think "cgrp->nr_populated_csets + cs->attach_in_progress" should be almost stable for the cgroup
>> itself with cpuset_mutex held, but there can be a small timing window after cpuset_attach() and
>> before the stat is updated where the sum is 0 even though there are actually tasks in the cgroup.
>>
> Do you mean there’s a small window after ss->attach (i.e., cpuset_attach) where
> cgrp->nr_populated_csets + cs->attach_in_progress could be 0?
>
> If I understand correctly:
>
> ss->can_attach: cs->attach_in_progress++, sum > 0
> css_set_move_task->css_set_update_populated: cgrp->nr_populated_csets++, sum > 0
> ss->attach->cpuset_attach: cs->attach_in_progress--, sum > 0
>
> What exactly is the small window you’re referring to where the sum equals 0?
Yes, nr_populated_csets is incremented before cpuset_attach() is
called. I think I had mixed up the ordering. Thanks for the
clarification.
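
For reference, a rough sketch of the kind of helper being discussed here,
with the lockdep_assert_held() mentioned above. The helper name and exact
form are illustrative only, not necessarily what the patch uses:

/*
 * Illustrative sketch only: treat a cpuset as populated if the cgroup
 * core already counts it as populated or if a task attach is still in
 * flight.  attach_in_progress is protected by cpuset_mutex.
 */
static bool cpuset_is_populated(struct cpuset *cs)
{
        lockdep_assert_held(&cpuset_mutex);

        return cgroup_is_populated(cs->css.cgroup) ||
               cs->attach_in_progress;
}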
>
>> For "cgrp->nr_populated_domain_children + cgrp->nr_populated_threaded_children", it also has the
>> problem that the sum can be 0 but there are attach_in_progress set in one or more of the child
>> cgroups. So even with this patch, we can't guarantee 100% that there can be no task in the partition
>> even if it has empty effective_cpus. It is only a problem for nested local partitions though as
>> remote partitions are not allowed to exhaust all the CPUs from root cgroup.
>>
>> We should probably document that limitation to warn users if they try to create nested local
>> partitions where the parent partition root of the child partitions has empty effective_cpus.
>>
> Hmm, but that is what commit e2d59900d936 ("cgroup/cpuset: Allow no-task partition to have empty
> cpuset.cpus.effective") allowed, and it makes sense.
This commit allows users not to waste a CPU when they want to set up
nested local partitions where some intermediate levels have no tasks.
One way to close this gap is to iterate down the cpuset hierarchy and
check whether any descendant's attach_in_progress count is set, along
the lines sketched below. So it is a solvable problem.
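
Roughly something like this (an illustrative sketch, not an actual patch;
the helper name is made up, while cpuset_for_each_descendant_pre() and
attach_in_progress are the existing cpuset machinery):

/*
 * Sketch only: with cpuset_mutex held, report whether any cpuset in the
 * subtree rooted at @cs still has an attach in flight, so that an
 * "empty" parent partition is not mistakenly treated as task-free.
 */
static bool cpuset_subtree_attach_in_progress(struct cpuset *cs)
{
        struct cgroup_subsys_state *pos_css;
        struct cpuset *child;
        bool ret = false;

        lockdep_assert_held(&cpuset_mutex);

        rcu_read_lock();
        cpuset_for_each_descendant_pre(child, pos_css, cs) {
                if (child->attach_in_progress) {
                        ret = true;
                        break;
                }
        }
        rcu_read_unlock();
        return ret;
}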
There is also a flip side where cgroup_is_populated() returns a non-zero
value, but the tasks are actively being moved away. This false positive
is less of a problem, even though it can cause a failure when distributing
all the CPUs to child partitions. The users can retry, and it should work
the second time.
Cheers,
Longman