linux-kernel - Re: [PATCH 01/10] cgroup/cpuset: Fix race between newly created partition and dying one

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <915d1261-ee9f-4080-a338-775982e1c48d@redhat.com>
Date: Mon, 31 Mar 2025 23:12:06 -0400
From: Waiman Long <llong@...hat.com>
To: Tejun Heo <tj@...nel.org>
Cc: Johannes Weiner <hannes@...xchg.org>, Michal Koutný
 <mkoutny@...e.com>, Shuah Khan <shuah@...nel.org>, cgroups@...r.kernel.org,
 linux-kernel@...r.kernel.org, linux-kselftest@...r.kernel.org
Subject: Re: [PATCH 01/10] cgroup/cpuset: Fix race between newly created
 partition and dying one

On 3/31/25 7:13 PM, Tejun Heo wrote:
> Hello,
>
> On Sun, Mar 30, 2025 at 05:52:39PM -0400, Waiman Long wrote:
> ...
>> One possible way to fix this is to iterate the dying cpusets as well and
>> avoid using the exclusive CPUs in those dying cpusets. However, this
>> can still cause random partition creation failures or other anomalies
>> due to racing. A better way to fix this race is to reset the partition
>> state at the moment when a cpuset is being killed.
> I'm not a big fan of adding another method call in the destruction path.
> css_offline() is where the kill can be seen from all CPUs and notified to
> the controller and I'm not sure why bringing it sooner would be necessary to
> close the race window. Can't the creation side drain the cgroups that are
> going down if the asynchronous part is a problem? e.g. We already have
> cgroup_lock_and_drain_offline() which isn't the most scalable thing but
> partition operations aren't very frequent, right? And if that's a problem,
> there should be a way to make it reasonably quicker.

The problem is the RCU delay between the time a cgroup is killed and is 
in a dying state and when the partition is deactivated when 
cpuset_css_offline() is called. That delay can be rather lengthy 
depending on the current workload.

Another alternative that I can think of is to scan the remote partition 
list for remote partition and sibling cpusets for local partition 
whenever some kind of conflicts are detected when enabling a partition. 
When a dying cpuset partition is detected, deactivate it immediately to 
resolve the conflict. Otherwise, the dying partition will still be 
deactivated at cpuset_css_offline() time.

That will be a bit more complex and I think can still get the problem 
solved without adding a new method. What do you think? If you are OK 
with that, I will send out a new patch later this week.

Thanks,
Longman