linux-kernel - Re: [PATCH 01/10] cgroup/cpuset: Fix race between newly created partition and dying one

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <d8d7c633-a9e3-4990-8904-4c7710894789@redhat.com>
Date: Tue, 1 Apr 2025 16:56:40 -0400
From: Waiman Long <llong@...hat.com>
To: Waiman Long <llong@...hat.com>, Tejun Heo <tj@...nel.org>
Cc: Johannes Weiner <hannes@...xchg.org>, Michal Koutný
 <mkoutny@...e.com>, Shuah Khan <shuah@...nel.org>, cgroups@...r.kernel.org,
 linux-kernel@...r.kernel.org, linux-kselftest@...r.kernel.org
Subject: Re: [PATCH 01/10] cgroup/cpuset: Fix race between newly created
 partition and dying one


On 4/1/25 4:41 PM, Waiman Long wrote:
>
> On 4/1/25 3:59 PM, Tejun Heo wrote:
>> Hello, Waiman.
>>
>> On Mon, Mar 31, 2025 at 11:12:06PM -0400, Waiman Long wrote:
>>> The problem is the RCU delay between the time a cgroup is killed and 
>>> is in a
>>> dying state and when the partition is deactivated when 
>>> cpuset_css_offline()
>>> is called. That delay can be rather lengthy depending on the current
>>> workload.
>> If we don't have to do it too often, synchronize_rcu_expedited() may be
>> workable too. What do you think?
>
> I don't think we ever call synchronize_rcu() in the cgroup code except 
> for rstat flush. In fact, we didn't use to have an easy way to know if 
> there were dying cpusets hanging around. Now we can probably use the 
> root cgroup's nr_dying_subsys[cpuset_cgrp_id] to know if we need to 
> use synchronize_rcu*() call to wait for it. However, I still need to 
> check if there is any racing window that will cause us to miss it.

Sorry, I don't think I can use synchronize_rcu_expedited() as the use 
cases that I am seeing most often is the creation of isolated partitions 
running latency sensitive applications like DPDK. Using 
synchronize_rcu_expedited() will send IPIs to all the CPUs which may 
break the required latency guarantee for those applications. Just using 
synchronize_rcu(), however, will have unpredictable latency impacting 
user experience.

>
>>
>>> Another alternative that I can think of is to scan the remote 
>>> partition list
>>> for remote partition and sibling cpusets for local partition 
>>> whenever some
>>> kind of conflicts are detected when enabling a partition. When a dying
>>> cpuset partition is detected, deactivate it immediately to resolve the
>>> conflict. Otherwise, the dying partition will still be deactivated at
>>> cpuset_css_offline() time.
>>>
>>> That will be a bit more complex and I think can still get the 
>>> problem solved
>>> without adding a new method. What do you think? If you are OK with 
>>> that, I
>>> will send out a new patch later this week.
>> If synchronize_rcu_expedited() won't do, let's go with the original 
>> patch.
>> The operation does make general sense in that it's for a distinctive 
>> step in
>> the destruction process although I'm a bit curious why it's called 
>> before
>> DYING is set.
>
Because of the above, I still prefer either using the original patch or 
scanning for dying cpuset partitions in case a conflict is detected. 
Please let me know what you think about it.

Thanks,
Longman