linux-kernel - Re: [PATCH v2] cpuset: relax the overlap check for cgroup-v2

Message-ID: <d9332ba1-2614-44c2-b2e8-eab213f196bc@huaweicloud.com>
Date: Sat, 15 Nov 2025 15:41:03 +0800
From: Chen Ridong <chenridong@...weicloud.com>
To: Sun Shaojie <sunshaojie@...inos.cn>, mkoutny@...e.com, llong@...hat.com
Cc: cgroups@...r.kernel.org, hannes@...xchg.org,
 linux-kernel@...r.kernel.org, linux-kselftest@...r.kernel.org,
 shuah@...nel.org, tj@...nel.org
Subject: Re: [PATCH v2] cpuset: relax the overlap check for cgroup-v2



On 2025/11/15 14:02, Sun Shaojie wrote:
> On 2025/11/15 08:58, Chen Ridong wrote:
>> On 2025/11/15 0:14, Michal Koutný wrote:
>>> On Fri, Nov 14, 2025 at 09:29:20AM +0800, Chen Ridong <chenridong@...weicloud.com> wrote:
>>>> After further consideration, I still suggest retaining this rule.
>>>
>>> Apologies, I'm slightly lost which rule. I hope the new iteration from
>>> Shaojie with both before/after tables will explain it.
>>>
>>
>> The rule has changed in this patch from "If either cpuset is exclusive, check if they are mutually
>> exclusive" to
>> "If both cpusets are exclusive, check if they are mutually exclusive"
>>
>>  -    /* If either cpuset is exclusive, check if they are mutually exclusive */
>>  -    if (is_cpu_exclusive(cs1) || is_cpu_exclusive(cs2))
>>  +    /* If both cpusets are exclusive, check if they are mutually exclusive */
>>  +    if (is_cpu_exclusive(cs1) && is_cpu_exclusive(cs2))
>>  +        return !cpusets_are_exclusive(cs1, cs2);
>>
>> I suggest not modifying this rule and keeping the original logic intact:
>>
>>>> For example:
>>>>   Step                                       | A1's prstate | B1's prstate |
>>>>   #1> mkdir -p A1                            | member       |              |
>>>>   #2> echo "0-1" > A1/cpuset.cpus.exclusive  | member       |              |
>>>>   #3> echo "root" > A1/cpuset.cpus.partition | root         |              |
>>>>   #4> mkdir -p B1                            | root         | member       |
>>>>   #5> echo "0" > B1/cpuset.cpus              | root invalid | member       |
>>>>
>>>> Currently, we mark A1 as invalid. But similar to the logic in this patch, why must A1 be
>>>> invalidated?
>>>
>>> A1 is invalidated because it doesn't have exclusive ownership of CPU 0
>>> anymore.
>>>
>>>> B1 could also use the parent's effective CPUs, right?
>>>
>>> Here you assume some ordering between siblings treating A1 more
>>> important than B1. But it's symmetrical in principle, no?
>>>
>>
>> I’m using an example to illustrate that if Shaojie’s patch is accepted, other rules could be relaxed
>> following the same logic—but I’m not in favor of doing so.
> 
> Hi, Ridong,
> 
> Thank you for pointing out the issue with the current patch; this is indeed
> not what our product intends. I must admit that I haven't thoroughly tested
> it on such recent kernel versions.
> 
> Obviously, this patch is flawed, so a v3 is needed. Regarding the
> "other rules" you mentioned, we do not intend to relax them. On the 
> contrary, we aim to maintain them firmly.
> 
> Our product needs to ensure the following behavior: in cgroup-v2, user 
> modifications to one cpuset should not affect the partition state of its 
> sibling cpusets. This is justified and meaningful, as it aligns with the 
> isolation characteristics of cgroups.
> 

This is ideal in theory, but I don’t think it’s practical in reality.

> This can be divided into two scenarios:
> Scenario 1: Only one of A1 and B1 is "root".
> Scenario 2: Both A1 and B1 are "root".
> 
> We plan to implement Scenario 1 first. This is the goal of patch v2.
> However, patch v2 is flawed because it does not strictly adhere to the 
> following existing rule.
> 
> That said, it is worth noting that the current cgroup v2 implementation does 
> not strictly adhere to the following rule either (which is also an 
> objective for patch v3 to address).
> 
> Rule 1: "cpuset.cpus" cannot be a subset of a sibling's "cpuset.cpus.exclusive".
> 
> Using your example to illustrate.
>  Step (refer to the steps in the table below)
>  #1> mkdir -p A1                           
>  #2> echo "0-1" > A1/cpuset.cpus.exclusive 
>  #3> echo "root" > A1/cpuset.cpus.partition
>  #4> mkdir -p B1               
>  #5> echo "0" > B1/cpuset.cpus 
> 
> Table 1: Current result
>  Step | return | A1's excl_cpus | B1's cpus | A1's prstate | B1's prstate |
>  #1   | 0      |                |           | member       |              |
>  #2   | 0      | 0-1            |           | member       |              |
>  #3   | 0      | 0-1            |           | root         |              |
>  #4   | 0      | 0-1            |           | root         | member       |
>  #5   | 0      | 0-1            | 0         | root invalid | member       |
> 

I think this is what we expect.

> Table 2: Expected result
>  Step | return | A1's excl_cpus | B1's cpus | A1's prstate | B1's prstate |
>  #1   | 0      |                |           | member       |              |
>  #2   | 0      | 0-1            |           | member       |              |
>  #3   | 0      | 0-1            |           | root         |              |
>  #4   | 0      | 0-1            |           | root         | member       |
>  #5   | error  | 0-1            |           | root         | member       |
> 

Step 5 should not return an error. As Longman pointed out, in cgroup-v2, setting cpuset.cpus should
never fail.
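A minimal C model may make the "writes never fail" semantics concrete. It is a simplified sketch, not kernel code: `struct model_cpuset`, `model_write_cpus()` and the bitmask encoding (bit n = CPU n) are hypothetical, and only the behavior shown in Table 1 is modeled. The write to B1's `cpuset.cpus` always succeeds, and the conflict is resolved by invalidating the sibling partition that lost exclusive ownership:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, heavily simplified model of two sibling cpusets.
 * CPU sets are bitmasks (bit n = CPU n); this is not the kernel layout. */
enum prstate { PRS_MEMBER, PRS_ROOT, PRS_ROOT_INVALID };

struct model_cpuset {
	uint64_t cpus;       /* cpuset.cpus (0 = not set, inherit from parent) */
	uint64_t excl_cpus;  /* cpuset.cpus.exclusive */
	enum prstate prs;    /* cpuset.cpus.partition */
};

/* Writing cpuset.cpus always succeeds (returns 0). If the new mask takes
 * CPUs that a valid sibling partition claims exclusively, the sibling
 * loses exclusive ownership and is marked "root invalid" instead. */
static int model_write_cpus(struct model_cpuset *cs,
			    struct model_cpuset *sibling, uint64_t mask)
{
	assert(cs != sibling); /* a cpuset is not its own sibling */
	cs->cpus = mask;
	if (sibling->prs == PRS_ROOT && (sibling->excl_cpus & mask))
		sibling->prs = PRS_ROOT_INVALID;
	return 0;
}
```

Replaying steps #4 and #5 of Table 1 in this model (A1 root with exclusive CPUs 0-1, then `echo "0" > B1/cpuset.cpus`) returns 0 and leaves A1 as "root invalid". The alternative in Table 2 would instead make the write itself return an error, which is what cgroup-v2 tries to avoid.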

> Currently, after step #5, the operation returns success, which clearly 
> violates Rule 1, as B1's "cpuset.cpus" is a subset of A1's 
> "cpuset.cpus.exclusive".
> 
> Therefore, after step #5, the operation should return an error, with A1 
> remaining as "root". This better complies with Rule 1.
> 

This is an exclusivity rule. Since step #5 violates it, A1 should be invalidated.
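The difference between the original check and the v2 change quoted earlier ("either" vs. "both" cpusets exclusive) can be sketched in isolation. This is a simplified, hypothetical model: `struct sketch_cs` and the `sketch_`/`overlap_conflict_` helpers only stand in for `is_cpu_exclusive()` and `cpusets_are_exclusive()`, and the real kernel compares more state than a single CPU mask:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical simplified cpuset: only what the overlap check looks at. */
struct sketch_cs {
	bool cpu_exclusive;  /* stands in for is_cpu_exclusive() */
	uint64_t cpus;       /* CPU mask compared by the check (bit n = CPU n) */
};

/* Stands in for cpusets_are_exclusive(): no CPU in common. */
static bool sketch_are_exclusive(const struct sketch_cs *a,
				 const struct sketch_cs *b)
{
	return (a->cpus & b->cpus) == 0;
}

/* Original rule: if EITHER cpuset is exclusive, they must not overlap. */
static bool overlap_conflict_either(const struct sketch_cs *a,
				    const struct sketch_cs *b)
{
	if (a->cpu_exclusive || b->cpu_exclusive)
		return !sketch_are_exclusive(a, b);
	return false;
}

/* Rule proposed in v2: only check when BOTH cpusets are exclusive. */
static bool overlap_conflict_both(const struct sketch_cs *a,
				  const struct sketch_cs *b)
{
	if (a->cpu_exclusive && b->cpu_exclusive)
		return !sketch_are_exclusive(a, b);
	return false;
}
```

With A1 as an exclusive root on CPUs 0-1 and B1 as a non-exclusive member using CPU 0, the original rule reports a conflict while the v2 rule does not; that missing conflict is exactly the exclusivity violation discussed above.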

> ------
> The following content is provided for reference, and we hope it may be 
> adopted in the future.
> !!These are not part of what patch v3 will implement.
> 
> As for Scenario 2 (Both A1 and B1 are "root"), we will retain the current 
> cgroup v2 behavior. This patch series does not modify it, but we hope to 
> draw the maintainers' attention, as we indeed have plans for future 
> modifications. Our intent can be seen from the following examples.
> 
> For example:
>  Step (refer to the steps in the table below)
>  #1> mkdir -p A1                           
>  #2> echo "0-1"  > A1/cpuset.cpus 
>  #3> echo "root" > A1/cpuset.cpus.partition
>  #4> mkdir -p B1               
>  #5> echo "2-3"  > B1/cpuset.cpus 
>  #6> echo "root" > B1/cpuset.cpus.partition
>  #7> echo "1-2"  > B1/cpuset.cpus
> 
> Table 1: Current result
>  Step | A1's eff_cpus | B1's eff_cpus | A1's prstate | B1's prstate |
>  #1   | from parent   |               | member       |              |
>  #2   | 0-1           |               | member       |              |
>  #3   | 0-1           |               | root         |              |
>  #4   | 0-1           | from parent   | root         | member       |
>  #5   | 0-1           | 2-3           | root         | member       |
>  #6   | 0-1           | 2-3           | root         | root         |
>  #7   | 0-1           | 1-2           | root invalid | root invalid |
> 
> Table 2: Expected result
>  Step | A1's eff_cpus | B1's eff_cpus | A1's prstate | B1's prstate |
>  #1   | from parent   |               | member       |              |
>  #2   | 0-1           |               | member       |              |
>  #3   | 0-1           |               | root         |              |
>  #4   | 0-1           | from parent   | root         | member       |
>  #5   | 0-1           | 2-3           | root         | member       |
>  #6   | 0-1           | 2-3           | root         | root         |
>  #7   | 0-1           | 2             | root         | root invalid |
> 
> After step #7, we expect A1 to remain "root" (unaffected), while only B1 
> becomes "root invalid".
> 

With the result you expect, would we observe the following behaviors:

#1> mkdir -p A1
#2> mkdir -p B1
#3> echo "0-1"  > A1/cpuset.cpus
#4> echo "1-2"  > B1/cpuset.cpus
#5> echo "root" > A1/cpuset.cpus.partition
#6> echo "root" > B1/cpuset.cpus.partition # A1:root;B1:root invalid

#1> mkdir -p A1
#2> mkdir -p B1
#3> echo "0-1"  > A1/cpuset.cpus
#4> echo "1-2"  > B1/cpuset.cpus
#5> echo "root" > B1/cpuset.cpus.partition
#6> echo "root" > A1/cpuset.cpus.partition # A1:root invalid;B1:root

Do different operation orders yield different results? If so, this is not what we expect.
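The order dependence can be reproduced with a small C model of one possible reading of the expected behavior: an existing valid root keeps its CPUs, and only the newcomer becomes "root invalid". This policy, `struct sketch_part` and `sketch_make_root()` are hypothetical and only illustrate the two command sequences above; they are not what the kernel or patch v3 implements:

```c
#include <assert.h>
#include <stdint.h>

enum sketch_prs { S_MEMBER, S_ROOT, S_ROOT_INVALID };

struct sketch_part {
	uint64_t cpus;        /* cpuset.cpus as a bitmask (bit n = CPU n) */
	enum sketch_prs prs;  /* cpuset.cpus.partition */
};

/* Hypothetical "existing root wins" policy: a cpuset asking to become a
 * root becomes "root invalid" if it overlaps a sibling that is already a
 * valid root; the sibling itself is left untouched. */
static void sketch_make_root(struct sketch_part *cs,
			     const struct sketch_part *sibling)
{
	if (sibling->prs == S_ROOT && (sibling->cpus & cs->cpus))
		cs->prs = S_ROOT_INVALID;
	else
		cs->prs = S_ROOT;
}
```

With A1 on CPUs 0-1 (mask 0x3) and B1 on CPUs 1-2 (mask 0x6), making A1 root first leaves A1 "root" and B1 "root invalid", while making B1 root first gives the opposite result: the same final configuration ends in different states depending only on the order of the writes.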

-- 
Best regards,
Ridong