linux-kernel - Re: [PATCH v2] cpuset: relax the overlap check for cgroup-v2

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <52026e81-9cf6-4711-9bd0-4b57b31021a4@redhat.com>
Date: Mon, 17 Nov 2025 13:43:02 -0500
From: Waiman Long <llong@...hat.com>
To: Sun Shaojie <sunshaojie@...inos.cn>, chenridong@...weicloud.com,
 mkoutny@...e.com, llong@...hat.com
Cc: cgroups@...r.kernel.org, hannes@...xchg.org,
 linux-kernel@...r.kernel.org, linux-kselftest@...r.kernel.org,
 shuah@...nel.org, tj@...nel.org
Subject: Re: [PATCH v2] cpuset: relax the overlap check for cgroup-v2

On 11/15/25 1:02 AM, Sun Shaojie wrote:
> On 2015/11/15 08:58, Chen Ridong wrote:
>> On 2025/11/15 0:14, Michal Koutný wrote:
>>> On Fri, Nov 14, 2025 at 09:29:20AM +0800, Chen Ridong <chenridong@...weicloud.com> wrote:
>>>> After further consideration, I still suggest retaining this rule.
>>> Apologies, I'm slightly lost which rule. I hope the new iteration from
>>> Shaojie with both before/after tables will explain it.
>>>
>> The rule has changed in this patch from "If either cpuset is exclusive, check if they are mutually
>> exclusive" to
>> "If both cpusets are exclusive, check if they are mutually exclusive"
>>
>>   -    /* If either cpuset is exclusive, check if they are mutually exclusive */
>>   -    if (is_cpu_exclusive(cs1) || is_cpu_exclusive(cs2))
>>   +    /* If both cpusets are exclusive, check if they are mutually exclusive */
>>   +    if (is_cpu_exclusive(cs1) && is_cpu_exclusive(cs2))
>>   +        return !cpusets_are_exclusive(cs1, cs2);
>>
>> I suggest not modifying this rule and keeping the original logic intact:
>>
>>>> For am example:
>>>>    Step                                       | A1's prstate | B1's prstate |
>>>>    #1> mkdir -p A1                            | member       |              |
>>>>    #2> echo "0-1" > A1/cpuset.cpus.exclusive  | member       |              |
>>>>    #3> echo "root" > A1/cpuset.cpus.partition | root         |              |
>>>>    #4> mkdir -p B1                            | root         | member       |
>>>>    #5> echo "0" > B1/cpuset.cpus              | root invalid | member       |
>>>>
>>>> Currently, we mark A1 as invalid. But similar to the logic in this patch, why must A1 be
>>>> invalidated?
>>> A1 is invalidated becase it doesn't have exclusive ownership of CPU 0
>>> anymore.
>>>
>>>> B1 could also use the parent's effective CPUs, right?
>>> Here you assume some ordering between siblings treating A1 more
>>> important than B1. But it's symmetrical in principle, no?
>>>
>> I’m using an example to illustrate that if Shaojie’s patch is accepted, other rules could be relaxed
>> following the same logic—but I’m not in favor of doing so.
> Hi, Ridong,
>
> Thank you for pointing out the issue with the current patch; this is indeed
> not what our product intends. I must admit that I haven't thoroughly tested
> on such recent kernel versions.
>
> Obviously, this patch is flawed. However, patch v3 is needed. Regarding the
> "other rules" you mentioned, we do not intend to relax them. On the
> contrary, we aim to maintain them firmly.
>
> Our product need ensure the following behavior: in cgroup-v2, user
> modifications to one cpuset should not affect the partition state of its
> sibling cpusets. This is justified and meaningful, as it aligns with the
> isolation characteristics of cgroups.
>
> This can be divided into two scenarios:
> Scenario 1: Only one of A1 and B1 is "root".
> Scenario 2: Both A1 and B1 are "root".
>
> We plan to implement Scenario 1 first. This is the goal of patch v2.
> However, patch v2 is flawed because it does not strictly adhere to the
> following existing rule.
>
> However, it is worth noting that the current cgroup v2 implementation does
> not strictly adhere to the following rule either (which is also an
> objective for patch v3 to address).
>
> Rule 1: "cpuset.cpus" cannot be a subset of a sibling's "cpuset.cpus.exclusive".

Inside the cpuset code, the rule should be "cpuset.cpus should not be a 
subset of sibling's cpuset.cpus.exclusive".

Note that one rule that should always be followed in v2 is that writing 
any valid cpumask into cpuset.cpus is allowed, but writing to 
cpuset.cpus.exclusive may fail if some rules are violated. If the new 
cpuset.cpus violate the rules for a sibling partition root, the current 
code will invalidate the sibling partition. I am not against changing 
the cpuset.cpus.effective to a suitable value to avoid the conflict 
instead invalidating a sibling partition. We do need to make sure that 
the new behavior is consistent under different circumstances.

>
> Using your example to illustrate.
>   Step (refer to the steps in the table below）
>   #1> mkdir -p A1
>   #2> echo "0-1" > A1/cpuset.cpus.exclusive
>   #3> echo "root" > A1/cpuset.cpus.partition
>   #4> mkdir -p B1
>   #5> echo "0" > B1/cpuset.cpus
>
> Table 1: Current result
>   Step | return | A1's excl_cpus | B1's cpus | A1's prstate | B1's prstate |
>   #1   | 0      |                |           | member       |              |
>   #2   | 0      | 0-1            |           | member       |              |
>   #3   | 0      | 0-1            |           | root         |              |
>   #4   | 0      | 0-1            |           | root         | member       |
>   #5   | 0      | 0-1            | 0         | root invalid | member       |
>
> Table 2: Expected result
>   Step | return | A1's excl_cpus | B1's cpus | A1's prstate | B1's prstate |
>   #1   | 0      |                |           | member       |              |
>   #2   | 0      | 0-1            |           | member       |              |
>   #3   | 0      | 0-1            |           | root         |              |
>   #4   | 0      | 0-1            |           | root         | member       |
>   #5   | error  | 0-1            |           | root         | member       |
>
> Currently, after step #5, the operation returns success, which clearly
> violates Rule 1, as B1's "cpuset.cpus" is a subset of A1's
> "cpuset.cpus.exclusive".
>
> Therefore, after step #5, the operation should return error, with A1
> remaining as "root". This better complies with the Rule 1.
>
> ------
> The following content is provided for reference, and we hope it may be
> adopted in the future.
> !!These are not part of what patch v3 will implement.
>
> As for Scenario 2 (Both A1 and B1 are "root")， we will retain the current
> cgroup v2 behavior. This patch series does not modify it, but we hope to
> draw the maintainers' attention, as we indeed have plans for future
> modifications. Our intent can be seen from the following examples.
>
> For example:
>   Step (refer to the steps in the table below）
>   #1> mkdir -p A1
>   #2> echo "0-1"  > A1/cpuset.cpus
>   #3> echo "root" > A1/cpuset.cpus.partition
>   #4> mkdir -p B1
>   #5> echo "2-3"  > B1/cpuset.cpus
>   #6> echo "root" > B1/cpuset.cpus.partition
>   #7> echo "1-2"  > B1/cpuset.cpus
>
> Table 1: Current result
>   Step | A1's eft_cpus | B1's eft_cpus | A1's prstate | B1's prstate |
>   #1   | from parent   |               | member       |              |
>   #2   | 0-1           |               | member       |              |
>   #3   | 0-1           |               | root         |              |
>   #4   | 0-1           | from parent   | root         | member       |
>   #5   | 0-1           | 2-3           | root         | member       |
>   #6   | 0-1           | 2-3           | root         | root         |
>   #7   | 0-1           | 1-2           | root invalid | root invalid |
>
> Table 2: Expected result
>   Step | A1's eft_cpus | B1's eft_cpus | A1's prstate | B1's prstate |
>   #1   | from parent   |               | member       |              |
>   #2   | 0-1           |               | member       |              |
>   #3   | 0-1           |               | root         |              |
>   #4   | 0-1           | from parent   | root         | member       |
>   #5   | 0-1           | 2-3           | root         | member       |
>   #6   | 0-1           | 2-3           | root         | root         |
>   #7   | 0-1           | 2             | root         | root invalid |
>
> After step #7, we expect A1 to remain "root" (unaffected), while only B1
> becomes "root invalid".
>
>   
> The following Rule 2 and Rule 3 are alsomplemented and adhered to by our
> product. The current cgroup v2 implementation does not enforce them.
> Likewise, we hope this will draw the maintainers' attention. Maybe, they can
> be applied in the future.
>
> Rule 2: In one cpuset, when "cpuset.cpus" is not null, "cpuset.cpus.effective"
>          must either be a subset of it, or "cpuset.cpus.effective" is null.

cpuset.cpus.effective will never be null in v2 with the exception of a 
partition root distributing out all its CPUs to child sub-partitions.


>
> Rule 3: In one cpuset, when "cpuset.cpus" is not null, "cpuset.cpus.exclusive"
>          must either be a subset of it, or "cpuset.cpus.exclusive" is null.

We currently don't have this rule in the cpuset code as 
cpuset.cpus.exclusive can be independent of "cpuset.cpus".  We could 
implement this rule by failing the cpuset.cpus.exclusive write in this 
case, but we can't fail the write to "cpuset.cpus" if the rule is violated.

Cheers,
Longman