[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <52026e81-9cf6-4711-9bd0-4b57b31021a4@redhat.com>
Date: Mon, 17 Nov 2025 13:43:02 -0500
From: Waiman Long <llong@...hat.com>
To: Sun Shaojie <sunshaojie@...inos.cn>, chenridong@...weicloud.com,
mkoutny@...e.com, llong@...hat.com
Cc: cgroups@...r.kernel.org, hannes@...xchg.org,
linux-kernel@...r.kernel.org, linux-kselftest@...r.kernel.org,
shuah@...nel.org, tj@...nel.org
Subject: Re: [PATCH v2] cpuset: relax the overlap check for cgroup-v2
On 11/15/25 1:02 AM, Sun Shaojie wrote:
> On 2015/11/15 08:58, Chen Ridong wrote:
>> On 2025/11/15 0:14, Michal Koutný wrote:
>>> On Fri, Nov 14, 2025 at 09:29:20AM +0800, Chen Ridong <chenridong@...weicloud.com> wrote:
>>>> After further consideration, I still suggest retaining this rule.
>>> Apologies, I'm slightly lost which rule. I hope the new iteration from
>>> Shaojie with both before/after tables will explain it.
>>>
>> The rule has changed in this patch from "If either cpuset is exclusive, check if they are mutually
>> exclusive" to
>> "If both cpusets are exclusive, check if they are mutually exclusive"
>>
>> - /* If either cpuset is exclusive, check if they are mutually exclusive */
>> - if (is_cpu_exclusive(cs1) || is_cpu_exclusive(cs2))
>> + /* If both cpusets are exclusive, check if they are mutually exclusive */
>> + if (is_cpu_exclusive(cs1) && is_cpu_exclusive(cs2))
>> + return !cpusets_are_exclusive(cs1, cs2);
>>
>> I suggest not modifying this rule and keeping the original logic intact:
>>
>>>> For am example:
>>>> Step | A1's prstate | B1's prstate |
>>>> #1> mkdir -p A1 | member | |
>>>> #2> echo "0-1" > A1/cpuset.cpus.exclusive | member | |
>>>> #3> echo "root" > A1/cpuset.cpus.partition | root | |
>>>> #4> mkdir -p B1 | root | member |
>>>> #5> echo "0" > B1/cpuset.cpus | root invalid | member |
>>>>
>>>> Currently, we mark A1 as invalid. But similar to the logic in this patch, why must A1 be
>>>> invalidated?
>>> A1 is invalidated becase it doesn't have exclusive ownership of CPU 0
>>> anymore.
>>>
>>>> B1 could also use the parent's effective CPUs, right?
>>> Here you assume some ordering between siblings treating A1 more
>>> important than B1. But it's symmetrical in principle, no?
>>>
>> I’m using an example to illustrate that if Shaojie’s patch is accepted, other rules could be relaxed
>> following the same logic—but I’m not in favor of doing so.
> Hi, Ridong,
>
> Thank you for pointing out the issue with the current patch; this is indeed
> not what our product intends. I must admit that I haven't thoroughly tested
> on such recent kernel versions.
>
> Obviously, this patch is flawed. However, patch v3 is needed. Regarding the
> "other rules" you mentioned, we do not intend to relax them. On the
> contrary, we aim to maintain them firmly.
>
> Our product need ensure the following behavior: in cgroup-v2, user
> modifications to one cpuset should not affect the partition state of its
> sibling cpusets. This is justified and meaningful, as it aligns with the
> isolation characteristics of cgroups.
>
> This can be divided into two scenarios:
> Scenario 1: Only one of A1 and B1 is "root".
> Scenario 2: Both A1 and B1 are "root".
>
> We plan to implement Scenario 1 first. This is the goal of patch v2.
> However, patch v2 is flawed because it does not strictly adhere to the
> following existing rule.
>
> However, it is worth noting that the current cgroup v2 implementation does
> not strictly adhere to the following rule either (which is also an
> objective for patch v3 to address).
>
> Rule 1: "cpuset.cpus" cannot be a subset of a sibling's "cpuset.cpus.exclusive".
Inside the cpuset code, the rule should be "cpuset.cpus should not be a
subset of sibling's cpuset.cpus.exclusive".
Note that one rule that should always be followed in v2 is that writing
any valid cpumask into cpuset.cpus is allowed, but writing to
cpuset.cpus.exclusive may fail if some rules are violated. If the new
cpuset.cpus violate the rules for a sibling partition root, the current
code will invalidate the sibling partition. I am not against changing
the cpuset.cpus.effective to a suitable value to avoid the conflict
instead invalidating a sibling partition. We do need to make sure that
the new behavior is consistent under different circumstances.
>
> Using your example to illustrate.
> Step (refer to the steps in the table below)
> #1> mkdir -p A1
> #2> echo "0-1" > A1/cpuset.cpus.exclusive
> #3> echo "root" > A1/cpuset.cpus.partition
> #4> mkdir -p B1
> #5> echo "0" > B1/cpuset.cpus
>
> Table 1: Current result
> Step | return | A1's excl_cpus | B1's cpus | A1's prstate | B1's prstate |
> #1 | 0 | | | member | |
> #2 | 0 | 0-1 | | member | |
> #3 | 0 | 0-1 | | root | |
> #4 | 0 | 0-1 | | root | member |
> #5 | 0 | 0-1 | 0 | root invalid | member |
>
> Table 2: Expected result
> Step | return | A1's excl_cpus | B1's cpus | A1's prstate | B1's prstate |
> #1 | 0 | | | member | |
> #2 | 0 | 0-1 | | member | |
> #3 | 0 | 0-1 | | root | |
> #4 | 0 | 0-1 | | root | member |
> #5 | error | 0-1 | | root | member |
>
> Currently, after step #5, the operation returns success, which clearly
> violates Rule 1, as B1's "cpuset.cpus" is a subset of A1's
> "cpuset.cpus.exclusive".
>
> Therefore, after step #5, the operation should return error, with A1
> remaining as "root". This better complies with the Rule 1.
>
> ------
> The following content is provided for reference, and we hope it may be
> adopted in the future.
> !!These are not part of what patch v3 will implement.
>
> As for Scenario 2 (Both A1 and B1 are "root"), we will retain the current
> cgroup v2 behavior. This patch series does not modify it, but we hope to
> draw the maintainers' attention, as we indeed have plans for future
> modifications. Our intent can be seen from the following examples.
>
> For example:
> Step (refer to the steps in the table below)
> #1> mkdir -p A1
> #2> echo "0-1" > A1/cpuset.cpus
> #3> echo "root" > A1/cpuset.cpus.partition
> #4> mkdir -p B1
> #5> echo "2-3" > B1/cpuset.cpus
> #6> echo "root" > B1/cpuset.cpus.partition
> #7> echo "1-2" > B1/cpuset.cpus
>
> Table 1: Current result
> Step | A1's eft_cpus | B1's eft_cpus | A1's prstate | B1's prstate |
> #1 | from parent | | member | |
> #2 | 0-1 | | member | |
> #3 | 0-1 | | root | |
> #4 | 0-1 | from parent | root | member |
> #5 | 0-1 | 2-3 | root | member |
> #6 | 0-1 | 2-3 | root | root |
> #7 | 0-1 | 1-2 | root invalid | root invalid |
>
> Table 2: Expected result
> Step | A1's eft_cpus | B1's eft_cpus | A1's prstate | B1's prstate |
> #1 | from parent | | member | |
> #2 | 0-1 | | member | |
> #3 | 0-1 | | root | |
> #4 | 0-1 | from parent | root | member |
> #5 | 0-1 | 2-3 | root | member |
> #6 | 0-1 | 2-3 | root | root |
> #7 | 0-1 | 2 | root | root invalid |
>
> After step #7, we expect A1 to remain "root" (unaffected), while only B1
> becomes "root invalid".
>
>
> The following Rule 2 and Rule 3 are alsomplemented and adhered to by our
> product. The current cgroup v2 implementation does not enforce them.
> Likewise, we hope this will draw the maintainers' attention. Maybe, they can
> be applied in the future.
>
> Rule 2: In one cpuset, when "cpuset.cpus" is not null, "cpuset.cpus.effective"
> must either be a subset of it, or "cpuset.cpus.effective" is null.
cpuset.cpus.effective will never be null in v2 with the exception of a
partition root distributing out all its CPUs to child sub-partitions.
>
> Rule 3: In one cpuset, when "cpuset.cpus" is not null, "cpuset.cpus.exclusive"
> must either be a subset of it, or "cpuset.cpus.exclusive" is null.
We currently don't have this rule in the cpuset code as
cpuset.cpus.exclusive can be independent of "cpuset.cpus". We could
implement this rule by failing the cpuset.cpus.exclusive write in this
case, but we can't fail the write to "cpuset.cpus" if the rule is violated.
Cheers,
Longman
Powered by blists - more mailing lists