Message-ID: <5b53f9ec-ebd5-4bea-b6a3-ef35a467e96c@redhat.com>
Date: Tue, 23 Dec 2025 01:03:42 -0500
From: Waiman Long <llong@...hat.com>
To: Michal Koutný <mkoutny@...e.com>,
Sun Shaojie <sunshaojie@...inos.cn>
Cc: llong@...hat.com, cgroups@...r.kernel.org, chenridong@...weicloud.com,
hannes@...xchg.org, linux-kernel@...r.kernel.org,
linux-kselftest@...r.kernel.org, shuah@...nel.org, tj@...nel.org
Subject: Re: [PATCH v6] cpuset: Avoid invalidating sibling partitions on
cpuset.cpus conflict.
On 12/22/25 10:26 AM, Michal Koutný wrote:
> Hello Shaojie.
>
> On Mon, Dec 01, 2025 at 05:38:06PM +0800, Sun Shaojie <sunshaojie@...inos.cn> wrote:
>> Currently, when setting a cpuset's cpuset.cpus to a value that conflicts
>> with its sibling partition, the sibling's partition state becomes invalid.
>> However, this invalidation is often unnecessary.
>>
>> For example: On a machine with 128 CPUs, there are m (m < 128) cpusets
>> under the root cgroup. Each cpuset is used by a single user (user-1 uses
>> A1, ... , user-m uses Am), and the partition states of these cpusets are
>> configured as follows:
>>
>>                      root cgroup
>>         /            /         \                  \
>>        A1           A2    ...   An                 Am
>>      (root)       (root)  ...  (root)   (root/root invalid/member)
>>
>> Assume that A1 through Am have not set cpuset.cpus.exclusive. When
>> user-m modifies Am's cpuset.cpus to "0-127", it will cause all partition
>> states from A1 to An to change from root to root invalid, as shown
>> below.
>>
>>                              root cgroup
>>         /                /              \                  \
>>        A1               A2       ...     An                 Am
>>  (root invalid)   (root invalid) ... (root invalid) (root invalid/member)
>>
>> This outcome is entirely undeserved for all users from A1 to An.
> s/cpuset.cpus/memory.max/
>
> When the permissions are such that the last (any) sibling can come and
> claim so much to cause overcommit, then it can set up large limit and
> (potentially) reclaim from others.
>
> s/cpuset.cpus/memory.min/
>
> Here is the overcommit approached by recalculating effective values of
> memory.min, again one sibling can skew toward itself and reduce every
> other's effective value.
>
> The above are not exact analogies because the first of them is Limits, the
> second is Protections and cpusets are Allocations (referring to Resource
> Distribution Models from Documentation/admin-guide/cgroup-v2.rst).
>
> But the advice to get some guarantees would be same in all cases -- if
> some guarantees are expected, the permissions (of respective cgroup
> attributes) should be configured so that it decouples the owner of the
> cgroup from the owner of the resource (i.e. Ai/cpuset.cpus belongs to
> root or there's a middle level cgroup that'd cap each of the siblings
> individually).
>
From a sibling's point of view, CPUs in partitions are exclusive. A cpuset
either has all the requested CPUs to form a partition (assuming that at
least one can be granted from the parent cpuset) or it doesn't have all
of them and fails to form a valid partition. This differs from memory,
where a cgroup can be given less memory than requested and can still
work fine.

Anyway, I consider using cpuset.cpus to form a partition to be legacy; it
is supported for backward compatibility reasons. The proper way to form a
partition now is to use cpuset.cpus.exclusive, and a write to it can fail
if the value conflicts with siblings.

When only cpuset.cpus is used to form partitions, the cpuset.cpus value
will be treated the same as cpuset.cpus.exclusive if a valid partition is
formed. In that sense, the examples listed in the patch will have the
same result if cpuset.cpus.exclusive is used instead of cpuset.cpus. The
difference is that a conflicting write to cpuset.cpus.exclusive will
fail, whereas with cpuset.cpus an invalid partition is formed instead.
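
To illustrate the contrast, here is a toy Python model. It is only a
sketch of the behavior described in this message, not kernel code: the
Cpuset class, its method names, and the simplified conflict rule (any
overlap with CPUs already claimed by a sibling) are hypothetical. For the
cpuset.cpus path it follows the first table quoted below from the patch
description.

```python
class Cpuset:
    """Toy stand-in for one child cgroup; hypothetical, not a kernel API."""

    def __init__(self, name, siblings):
        self.name = name
        self.cpus = set()          # models cpuset.cpus
        self.exclusive = set()     # models cpuset.cpus.exclusive
        self.partition = "member"  # models cpuset.cpus.partition
        self.siblings = siblings   # shared list of all children of one parent
        siblings.append(self)

    def _claimed_by_siblings(self):
        # CPUs held by siblings: explicit exclusive sets, plus cpuset.cpus
        # of siblings that formed a valid partition without an exclusive set.
        claimed = set()
        for s in self.siblings:
            if s is self:
                continue
            claimed |= s.exclusive
            if s.partition == "root" and not s.exclusive:
                claimed |= s.cpus
        return claimed

    def write_exclusive(self, cpus):
        # cpuset.cpus.exclusive path: a conflicting write is rejected at once.
        if set(cpus) & self._claimed_by_siblings():
            raise ValueError(f"{self.name}: conflicts with a sibling")
        self.exclusive = set(cpus)

    def write_cpus(self, cpus):
        # cpuset.cpus path: in this model the write itself always succeeds.
        self.cpus = set(cpus)

    def make_root(self):
        # Assumption: with cpuset.cpus only, the conflict surfaces here,
        # as an invalid partition rather than a failed write.
        wanted = self.exclusive or self.cpus
        conflict = wanted & self._claimed_by_siblings()
        self.partition = "root invalid" if conflict else "root"
        return self.partition


# Legacy path: both writes succeed, the second partition ends up invalid.
kids = []
a1, b1 = Cpuset("A1", kids), Cpuset("B1", kids)
a1.write_cpus({0, 1})
a1.make_root()
b1.write_cpus({1, 2})
b1.make_root()
print(a1.partition, "|", b1.partition)
```

Replacing the write_cpus() calls with write_exclusive() makes the second
write raise ValueError, which is the early feedback the exclusive file
gives.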
>> After applying this patch, the first party to set "root" will maintain
>> its exclusive validity. As follows:
>>
>>  Step                                       | A1's prstate | B1's prstate |
>>  #1> echo "0-1" > A1/cpuset.cpus            | member       | member       |
>>  #2> echo "root" > A1/cpuset.cpus.partition | root         | member       |
>>  #3> echo "1-2" > B1/cpuset.cpus            | root         | member       |
>>  #4> echo "root" > B1/cpuset.cpus.partition | root         | root invalid |
>>
>>  Step                                       | A1's prstate | B1's prstate |
>>  #1> echo "0-1" > B1/cpuset.cpus            | member       | member       |
>>  #2> echo "root" > B1/cpuset.cpus.partition | member       | root         |
>>  #3> echo "1-2" > A1/cpuset.cpus            | member       | root         |
>>  #4> echo "root" > A1/cpuset.cpus.partition | root invalid | root         |
> I'm worried that the ordering dependency would lead to situations where
> users may not be immediately aware their config is overcommitting the system.
> Consider that CPUs are vital for A1 but B1 can somehow survive the
> degraded state, depending on the starting order the system may either
> run fine (A1 valid) or fail because of A1.
>
> I'm curious about Waiman's take.
That is why I recommend that users use cpuset.cpus.exclusive to form
partitions, as they get early feedback if they are overcommitting. Of
course, setting cpuset.cpus.exclusive without failure still doesn't
guarantee the formation of a valid partition if none of the exclusive
CPUs can be granted from the parent.
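
As a rough illustration of that last caveat, the check can be sketched in
Python. This is a simplified model under my own assumptions, not the
kernel's logic: partition_state() and its arguments are hypothetical, and
"granted" is taken to be the plain intersection with what the parent has
available.

```python
def partition_state(exclusive_cpus, parent_available):
    """Toy check: which exclusive CPUs the parent can grant, and the
    resulting partition state. Hypothetical helper, not a kernel API."""
    granted = set(exclusive_cpus) & set(parent_available)
    # Assumption: at least one granted CPU is enough for a valid partition.
    state = "root" if granted else "root invalid"
    return state, granted


# The write of {4, 5} succeeded earlier, but the parent only has 0-3 left.
print(partition_state({4, 5}, {0, 1, 2, 3}))
```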
Cheers,
Longman