Message-ID: <5b53f9ec-ebd5-4bea-b6a3-ef35a467e96c@redhat.com>
Date: Tue, 23 Dec 2025 01:03:42 -0500
From: Waiman Long <llong@...hat.com>
To: Michal Koutný <mkoutny@...e.com>,
 Sun Shaojie <sunshaojie@...inos.cn>
Cc: llong@...hat.com, cgroups@...r.kernel.org, chenridong@...weicloud.com,
 hannes@...xchg.org, linux-kernel@...r.kernel.org,
 linux-kselftest@...r.kernel.org, shuah@...nel.org, tj@...nel.org
Subject: Re: [PATCH v6] cpuset: Avoid invalidating sibling partitions on
 cpuset.cpus conflict.

On 12/22/25 10:26 AM, Michal Koutný wrote:
> Hello Shaojie.
>
> On Mon, Dec 01, 2025 at 05:38:06PM +0800, Sun Shaojie <sunshaojie@...inos.cn> wrote:
>> Currently, when setting a cpuset's cpuset.cpus to a value that conflicts
>> with its sibling partition, the sibling's partition state becomes invalid.
>> However, this invalidation is often unnecessary.
>>
>> For example: On a machine with 128 CPUs, there are m (m < 128) cpusets
>> under the root cgroup. Each cpuset is used by a single user (user-1 uses
>> A1, ..., user-m uses Am), and the partition states of these cpusets are
>> configured as follows:
>>
>>                             root cgroup
>>          /             /                  \                 \
>>         A1            A2        ...       An                Am
>>       (root)        (root)      ...     (root) (root/root invalid/member)
>>
>> Assume that A1 through Am have not set cpuset.cpus.exclusive. When
>> user-m modifies Am's cpuset.cpus to "0-127", it will cause all partition
>> states from A1 to An to change from root to root invalid, as shown
>> below.
>>
>>                             root cgroup
>>          /              /                 \                 \
>>         A1             A2       ...       An                Am
>>   (root invalid) (root invalid) ... (root invalid) (root invalid/member)
>>
>> This outcome is entirely unwarranted for all users from A1 to An.
> s/cpuset.cpus/memory.max/
>
> When the permissions are such that the last (any) sibling can come and
> claim enough to cause overcommit, it can set up a large limit and
> (potentially) trigger reclaim in the others.
>
> s/cpuset.cpus/memory.min/
>
> Here the overcommit is handled by recalculating the effective values of
> memory.min; again, one sibling can skew the distribution toward itself
> and reduce every other sibling's effective value.
>
> The above are not exact analogies because the first of them is a Limit, the
> second is a Protection, and cpusets are Allocations (referring to the Resource
> Distribution Models in Documentation/admin-guide/cgroup-v2.rst).
>
> But the advice for getting some guarantees would be the same in all cases -- if
> some guarantees are expected, the permissions (of the respective cgroup
> attributes) should be configured so that they decouple the owner of the
> cgroup from the owner of the resource (i.e. Ai/cpuset.cpus belongs to
> root, or there's a middle-level cgroup that caps each of the siblings
> individually).
>
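
The middle-level-cgroup arrangement described above could look roughly like
this (a minimal sketch; the cgroup names, CPU ranges and delegation details
are illustrative, not part of the original discussion):

  # cap-1 is owned by root and caps user-1's CPUs; only the A1 subtree is
  # delegated, so user-1 cannot claim CPUs beyond the cap.
  echo "+cpuset" > /sys/fs/cgroup/cgroup.subtree_control        # cpuset for root's children
  mkdir /sys/fs/cgroup/cap-1
  echo "0-31" > /sys/fs/cgroup/cap-1/cpuset.cpus                # the cap, set by root
  echo "+cpuset" > /sys/fs/cgroup/cap-1/cgroup.subtree_control  # cpuset for cap-1's children
  mkdir /sys/fs/cgroup/cap-1/A1
  chown -R user1 /sys/fs/cgroup/cap-1/A1                        # delegate A1 to user-1
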
From a sibling's point of view, CPUs in partitions are exclusive. A cpuset
either has all of the requested CPUs and forms a partition (assuming that at
least one can be granted from the parent cpuset), or it doesn't have all of
them and fails to form a valid partition. This is different from memory,
where a cgroup can get less memory than requested and still work fine.
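
For illustration, partition validity is all-or-nothing and can be checked by
reading cpuset.cpus.partition (a sketch; the path and CPU range are
hypothetical, and the cpuset controller is assumed to be already enabled
for A1):

  echo "0-3" > /sys/fs/cgroup/A1/cpuset.cpus
  echo root  > /sys/fs/cgroup/A1/cpuset.cpus.partition
  cat /sys/fs/cgroup/A1/cpuset.cpus.partition
  # "root" only if all of CPUs 0-3 could be granted;
  # "root invalid (<reason>)" otherwise -- never a partition with just some of them.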

Anyway, I consider using cpuset.cpus to form a partition legacy behavior
that is kept for backward compatibility. Now the proper way to form a
partition is to use cpuset.cpus.exclusive, and writing to it can fail if
the value conflicts with siblings.

When only cpuset.cpus is used to form partitions, the cpuset.cpus value is
treated the same as cpuset.cpus.exclusive once a valid partition is formed.
In that sense, the examples listed in the patch would have the same result
if cpuset.cpus.exclusive were used instead of cpuset.cpus. The difference is
that the write to cpuset.cpus.exclusive would fail instead of forming an
invalid partition as in the cpuset.cpus case.
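
As a sketch of that difference (cgroup names and CPU ranges are illustrative;
A1 is assumed to have already claimed CPUs 0-1 via cpuset.cpus.exclusive and
formed a valid root partition):

  # Legacy route: both writes go through, but B1 ends up invalid.
  echo "1-2" > /sys/fs/cgroup/B1/cpuset.cpus
  echo root  > /sys/fs/cgroup/B1/cpuset.cpus.partition
  cat /sys/fs/cgroup/B1/cpuset.cpus.partition             # "root invalid (...)"

  # Exclusive route: the conflict is reported at write time instead.
  echo "1-2" > /sys/fs/cgroup/B1/cpuset.cpus.exclusive    # write is rejected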

>> After applying this patch, the first party to set "root" will maintain
>> its exclusive validity. As follows:
>>
>>   Step                                       | A1's prstate | B1's prstate |
>>   #1> echo "0-1" > A1/cpuset.cpus            | member       | member       |
>>   #2> echo "root" > A1/cpuset.cpus.partition | root         | member       |
>>   #3> echo "1-2" > B1/cpuset.cpus            | root         | member       |
>>   #4> echo "root" > B1/cpuset.cpus.partition | root         | root invalid |
>>
>>   Step                                       | A1's prstate | B1's prstate |
>>   #1> echo "0-1" > B1/cpuset.cpus            | member       | member       |
>>   #2> echo "root" > B1/cpuset.cpus.partition | member       | root         |
>>   #3> echo "1-2" > A1/cpuset.cpus            | member       | root         |
>>   #4> echo "root" > A1/cpuset.cpus.partition | root invalid | root         |
> I'm worried that the ordering dependency would lead to situations where
> users may not be immediately aware their config is overcommitting the system.
> Consider that CPUs are vital for A1 but B1 can somehow survive the
> degraded state; depending on the starting order, the system may either
> run fine (A1 valid) or fail because of A1.
>
> I'm curious about Waiman's take.

That is why I recommend that users use cpuset.cpus.exclusive to form
partitions, as they get early feedback if they are overcommitting. Of
course, setting cpuset.cpus.exclusive without failure still doesn't 
guarantee the formation of a valid partition if none of the exclusive 
CPUs can be granted from the parent.
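
In other words, even a successful write to cpuset.cpus.exclusive should be
followed by a check of the resulting state (a sketch with hypothetical
paths):

  echo "4-7" > /sys/fs/cgroup/parent/child/cpuset.cpus.exclusive    # may succeed
  echo root  > /sys/fs/cgroup/parent/child/cpuset.cpus.partition
  cat /sys/fs/cgroup/parent/child/cpuset.cpus.exclusive.effective   # may still be empty
  cat /sys/fs/cgroup/parent/child/cpuset.cpus.partition
  # "root invalid (...)" if the parent could not grant any of CPUs 4-7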

Cheers,
Longman

