Message-ID: <bzu7va4de6ylaww2xbq67hztyokpui7qm2zcqtiwjlniyvx7dt@wf47lg6etmas>
Date: Mon, 22 Dec 2025 16:26:37 +0100
From: Michal Koutný <mkoutny@...e.com>
To: Sun Shaojie <sunshaojie@...inos.cn>
Cc: llong@...hat.com, cgroups@...r.kernel.org, chenridong@...weicloud.com,
hannes@...xchg.org, linux-kernel@...r.kernel.org, linux-kselftest@...r.kernel.org,
shuah@...nel.org, tj@...nel.org
Subject: Re: [PATCH v6] cpuset: Avoid invalidating sibling partitions on
cpuset.cpus conflict.
Hello Shaojie.
On Mon, Dec 01, 2025 at 05:38:06PM +0800, Sun Shaojie <sunshaojie@...inos.cn> wrote:
> Currently, when setting a cpuset's cpuset.cpus to a value that conflicts
> with its sibling partition, the sibling's partition state becomes invalid.
> However, this invalidation is often unnecessary.
>
> For example: On a machine with 128 CPUs, there are m (m < 128) cpusets
> under the root cgroup. Each cpuset is used by a single user(user-1 use
> A1, ... , user-m use Am), and the partition states of these cpusets are
> configured as follows:
>
>                      root cgroup
>         /             /                  \                 \
>        A1            A2        ...       An                Am
>      (root)        (root)      ...     (root)    (root/root invalid/member)
>
> Assume that A1 through Am have not set cpuset.cpus.exclusive. When
> user-m modifies Am's cpuset.cpus to "0-127", it will cause all partition
> states from A1 to An to change from root to root invalid, as shown
> below.
>
>                             root cgroup
>         /                 /                      \                 \
>        A1                A2           ...        An                Am
>  (root invalid)    (root invalid)     ...  (root invalid)  (root invalid/member)
>
> This outcome is entirely undeserved for all users from A1 to An.
s/cpuset.cpus/memory.max/
When the permissions allow the last (or any) sibling to come in and
claim enough to cause overcommit, it can set a large limit and
(potentially) cause reclaim from the others.
s/cpuset.cpus/memory.min/
Here the overcommit is handled by recalculating the effective values of
memory.min; again, one sibling can skew the distribution toward itself
and reduce every other sibling's effective value.
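
To make the memory.min case concrete, here is a rough sketch of the
proportional scaling described in cgroup-v2.rst. It is a simplified model,
not the kernel's actual calculation: the function name effective_min(), the
GiB numbers and the sibling names are made up for illustration, and details
such as recursive protection are ignored.

```python
def effective_min(parent_min, children):
    """Toy model of effective memory.min under overcommit.

    children maps a cgroup name to (memory_min, usage), in the same unit
    as parent_min. Assumption: each child's protected amount is
    min(memory_min, usage); if those amounts sum to more than the
    parent's protection, each child gets a share proportional to its
    protected amount.
    """
    protected = {name: min(mmin, usage)
                 for name, (mmin, usage) in children.items()}
    total = sum(protected.values())
    if total <= parent_min:
        # No overcommit: every child keeps what it asked for.
        return protected
    # Overcommit: scale every sibling down proportionally.
    return {name: parent_min * p // total for name, p in protected.items()}


# Hypothetical numbers in GiB: the parent protects 8, two siblings claim 4 each.
before = effective_min(8, {"A1": (4, 4), "A2": (4, 4)})
# A third sibling then claims 8 for itself.
after = effective_min(8, {"A1": (4, 4), "A2": (4, 4), "Am": (8, 8)})
print(before)
print(after)
```

In this model A1 and A2 drop from 4 to 2 as soon as Am raises its own
memory.min, although neither of them changed its configuration.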
The two substitutions above are not exact analogies: the first is a
Limit, the second is a Protection, and cpusets are Allocations (referring
to the Resource Distribution Models in
Documentation/admin-guide/cgroup-v2.rst).
But the advice for getting guarantees would be the same in all cases: if
guarantees are expected, the permissions (of the respective cgroup
attributes) should be configured so that the owner of the cgroup is
decoupled from the owner of the resource (i.e. Ai/cpuset.cpus belongs to
root, or there is a middle-level cgroup that caps each of the siblings
individually).
> After applying this patch, the first party to set "root" will maintain
> its exclusive validity. As follows:
>
> Step                                       | A1's prstate | B1's prstate |
> #1> echo "0-1" > A1/cpuset.cpus            | member       | member       |
> #2> echo "root" > A1/cpuset.cpus.partition | root         | member       |
> #3> echo "1-2" > B1/cpuset.cpus            | root         | member       |
> #4> echo "root" > B1/cpuset.cpus.partition | root         | root invalid |
>
> Step                                       | A1's prstate | B1's prstate |
> #1> echo "0-1" > B1/cpuset.cpus            | member       | member       |
> #2> echo "root" > B1/cpuset.cpus.partition | member       | root         |
> #3> echo "1-2" > A1/cpuset.cpus            | member       | root         |
> #4> echo "root" > A1/cpuset.cpus.partition | root invalid | root         |
I'm worried that the ordering dependency would lead to situations where
users may not be immediately aware that their config is overcommitting
the system. Consider that CPUs are vital for A1 but B1 can somehow
survive the degraded state: depending on the starting order, the system
may either run fine (A1 valid) or fail because A1 ended up invalid.
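
A toy model of the concern, assuming the "first party to set root wins"
rule from the patch description. The helpers parse_cpus() and start() are
hypothetical and only mimic the tables quoted above; they are not kernel
code.

```python
def parse_cpus(spec):
    """Parse a CPU list such as "0-1,4" into a set of CPU numbers."""
    cpus = set()
    for part in spec.split(","):
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus


def start(order, wanted):
    """Siblings request "root" in the given order.

    Assumption: a request stays valid only if its CPUs do not overlap
    the CPUs of a sibling that is already a valid partition root.
    """
    claimed = set()
    state = {}
    for name in order:
        cpus = parse_cpus(wanted[name])
        if cpus & claimed:
            state[name] = "root invalid"
        else:
            state[name] = "root"
            claimed |= cpus
    return state


# Same overlapping configuration, two different starting orders.
wanted = {"A1": "0-1", "B1": "1-2"}
print(start(["A1", "B1"], wanted))
print(start(["B1", "A1"], wanted))
```

The configuration is identical in both runs; only the order decides
whether A1 or B1 ends up as "root invalid".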
I'm curious about Waiman's take.
Thanks,
Michal