Message-ID: <3fd8aa75-ce1b-4d5a-aada-0b2cfbedb36c@redhat.com>
Date: Fri, 13 Sep 2024 13:17:15 -0400
From: Waiman Long <longman@...hat.com>
To: zhengzucheng <zhengzucheng@...wei.com>, peterz@...radead.org,
juri.lelli@...hat.com, vincent.guittot@...aro.org, dietmar.eggemann@....com,
rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
vschneid@...hat.com, oleg@...hat.com,
Frederic Weisbecker <frederic@...nel.org>, mingo@...nel.org,
peterx@...hat.com, tj@...nel.org, tjcao980311@...il.com
Cc: linux-kernel@...r.kernel.org
Subject: Re: [Question] sched:the load is unbalanced in the VM overcommitment scenario
On 9/13/24 00:03, zhengzucheng wrote:
> In a VM overcommitment scenario with a 1:2 overcommitment ratio, 8
> CPUs are shared by two 8-vCPU VMs, i.e. 16 vCPUs are bound to 8 CPUs.
> However, one VM obtains only 2 CPUs' worth of time while the other VM
> gets 6 CPUs' worth.
> The host has 80 CPUs in one sched domain; the other CPUs are idle.
> The root cause is that the host load is unbalanced: some vCPUs occupy
> CPU resources exclusively.
> When the CPU that triggers load balancing calculates the imbalance
> value, env->imbalance comes out as 0 because
> local->avg_load > sds->avg_load. As a result, load balancing fails.
> The processing logic:
> https://github.com/torvalds/linux/commit/91dcf1e8068e9a8823e419a7a34ff4341275fb70
>
>
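For readers following along, here is a stand-alone toy model of the check
described above (the local->avg_load vs. sds->avg_load comparison lives in
calculate_imbalance() in kernel/sched/fair.c; the numbers below are made
up, and the real function handles many more cases):

#include <stdio.h>

#define SCHED_CAPACITY_SCALE 1024UL

/* avg_load as the kernel computes it: group load scaled by capacity. */
static unsigned long avg_load(unsigned long load, unsigned long capacity)
{
        return load * SCHED_CAPACITY_SCALE / capacity;
}

int main(void)
{
        /* Illustrative numbers only: a busy 8-CPU group holding all 16
         * hogs, inside an 80-CPU domain that is otherwise idle. */
        unsigned long local = avg_load(16 * 1024UL, 8 * SCHED_CAPACITY_SCALE);
        unsigned long sds   = avg_load(16 * 1024UL, 80 * SCHED_CAPACITY_SCALE);
        unsigned long imbalance;

        /*
         * The check in question: if the local group is already more
         * loaded than the domain average, don't pull any tasks.
         */
        if (local >= sds)
                imbalance = 0;
        else
                imbalance = sds - local;        /* grossly simplified */

        printf("local=%lu sds=%lu imbalance=%lu\n", local, sds, imbalance);
        return 0;
}

With these numbers it prints local=2048 sds=204 imbalance=0: the group is
far above the domain average, so no tasks are pulled, even though CPUs
inside the group are unevenly loaded.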
> This is normal from the kernel load balancer's point of view, but it
> is not reasonable from the perspective of VM users.
> In cgroup v1, setting cpuset.sched_load_balance=0 to modify the sched
> domains works around it.
> Is there any other method to fix this problem? Thanks.
>
> Abstracted reproduction case:
> 1. Environment information:
>
> [root@...alhost ~]# cat /proc/schedstat
>
> cpu0
> domain0 00000000,00000000,00010000,00000000,00000001
> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
> cpu1
> domain0 00000000,00000000,00020000,00000000,00000002
> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
> cpu2
> domain0 00000000,00000000,00040000,00000000,00000004
> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
> cpu3
> domain0 00000000,00000000,00080000,00000000,00000008
> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
>
> 2. Test case:
>
> vcpu.c
>
> /* CPU hog standing in for a busy vCPU: sleep while the remaining
>  * instances are started, then spin forever. */
> #include <unistd.h>
>
> int main(void)
> {
>         sleep(20);      /* wait for all instances to be launched */
>         while (1);      /* burn CPU */
>         return 0;
> }
>
> gcc vcpu.c -o vcpu
> -----------------------------------------------------------------
> test.sh
>
> #!/bin/bash
>
> # vcpu_1: 8 CPU hogs confined to CPUs 0-3 and 80-83
> mkdir /sys/fs/cgroup/cpuset/vcpu_1
> echo '0-3, 80-83' > /sys/fs/cgroup/cpuset/vcpu_1/cpuset.cpus
> echo 0 > /sys/fs/cgroup/cpuset/vcpu_1/cpuset.mems
> for i in {1..8}
> do
>         ./vcpu &
>         pid=$!
>         sleep 1
>         echo $pid > /sys/fs/cgroup/cpuset/vcpu_1/tasks
> done
>
> # vcpu_2: 8 more hogs on the same CPUs
> mkdir /sys/fs/cgroup/cpuset/vcpu_2
> echo '0-3, 80-83' > /sys/fs/cgroup/cpuset/vcpu_2/cpuset.cpus
> echo 0 > /sys/fs/cgroup/cpuset/vcpu_2/cpuset.mems
> for i in {1..8}
> do
>         ./vcpu &
>         pid=$!
>         sleep 1
>         echo $pid > /sys/fs/cgroup/cpuset/vcpu_2/tasks
> done
> ------------------------------------------------------------------
> [root@...alhost ~]# ./test.sh
>
> [root@...alhost ~]# top -d 1 -c -p $(pgrep -d',' -f vcpu)
>
>   PID USER  PR  NI  VIRT  RES SHR S  %CPU %MEM    TIME+ COMMAND
> 14591 root  20   0  2448 1012 928 R 100.0  0.0 13:10.73 ./vcpu
> 14582 root  20   0  2448 1012 928 R 100.0  0.0 13:12.71 ./vcpu
> 14606 root  20   0  2448  872 784 R 100.0  0.0 13:09.72 ./vcpu
> 14620 root  20   0  2448  916 832 R 100.0  0.0 13:07.72 ./vcpu
> 14622 root  20   0  2448  920 836 R 100.0  0.0 13:06.72 ./vcpu
> 14629 root  20   0  2448  920 832 R 100.0  0.0 13:05.72 ./vcpu
> 14643 root  20   0  2448  924 836 R  21.0  0.0  2:37.13 ./vcpu
> 14645 root  20   0  2448  868 784 R  21.0  0.0  2:36.51 ./vcpu
> 14589 root  20   0  2448  900 816 R  20.0  0.0  2:45.16 ./vcpu
> 14608 root  20   0  2448  956 872 R  20.0  0.0  2:42.24 ./vcpu
> 14632 root  20   0  2448  872 788 R  20.0  0.0  2:38.08 ./vcpu
> 14638 root  20   0  2448  924 840 R  20.0  0.0  2:37.48 ./vcpu
> 14652 root  20   0  2448  928 844 R  20.0  0.0  2:36.42 ./vcpu
> 14654 root  20   0  2448  924 840 R  20.0  0.0  2:36.14 ./vcpu
> 14663 root  20   0  2448  900 816 R  20.0  0.0  2:35.38 ./vcpu
> 14669 root  20   0  2448  868 784 R  20.0  0.0  2:35.70 ./vcpu
>
Your script creates two cpusets with the same set of CPUs. The
scheduling of the tasks, however, is not controlled by cpuset; it is
controlled by the cpu cgroup. I suppose that all these tasks are in the
same cpu cgroup. It is possible that the commit you mentioned caused
some unfairness in allocating CPU time to different processes within
the same cpu cgroup. Maybe you can try putting them into separate cpu
cgroups with equal weight, as sketched below, to see if that improves
scheduling fairness?
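A minimal sketch of that experiment, assuming the cgroup v1 cpu
controller is mounted at /sys/fs/cgroup/cpu (the vcpu_a/vcpu_b names are
made up for illustration):

#!/bin/bash
# Two cpu cgroups with equal weight, one per VM's worth of hogs.
for g in vcpu_a vcpu_b
do
        mkdir /sys/fs/cgroup/cpu/$g
        echo 1024 > /sys/fs/cgroup/cpu/$g/cpu.shares    # equal weight
done

# 8 hogs per group, mirroring test.sh.
for g in vcpu_a vcpu_b
do
        for i in {1..8}
        do
                ./vcpu &
                echo $! > /sys/fs/cgroup/cpu/$g/tasks
        done
done

If group-level fairness holds, top should show each group's hogs getting
roughly half of the 8 usable CPUs in aggregate.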
BTW, you don't actually need two different cpusets if they both get the
same set of CPUs and memory nodes. Also, setting
cpuset.sched_load_balance=0 may not get what you want unless every
cpuset that uses those CPUs, including the root cgroup, has
cpuset.sched_load_balance set to 0. Turning off this flag disables load
balancing, but it does not by itself guarantee fairness: the outcome
depends on which CPUs the tasks happen to be on when they start, unless
you explicitly assign CPUs to them at start-up, as in the sketch below.
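For illustration, a sketch of what that full setup might look like with
the cgroup v1 layout from test.sh (a single cpuset suffices, per the note
above; the pin-after-attach ordering matters because attaching a task to
a cpuset resets its CPU affinity to the cpuset's mask):

#!/bin/bash
# Load balancing must be off in every cpuset covering these CPUs,
# including the root cpuset, for the sched domains to be rebuilt.
echo 0 > /sys/fs/cgroup/cpuset/cpuset.sched_load_balance
echo 0 > /sys/fs/cgroup/cpuset/vcpu_1/cpuset.sched_load_balance
echo 0 > /sys/fs/cgroup/cpuset/vcpu_2/cpuset.sched_load_balance

# With balancing off, tasks stay where they start, so spread the 16
# hogs over the 8 CPUs explicitly: two per CPU.
cpus=(0 1 2 3 80 81 82 83)
for i in {0..15}
do
        ./vcpu &
        pid=$!
        echo $pid > /sys/fs/cgroup/cpuset/vcpu_1/tasks
        taskset -pc ${cpus[$((i % 8))]} $pid    # pin after the attach
done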
Cheers,
Longman