Message-ID: <CAKfTPtCdfzZ9Wxr7+zH5WW171LJGttgzto4W2wH9mm4d0jcTLg@mail.gmail.com>
Date: Fri, 13 Sep 2024 17:55:23 +0200
From: Vincent Guittot <vincent.guittot@...aro.org>
To: zhengzucheng <zhengzucheng@...wei.com>
Cc: Waiman Long <longman@...hat.com>, peterz@...radead.org, juri.lelli@...hat.com, 
	dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com, 
	mgorman@...e.de, vschneid@...hat.com, oleg@...hat.com, 
	Frederic Weisbecker <frederic@...nel.org>, mingo@...nel.org, peterx@...hat.com, tj@...nel.org, 
	tjcao980311@...il.com, linux-kernel@...r.kernel.org
Subject: Re: [Question] sched:the load is unbalanced in the VM overcommitment scenario

On Fri, 13 Sept 2024 at 06:03, zhengzucheng <zhengzucheng@...wei.com> wrote:
>
> In a VM overcommitment scenario with a 1:2 overcommitment ratio, 8 CPUs
> are overcommitted to two 8-vCPU VMs, so 16 vCPUs are bound to 8 CPUs.
> However, one VM gets only 2 CPUs' worth of resources while the other VM
> gets 6 CPUs' worth.
> The host is configured with 80 CPUs in one sched domain, and the other
> CPUs are idle.
> The root cause is that the host load is unbalanced: some vCPUs occupy
> CPU resources exclusively. When the CPU that triggers load balance
> calculates the imbalance value, env->imbalance = 0 is computed because
> local->avg_load > sds->avg_load. As a result, the load balance fails.
> The processing logic:
> https://github.com/torvalds/linux/commit/91dcf1e8068e9a8823e419a7a34ff4341275fb70
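
For reference, a simplified, self-contained sketch of the imbalance
calculation referred to above, loosely modeled on calculate_imbalance()
in kernel/sched/fair.c; the exact kernel code differs between versions,
and the structs and numbers below are illustrative only:

#include <stdio.h>

#define SCHED_CAPACITY_SCALE 1024UL

struct sg_stats {                       /* per sched-group statistics */
        unsigned long group_load;
        unsigned long group_capacity;
        unsigned long avg_load;
};

struct sd_stats {                       /* whole sched-domain statistics */
        unsigned long total_load;
        unsigned long total_capacity;
        unsigned long avg_load;
        struct sg_stats local;          /* group of the CPU doing the balance */
        struct sg_stats busiest;        /* busiest group found in the domain */
};

static unsigned long calc_imbalance(struct sd_stats *sds)
{
        struct sg_stats *local = &sds->local;
        struct sg_stats *busiest = &sds->busiest;
        unsigned long pull_busiest, pull_local;

        local->avg_load = local->group_load * SCHED_CAPACITY_SCALE /
                          local->group_capacity;
        busiest->avg_load = busiest->group_load * SCHED_CAPACITY_SCALE /
                            busiest->group_capacity;
        sds->avg_load = sds->total_load * SCHED_CAPACITY_SCALE /
                        sds->total_capacity;

        /*
         * If the local group already sits at or above the busiest group or
         * the domain-wide average, pulling tasks would only move the
         * imbalance around, so the balancer gives up: this is the
         * env->imbalance = 0 case hit when the domain average is diluted
         * by the many idle CPUs.
         */
        if (local->avg_load >= busiest->avg_load ||
            local->avg_load >= sds->avg_load)
                return 0;

        /* otherwise pull enough to bring both groups toward the average */
        pull_busiest = (busiest->avg_load - sds->avg_load) * busiest->group_capacity;
        pull_local   = (sds->avg_load - local->avg_load) * local->group_capacity;
        return (pull_busiest < pull_local ? pull_busiest : pull_local) /
               SCHED_CAPACITY_SCALE;
}

int main(void)
{
        /*
         * Toy numbers loosely modeled on the report: the local group runs
         * 2 busy tasks on 4 CPUs, the busiest group 6 busy tasks on 4 CPUs,
         * and the domain spans 80 CPUs with only those 8 tasks in total.
         */
        struct sd_stats sds = {
                .total_load     = 8 * SCHED_CAPACITY_SCALE,
                .total_capacity = 80 * SCHED_CAPACITY_SCALE,
                .local   = { .group_load = 2 * SCHED_CAPACITY_SCALE,
                             .group_capacity = 4 * SCHED_CAPACITY_SCALE },
                .busiest = { .group_load = 6 * SCHED_CAPACITY_SCALE,
                             .group_capacity = 4 * SCHED_CAPACITY_SCALE },
        };

        /* prints 0: local avg (512) > domain avg (~102), so nothing is pulled */
        printf("imbalance = %lu\n", calc_imbalance(&sds));
        return 0;
}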
>
>
> This is expected behaviour for the kernel load balancer, but it is not
> reasonable from the perspective of the VM users.
> In cgroup v1, setting cpuset.sched_load_balance=0 to modify the sched
> domain fixes it.
> Is there any other way to fix this problem? Thanks.
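
A minimal sketch of one way to apply that cgroup v1 workaround, using the
vcpu_1/vcpu_2 cpusets from the reproduction below; exactly which cpuset
needs sched_load_balance cleared depends on the existing hierarchy:

# Clearing sched_load_balance in the root cpuset makes the kernel rebuild
# the sched domains from the child cpusets (which keep the default value
# of 1), so the 8 CPUs in '0-3,80-83' get their own small sched domain
# instead of being balanced inside the 80-CPU one. CPUs not covered by
# any balancing cpuset are then left without load balancing at all.
echo 0 > /sys/fs/cgroup/cpuset/cpuset.sched_load_balance
cat /sys/fs/cgroup/cpuset/vcpu_1/cpuset.sched_load_balance   # still 1
cat /sys/fs/cgroup/cpuset/vcpu_2/cpuset.sched_load_balance   # still 1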

I'm not sure I understand your setup, or why the load balancer is not
correctly balancing the 16 vCPUs across the 8 CPUs.

From your test case description below, you have 8 always-running threads
in cgroup A and 8 always-running threads in cgroup B, and the 2 cgroups
share only 8 of the 80 CPUs. This should not be a problem for load
balancing. I tried something similar, although not exactly the same, with
cgroup v2 and rt-app, and I don't see a noticeable imbalance.
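
Not the rt-app setup itself, but a rough cgroup v2 analogue of the
reproduction below (assuming a cgroup2-only mount at /sys/fs/cgroup and
reusing the ./vcpu busy loop) would be something like:

#!/bin/bash
# enable the cpuset controller for child cgroups, then mirror the two
# overlapping 8-CPU groups from the cgroup v1 reproduction below
echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
for grp in vcpu_1 vcpu_2; do
        mkdir -p /sys/fs/cgroup/$grp
        echo '0-3,80-83' > /sys/fs/cgroup/$grp/cpuset.cpus
        echo 0 > /sys/fs/cgroup/$grp/cpuset.mems
        for i in {1..8}; do
                ./vcpu &                     # same busy loop as in vcpu.c below
                echo $! > /sys/fs/cgroup/$grp/cgroup.procs
        done
done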

Do you have more details that you can share about your system?

Which kernel version are you using? Which arch?

>
> Abstracted reproduction case:
> 1.environment information:
>
> [root@...alhost ~]# cat /proc/schedstat
>
> cpu0
> domain0 00000000,00000000,00010000,00000000,00000001
> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
> cpu1
> domain0 00000000,00000000,00020000,00000000,00000002
> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
> cpu2
> domain0 00000000,00000000,00040000,00000000,00000004
> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
> cpu3
> domain0 00000000,00000000,00080000,00000000,00000008
> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff

Is it correct to assume that domain0 is SMT, domain1 is MC, and domain2 is
PKG? And that cpu80-83 are in the other group at the PKG level, and the
LLC is at the domain1 level?
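
One way to confirm that on your side, assuming CONFIG_SCHED_DEBUG (the
files moved from procfs to debugfs in newer kernels):

# print the sched-domain names (SMT/MC/PKG/NUMA, ...) seen by cpu0
grep . /sys/kernel/debug/sched/domains/cpu0/domain*/name 2>/dev/null || \
    grep . /proc/sys/kernel/sched_domain/cpu0/domain*/name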

>
> 2.test case:
>
> vcpu.c
> #include <stdio.h>
> #include <unistd.h>
>
> int main()
> {
>          /* wait while test.sh finishes moving the tasks into their cpusets */
>          sleep(20);
>          /* then spin forever to model an always-busy vCPU */
>          while (1);
>          return 0;
> }
>
> gcc vcpu.c -o vcpu
> -----------------------------------------------------------------
> test.sh
>
> #!/bin/bash
>
> #vcpu1
> mkdir /sys/fs/cgroup/cpuset/vcpu_1
> echo '0-3, 80-83' > /sys/fs/cgroup/cpuset/vcpu_1/cpuset.cpus
> echo 0 > /sys/fs/cgroup/cpuset/vcpu_1/cpuset.mems
> for i in {1..8}
> do
>          ./vcpu &
>          pid=$!
>          sleep 1
>          echo $pid > /sys/fs/cgroup/cpuset/vcpu_1/tasks
> done
>
> #vcpu2
> mkdir /sys/fs/cgroup/cpuset/vcpu_2
> echo '0-3, 80-83' > /sys/fs/cgroup/cpuset/vcpu_2/cpuset.cpus
> echo 0 > /sys/fs/cgroup/cpuset/vcpu_2/cpuset.mems
> for i in {1..8}
> do
>          ./vcpu &
>          pid=$!
>          sleep 1
>          echo $pid > /sys/fs/cgroup/cpuset/vcpu_2/tasks
> done
> ------------------------------------------------------------------
> [root@...alhost ~]# ./test.sh
>
> [root@...alhost ~]# top -d 1 -c -p $(pgrep -d',' -f vcpu)
>
> 14591 root      20   0    2448   1012    928 R 100.0   0.0 13:10.73 ./vcpu
> 14582 root      20   0    2448   1012    928 R 100.0   0.0 13:12.71 ./vcpu
> 14606 root      20   0    2448    872    784 R 100.0   0.0 13:09.72 ./vcpu
> 14620 root      20   0    2448    916    832 R 100.0   0.0 13:07.72 ./vcpu
> 14622 root      20   0    2448    920    836 R 100.0   0.0 13:06.72 ./vcpu
> 14629 root      20   0    2448    920    832 R 100.0   0.0 13:05.72 ./vcpu
> 14643 root      20   0    2448    924    836 R  21.0   0.0 2:37.13 ./vcpu
> 14645 root      20   0    2448    868    784 R  21.0   0.0 2:36.51 ./vcpu
> 14589 root      20   0    2448    900    816 R  20.0   0.0 2:45.16 ./vcpu
> 14608 root      20   0    2448    956    872 R  20.0   0.0 2:42.24 ./vcpu
> 14632 root      20   0    2448    872    788 R  20.0   0.0 2:38.08 ./vcpu
> 14638 root      20   0    2448    924    840 R  20.0   0.0 2:37.48 ./vcpu
> 14652 root      20   0    2448    928    844 R  20.0   0.0 2:36.42 ./vcpu
> 14654 root      20   0    2448    924    840 R  20.0   0.0 2:36.14 ./vcpu
> 14663 root      20   0    2448    900    816 R  20.0   0.0 2:35.38 ./vcpu
> 14669 root      20   0    2448    868    784 R  20.0   0.0 2:35.70 ./vcpu
>
