Message-ID: <CAKfTPtDMuqJUKfKSJNXMCPP13SfhG_sXMF2VUMw=6DD1XmxhWg@mail.gmail.com>
Date: Tue, 17 Sep 2024 08:19:07 +0200
From: Vincent Guittot <vincent.guittot@...aro.org>
To: zhengzucheng <zhengzucheng@...wei.com>
Cc: Waiman Long <longman@...hat.com>, peterz@...radead.org, juri.lelli@...hat.com,
dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
mgorman@...e.de, vschneid@...hat.com, oleg@...hat.com,
Frederic Weisbecker <frederic@...nel.org>, mingo@...nel.org, peterx@...hat.com, tj@...nel.org,
tjcao980311@...il.com, linux-kernel@...r.kernel.org
Subject: Re: [Question] sched: the load is unbalanced in the VM overcommitment scenario
On Sat, 14 Sept 2024 at 09:04, zhengzucheng <zhengzucheng@...wei.com> wrote:
>
>
> On 2024/9/13 23:55, Vincent Guittot wrote:
> > On Fri, 13 Sept 2024 at 06:03, zhengzucheng <zhengzucheng@...wei.com> wrote:
> >> In a VM overcommitment scenario with a 1:2 overcommitment ratio, 8
> >> CPUs are overcommitted to 2 x 8-vCPU VMs,
> >> so 16 vCPUs are bound to 8 CPUs. However, one VM obtains only about 2
> >> CPUs' worth of resources while the other VM gets 6 CPUs.
> >> The host has 80 CPUs in one sched domain and the other CPUs are
> >> in the idle state.
> >> The root cause is that the load on the host is unbalanced: some vCPUs
> >> occupy CPU resources exclusively.
> >> When the CPU that triggers load balance calculates the imbalance value,
> >> env->imbalance = 0 is computed because
> >> local->avg_load > sds->avg_load. As a result, the load balance fails.
> >> The processing logic:
> >> https://github.com/torvalds/linux/commit/91dcf1e8068e9a8823e419a7a34ff4341275fb70
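> >>
> >> The check in question looks roughly like this (paraphrasing
> >> calculate_imbalance() in kernel/sched/fair.c as of the commit above;
> >> not the exact upstream source):
> >>
> >>         sds->avg_load = (sds->total_load * SCHED_CAPACITY_SCALE) /
> >>                         sds->total_capacity;
> >>
> >>         /*
> >>          * If the local group is more loaded than the average system
> >>          * load, don't try to pull any tasks.
> >>          */
> >>         if (local->avg_load >= sds->avg_load) {
> >>                 env->imbalance = 0;
> >>                 return;
> >>         }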
> >>
> >>
> >> This is expected behaviour from the kernel load balancer's point of
> >> view, but it is not reasonable from the perspective of VM users.
> >> In cgroup v1, we set cpuset.sched_load_balance=0 to modify the sched
> >> domains and fix it.
> >> Is there any other way to fix this problem? Thanks.
> > I'm not sure I understand your setup, or why load balancing is not
> > correctly balancing the 16 vCPUs across the 8 CPUs.
> >
> > From your test case description below, you have 8 always-running
> > threads in cgroup A and 8 always-running threads in cgroup B, and the 2
> > cgroups are restricted to only 8 CPUs out of 80. This should not be a
> > problem for load balancing. I tried something similar (although not
> > exactly the same) with cgroup v2 and rt-app, and I don't see a
> > noticeable imbalance.
> >
> > Do you have more details that you can share about your system?
> >
> > Which kernel version are you using? Which arch?
>
> kernel version: 6.11.0-rc7
> arch: X86_64 and cgroup v1
okay
>
> >> Abstracted reproduction case:
> >> 1.environment information:
> >>
> >> [root@...alhost ~]# cat /proc/schedstat
> >>
> >> cpu0
> >> domain0 00000000,00000000,00010000,00000000,00000001
> >> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> >> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
> >> cpu1
> >> domain0 00000000,00000000,00020000,00000000,00000002
> >> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> >> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
> >> cpu2
> >> domain0 00000000,00000000,00040000,00000000,00000004
> >> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> >> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
> >> cpu3
> >> domain0 00000000,00000000,00080000,00000000,00000008
> >> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> >> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
> > Is it correct to assume that domain0 is SMT, domain1 is MC and domain2 is PKG?
> > And that CPUs 80-83 are in the other group of PKG? And is LLC at the domain1 level?
>
> domain0 is SMT and domain1 is MC
> thread_siblings_list: 0,80; 1,81; 2,82; 3,83
Yeah, I should have read the domain0 cpumask more carefully
> LLC is at domain1 level
>
> >> 2.test case:
> >>
> >> vcpu.c
> >> #include <stdio.h>
> >> #include <unistd.h>
> >>
> >> int main()
> >> {
> >>         /* give test.sh time to move every instance into its cpuset */
> >>         sleep(20);
> >>         /* then spin forever to emulate an always-running vCPU thread */
> >>         while (1);
> >>         return 0;
> >> }
> >>
> >> gcc vcpu.c -o vcpu
> >> -----------------------------------------------------------------
> >> test.sh
> >>
> >> #!/bin/bash
> >>
> >> #vcpu1
> >> mkdir /sys/fs/cgroup/cpuset/vcpu_1
> >> echo '0-3, 80-83' > /sys/fs/cgroup/cpuset/vcpu_1/cpuset.cpus
> >> echo 0 > /sys/fs/cgroup/cpuset/vcpu_1/cpuset.mems
> >> for i in {1..8}
> >> do
> >>         ./vcpu &
> >>         pid=$!
> >>         sleep 1
> >>         echo $pid > /sys/fs/cgroup/cpuset/vcpu_1/tasks
> >> done
> >>
> >> #vcpu2
> >> mkdir /sys/fs/cgroup/cpuset/vcpu_2
> >> echo '0-3, 80-83' > /sys/fs/cgroup/cpuset/vcpu_2/cpuset.cpus
> >> echo 0 > /sys/fs/cgroup/cpuset/vcpu_2/cpuset.mems
> >> for i in {1..8}
> >> do
> >>         ./vcpu &
> >>         pid=$!
> >>         sleep 1
> >>         echo $pid > /sys/fs/cgroup/cpuset/vcpu_2/tasks
> >> done
> >> ------------------------------------------------------------------
> >> [root@...alhost ~]# ./test.sh
> >>
> >> [root@...alhost ~]# top -d 1 -c -p $(pgrep -d',' -f vcpu)
> >>
> >>   PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> >> 14591 root 20 0 2448 1012 928 R 100.0 0.0 13:10.73 ./vcpu
> >> 14582 root 20 0 2448 1012 928 R 100.0 0.0 13:12.71 ./vcpu
> >> 14606 root 20 0 2448 872 784 R 100.0 0.0 13:09.72 ./vcpu
> >> 14620 root 20 0 2448 916 832 R 100.0 0.0 13:07.72 ./vcpu
> >> 14622 root 20 0 2448 920 836 R 100.0 0.0 13:06.72 ./vcpu
> >> 14629 root 20 0 2448 920 832 R 100.0 0.0 13:05.72 ./vcpu
> >> 14643 root 20 0 2448 924 836 R 21.0 0.0 2:37.13 ./vcpu
> >> 14645 root 20 0 2448 868 784 R 21.0 0.0 2:36.51 ./vcpu
> >> 14589 root 20 0 2448 900 816 R 20.0 0.0 2:45.16 ./vcpu
> >> 14608 root 20 0 2448 956 872 R 20.0 0.0 2:42.24 ./vcpu
> >> 14632 root 20 0 2448 872 788 R 20.0 0.0 2:38.08 ./vcpu
> >> 14638 root 20 0 2448 924 840 R 20.0 0.0 2:37.48 ./vcpu
> >> 14652 root 20 0 2448 928 844 R 20.0 0.0 2:36.42 ./vcpu
> >> 14654 root 20 0 2448 924 840 R 20.0 0.0 2:36.14 ./vcpu
> >> 14663 root 20 0 2448 900 816 R 20.0 0.0 2:35.38 ./vcpu
> >> 14669 root 20 0 2448 868 784 R 20.0 0.0 2:35.70 ./vcpu
> >>
So I finally understand your situation. The limited cpuset screws up
the system's avg load for domain1. The group_imbalanced state is there
to try to fix an imbalance caused by tasks that are pinned to a subset
of the CPUs of the sched domain, but this can't cover all cases.
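
For reference, that state comes from roughly this classification
(paraphrasing sg_imbalanced() and group_classify() in kernel/sched/fair.c;
not the exact upstream source). The sgc->imbalance flag is set by the
load balancer when task affinity prevented it from moving tasks:

        /* group left imbalanced because of pinned tasks */
        static inline int sg_imbalanced(struct sched_group *group)
        {
                return group->sgc->imbalance;
        }

        /* in group_classify(), after the group_overloaded check: */
        if (sg_imbalanced(group))
                return group_imbalanced;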