Message-ID: <3fd8aa75-ce1b-4d5a-aada-0b2cfbedb36c@redhat.com>
Date: Fri, 13 Sep 2024 13:17:15 -0400
From: Waiman Long <longman@...hat.com>
To: zhengzucheng <zhengzucheng@...wei.com>, peterz@...radead.org,
juri.lelli@...hat.com, vincent.guittot@...aro.org, dietmar.eggemann@....com,
rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
vschneid@...hat.com, oleg@...hat.com,
Frederic Weisbecker <frederic@...nel.org>, mingo@...nel.org,
peterx@...hat.com, tj@...nel.org, tjcao980311@...il.com
Cc: linux-kernel@...r.kernel.org
Subject: Re: [Question] sched:the load is unbalanced in the VM overcommitment scenario
On 9/13/24 00:03, zhengzucheng wrote:
> In a VM overcommitment scenario with a 1:2 overcommitment ratio, 8
> CPUs are shared by two 8-vCPU VMs, i.e. 16 vCPUs are bound to 8 CPUs.
> However, one VM obtains only 2 CPUs' worth of time while the other VM
> gets 6 CPUs' worth.
> The host has 80 CPUs in one sched domain; the other CPUs are idle.
> The root cause is that the host load is unbalanced: some vCPUs occupy
> CPU resources exclusively.
> When the CPU that triggers load balancing calculates the imbalance
> value, env->imbalance comes out as 0 because
> local->avg_load > sds->avg_load. As a result, load balancing fails.
> The processing logic:
> https://github.com/torvalds/linux/commit/91dcf1e8068e9a8823e419a7a34ff4341275fb70
>
>
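For readers following along, here is a stand-alone toy model of the check
described above (the local->avg_load vs. sds->avg_load comparison lives in
calculate_imbalance() in kernel/sched/fair.c; the numbers below are made
up, and the real function handles many more cases):

#include <stdio.h>

#define SCHED_CAPACITY_SCALE 1024UL

/* avg_load as the kernel computes it: group load scaled by capacity. */
static unsigned long avg_load(unsigned long load, unsigned long capacity)
{
        return load * SCHED_CAPACITY_SCALE / capacity;
}

int main(void)
{
        /* Illustrative numbers only: a busy 8-CPU group holding all 16
         * hogs, inside an 80-CPU domain that is otherwise idle. */
        unsigned long local = avg_load(16 * 1024UL, 8 * SCHED_CAPACITY_SCALE);
        unsigned long sds   = avg_load(16 * 1024UL, 80 * SCHED_CAPACITY_SCALE);
        unsigned long imbalance;

        /*
         * The check in question: if the local group is already more
         * loaded than the domain average, don't pull any tasks.
         */
        if (local >= sds)
                imbalance = 0;
        else
                imbalance = sds - local;        /* grossly simplified */

        printf("local=%lu sds=%lu imbalance=%lu\n", local, sds, imbalance);
        return 0;
}

With these numbers it prints local=2048 sds=204 imbalance=0: the group is
far above the domain average, so no tasks are pulled, even though CPUs
inside the group are unevenly loaded.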
> This is normal from the kernel load balancer's point of view, but it
> is not reasonable from the perspective of VM users.
> In cgroup v1, setting cpuset.sched_load_balance=0 to modify the sched
> domains works around it.
> Is there any other method to fix this problem? Thanks.
>
> Abstracted reproduction case:
> 1. Environment information:
>
> [root@...alhost ~]# cat /proc/schedstat
>
> cpu0
> domain0 00000000,00000000,00010000,00000000,00000001
> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
> cpu1
> domain0 00000000,00000000,00020000,00000000,00000002
> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
> cpu2
> domain0 00000000,00000000,00040000,00000000,00000004
> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
> cpu3
> domain0 00000000,00000000,00080000,00000000,00000008
> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
>
> 2. Test case:
>
> vcpu.c
>
> /* CPU hog standing in for a busy vCPU: sleep while the remaining
>  * instances are started, then spin forever. */
> #include <unistd.h>
>
> int main(void)
> {
>         sleep(20);      /* wait for all instances to be launched */
>         while (1);      /* burn CPU */
>         return 0;
> }
>
> gcc vcpu.c -o vcpu
> -----------------------------------------------------------------
> test.sh
>
> #!/bin/bash
>
> # vcpu_1: 8 CPU hogs confined to CPUs 0-3 and 80-83
> mkdir /sys/fs/cgroup/cpuset/vcpu_1
> echo '0-3, 80-83' > /sys/fs/cgroup/cpuset/vcpu_1/cpuset.cpus
> echo 0 > /sys/fs/cgroup/cpuset/vcpu_1/cpuset.mems
> for i in {1..8}
> do
>         ./vcpu &
>         pid=$!
>         sleep 1
>         echo $pid > /sys/fs/cgroup/cpuset/vcpu_1/tasks
> done
>
> # vcpu_2: 8 more hogs on the same CPUs
> mkdir /sys/fs/cgroup/cpuset/vcpu_2
> echo '0-3, 80-83' > /sys/fs/cgroup/cpuset/vcpu_2/cpuset.cpus
> echo 0 > /sys/fs/cgroup/cpuset/vcpu_2/cpuset.mems
> for i in {1..8}
> do
>         ./vcpu &
>         pid=$!
>         sleep 1
>         echo $pid > /sys/fs/cgroup/cpuset/vcpu_2/tasks
> done
> ------------------------------------------------------------------
> [root@...alhost ~]# ./test.sh
>
> [root@...alhost ~]# top -d 1 -c -p $(pgrep -d',' -f vcpu)
>
>   PID USER  PR  NI  VIRT  RES SHR S  %CPU %MEM    TIME+ COMMAND
> 14591 root  20   0  2448 1012 928 R 100.0  0.0 13:10.73 ./vcpu
> 14582 root  20   0  2448 1012 928 R 100.0  0.0 13:12.71 ./vcpu
> 14606 root  20   0  2448  872 784 R 100.0  0.0 13:09.72 ./vcpu
> 14620 root  20   0  2448  916 832 R 100.0  0.0 13:07.72 ./vcpu
> 14622 root  20   0  2448  920 836 R 100.0  0.0 13:06.72 ./vcpu
> 14629 root  20   0  2448  920 832 R 100.0  0.0 13:05.72 ./vcpu
> 14643 root  20   0  2448  924 836 R  21.0  0.0  2:37.13 ./vcpu
> 14645 root  20   0  2448  868 784 R  21.0  0.0  2:36.51 ./vcpu
> 14589 root  20   0  2448  900 816 R  20.0  0.0  2:45.16 ./vcpu
> 14608 root  20   0  2448  956 872 R  20.0  0.0  2:42.24 ./vcpu
> 14632 root  20   0  2448  872 788 R  20.0  0.0  2:38.08 ./vcpu
> 14638 root  20   0  2448  924 840 R  20.0  0.0  2:37.48 ./vcpu
> 14652 root  20   0  2448  928 844 R  20.0  0.0  2:36.42 ./vcpu
> 14654 root  20   0  2448  924 840 R  20.0  0.0  2:36.14 ./vcpu
> 14663 root  20   0  2448  900 816 R  20.0  0.0  2:35.38 ./vcpu
> 14669 root  20   0  2448  868 784 R  20.0  0.0  2:35.70 ./vcpu
>
Your script creates two cpusets with the same set of CPUs. The
scheduling of the tasks, however, is not controlled by cpuset; it is
controlled by the cpu cgroup. I suppose that all these tasks are in the
same cpu cgroup. It is possible that the commit you mentioned caused
some unfairness in allocating CPU time to different processes within
the same cpu cgroup. Maybe you can try putting them into separate cpu
cgroups with equal weight, as sketched below, to see if that improves
scheduling fairness?
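A minimal sketch of that experiment, assuming the cgroup v1 cpu
controller is mounted at /sys/fs/cgroup/cpu (the vcpu_a/vcpu_b names are
made up for illustration):

#!/bin/bash
# Two cpu cgroups with equal weight, one per VM's worth of hogs.
for g in vcpu_a vcpu_b
do
        mkdir /sys/fs/cgroup/cpu/$g
        echo 1024 > /sys/fs/cgroup/cpu/$g/cpu.shares    # equal weight
done

# 8 hogs per group, mirroring test.sh.
for g in vcpu_a vcpu_b
do
        for i in {1..8}
        do
                ./vcpu &
                echo $! > /sys/fs/cgroup/cpu/$g/tasks
        done
done

If group-level fairness holds, top should show each group's hogs getting
roughly half of the 8 usable CPUs in aggregate.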
BTW, you don't actually need two different cpusets if they both get the
same set of CPUs and memory nodes. Also, setting
cpuset.sched_load_balance=0 may not get what you want unless every
cpuset that uses those CPUs, including the root cgroup, has
cpuset.sched_load_balance set to 0. Turning off this flag disables load
balancing, but it does not by itself guarantee fairness: the outcome
depends on which CPUs the tasks happen to be on when they start, unless
you explicitly assign CPUs to them at start-up, as in the sketch below.
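For illustration, a sketch of what that full setup might look like with
the cgroup v1 layout from test.sh (a single cpuset suffices, per the note
above; the pin-after-attach ordering matters because attaching a task to
a cpuset resets its CPU affinity to the cpuset's mask):

#!/bin/bash
# Load balancing must be off in every cpuset covering these CPUs,
# including the root cpuset, for the sched domains to be rebuilt.
echo 0 > /sys/fs/cgroup/cpuset/cpuset.sched_load_balance
echo 0 > /sys/fs/cgroup/cpuset/vcpu_1/cpuset.sched_load_balance
echo 0 > /sys/fs/cgroup/cpuset/vcpu_2/cpuset.sched_load_balance

# With balancing off, tasks stay where they start, so spread the 16
# hogs over the 8 CPUs explicitly: two per CPU.
cpus=(0 1 2 3 80 81 82 83)
for i in {0..15}
do
        ./vcpu &
        pid=$!
        echo $pid > /sys/fs/cgroup/cpuset/vcpu_1/tasks
        taskset -pc ${cpus[$((i % 8))]} $pid    # pin after the attach
done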
Cheers,
Longman