[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAKfTPtC9050oY2EikUTAXTL8pAui3L+Sr4DBS0T-TccGNaA2hw@mail.gmail.com>
Date: Tue, 16 May 2023 15:36:19 +0200
From: Vincent Guittot <vincent.guittot@...aro.org>
To: Tobias Huschle <huschle@...ux.ibm.com>
Cc: linux-kernel@...r.kernel.org, mingo@...hat.com,
peterz@...radead.org, juri.lelli@...hat.com,
dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
mgorman@...e.de, bristot@...hat.com, vschneid@...hat.com,
sshegde@...ux.vnet.ibm.com, srikar@...ux.vnet.ibm.com,
linuxppc-dev@...ts.ozlabs.org
Subject: Re: [RFC 1/1] sched/fair: Consider asymmetric scheduler groups in
load balancer
On Mon, 15 May 2023 at 13:46, Tobias Huschle <huschle@...ux.ibm.com> wrote:
>
> The current load balancer implementation implies that scheduler groups,
> within the same domain, all host the same number of CPUs. This is
> reflected in the condition, that a scheduler group, which is load
> balancing and classified as having spare capacity, should pull work
> from the busiest group, if the local group runs less processes than
> the busiest one. This implies that these two groups should run the
> same number of processes, which is problematic if the groups are not
> of the same size.
>
> The assumption that scheduler groups within the same scheduler domain
> host the same number of CPUs appears to be true for non-s390
> architectures. Nevertheless, s390 can have scheduler groups of unequal
> size.
>
> This introduces a performance degredation in the following scenario:
>
> Consider a system with 8 CPUs, 6 CPUs are located on one CPU socket,
> the remaining 2 are located on another socket:
>
> Socket -----1----- -2-
> CPU 1 2 3 4 5 6 7 8
>
> Placing some workload ( x = one task ) yields the following
> scenarios:
>
> The first 5 tasks are distributed evenly across the two groups.
>
> Socket -----1----- -2-
> CPU 1 2 3 4 5 6 7 8
> x x x x x
>
> Adding a 6th task yields the following distribution:
>
> Socket -----1----- -2-
> CPU 1 2 3 4 5 6 7 8
> SMT1 x x x x x
> SMT2 x
Your description is a bit confusing for me. What you name CPU above
should be named Core, doesn' it ?
Could you share with us your scheduler topology ?
>
> The task is added to the 2nd scheduler group, as the scheduler has the
> assumption that scheduler groups are of the same size, so they should
> also host the same number of tasks. This makes CPU 7 run into SMT
> thread, which comes with a performance penalty. This means, that in
> the window of 6-8 tasks, load balancing is done suboptimally, because
> SMT is used although there is no reason to do so as fully idle CPUs
> are still available.
>
> Taking the weight of the scheduler groups into account, ensures that
> a load balancing CPU within a smaller group will not try to pull tasks
> from a bigger group while the bigger group still has idle CPUs
> available.
>
> Signed-off-by: Tobias Huschle <huschle@...ux.ibm.com>
> ---
> kernel/sched/fair.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 48b6f0ca13ac..b1307d7e4065 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10426,7 +10426,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
> * group's child domain.
> */
> if (sds.prefer_sibling && local->group_type == group_has_spare &&
> - busiest->sum_nr_running > local->sum_nr_running + 1)
> + busiest->sum_nr_running * local->group_weight >
> + local->sum_nr_running * busiest->group_weight + 1)
This is the prefer_sibling path. Could it be that you should disable
prefer_siling between your sockets for such topology ? the default
path compares the number of idle CPUs when groups has spare capacity
> goto force_balance;
>
> if (busiest->group_type != group_overloaded) {
> --
> 2.34.1
>
Powered by blists - more mailing lists