Message-ID: <4c28b46b59bcc083956757074d1fe059@linux.ibm.com>
Date: Tue, 04 Jul 2023 11:11:00 +0200
From: Tobias Huschle <huschle@...ux.ibm.com>
To: Dietmar Eggemann <dietmar.eggemann@....com>
Cc: linux-kernel@...r.kernel.org, mingo@...hat.com,
peterz@...radead.org, juri.lelli@...hat.com,
vincent.guittot@...aro.org, rostedt@...dmis.org,
bsegall@...gle.com, mgorman@...e.de, bristot@...hat.com,
vschneid@...hat.com, sshegde@...ux.vnet.ibm.com,
srikar@...ux.vnet.ibm.com, linuxppc-dev@...ts.ozlabs.org
Subject: Re: [RFC 0/1] sched/fair: Consider asymmetric scheduler groups in
load balancer
On 2023-05-16 18:35, Dietmar Eggemann wrote:
> On 15/05/2023 13:46, Tobias Huschle wrote:
>> The current load balancer implementation implies that scheduler groups,
>> within the same scheduler domain, all host the same number of CPUs.
>>
>> This appears to be valid for non-s390 architectures. Nevertheless, s390
>> can actually have scheduler groups of unequal size.
>
> Arm (classical) big.Little had this for years before we switched to flat
> scheduling (only MC sched domain) over CPU capacity boundaries for Arm
> DynamIQ.
>
> Arm64 Juno platform in mainline:
>
> root@...o:~# cat /sys/devices/system/cpu/cpu*/topology/cluster_cpus_list
> 0,3-5
> 1-2
> 1-2
> 0,3-5
> 0,3-5
> 0,3-5
>
> root@...o:~# cat /proc/schedstat | grep ^domain | awk '{print $1, $2}'
>
> domain0 39 <--
> domain1 3f
> domain0 06 <--
> domain1 3f
> domain0 06
> domain1 3f
> domain0 39
> domain1 3f
> domain0 39
> domain1 3f
> domain0 39
> domain1 3f
>
> root@...o:~# cat /sys/kernel/debug/sched/domains/cpu0/domain*/name
> MC
> DIE
>
> But we don't have SMT on the mobile processors.
>
> It looks like you are only interested to get group_weight dependency
> into this 'prefer_sibling' condition of find_busiest_group()?
>
Sorry, it looks like your reply got caught by a bad filter in my mail
program. Let me answer, although it's a bit late.

Yes, I would like to get the group_weight into the prefer_sibling path.
Unfortunately, we cannot go for a flat hierarchy, as the s390 hardware
allows CPUs to be pretty far apart (cache-wise), which means the load
balancer should avoid moving tasks back and forth between those CPUs
if possible.

We can't remove SD_PREFER_SIBLING either, as this would cause the load
balancer to aim for the same number of idle CPUs in all groups,
which is a problem as well with asymmetric groups, for example:

With SD_PREFER_SIBLING, aiming for the same number of non-idle CPUs:
00 01 02 03 04 05 06 07 08 09 10 11 || 12 13 14 15
x x x x x x x x

Without SD_PREFER_SIBLING, aiming for the same number of idle CPUs:
00 01 02 03 04 05 06 07 08 09 10 11 || 12 13 14 15
x x x x x x x x

Hence the idea to add the group_weight to the prefer_sibling path.

I was wondering if this would be the right place to address this issue
or if I should go down another route.
> We in (classical) big.LITTLE (sd flag SD_ASYM_CPUCAPACITY) remove
> SD_PREFER_SIBLING from sd->child so we don't run this condition.
>
>> The current scheduler behavior causes some s390 configs to use SMT
>> while some cores are still idle, leading to a performance degradation
>> under certain levels of workload.
>>
>> Please refer to the patch's commit message for more details and an
>> example. This patch is a proposal on how to integrate the size of
>> scheduler groups into the decision process.
>>
>> This patch is the most basic approach to address this issue and does
>> not claim to be perfect as-is.
>>
>> Other ideas that also proved to address the problem, and that are more
>> complex but potentially more precise:
>> 1. On scheduler group building, count the number of CPUs within each
>> group that are first in their sibling mask. This represents the
>> number of CPUs that can be used before running into SMT. This
>> should be slightly more accurate than using the full group weight
>> if the number of available SMT threads per core varies.
>> 2. Introduce a new scheduler group classification (smt_busy) in
>> between fully_busy and has_spare. This classification would
>> indicate that a group still has spare capacity, but will run
>> into SMT when using that capacity. This would make the load
>> balancer prefer groups with fully idle CPUs over ones that are
>> about to run into SMT.
>>
>> Feedback would be greatly appreciated.
>>
>> Tobias Huschle (1):
>> sched/fair: Consider asymmetric scheduler groups in load balancer
>>
>> kernel/sched/fair.c | 3 ++-
>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>