linux-kernel - Re: [RFC PATCH] sched/fair: Fix impossible migrate

Message-ID: <20230724161038.nreywdwayiq2ypty@airbuntu>
Date:   Mon, 24 Jul 2023 17:10:38 +0100
From:   Qais Yousef <qyousef@...alina.io>
To:     Dietmar Eggemann <dietmar.eggemann@....com>
Cc:     Vincent Guittot <vincent.guittot@...aro.org>,
        Ingo Molnar <mingo@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>,
        linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH] sched/fair: Fix impossible migrate_util scenario in
 load balance

On 07/24/23 14:58, Dietmar Eggemann wrote:
> On 22/07/2023 00:04, Qais Yousef wrote:
> > On 07/21/23 15:52, Vincent Guittot wrote:
> >> On Friday 21 Jul 2023 at 11:57:11 (+0100), Qais Yousef wrote:
> >>> On 07/20/23 14:31, Vincent Guittot wrote:
> >>>
> >>>> I was trying to reproduce the behavior but I was failing until I
> >>>> realized that this code path is used when the 2 groups are not sharing
> >>>> their cache. Which topology do you use? I thought that DynamIQ, with a
> >>>> cache shared between all 8 CPUs, was the norm for arm64 embedded
> >>>> devices now.
> >>>
> >>> Hmm good question. Phantom domains didn't die, which I think is what is
> >>> causing this. I can look into whether this is for a good reason or just
> >>> a historical artifact.
> >>>
> >>>>
> >>>> Also when you say "the little cluster capacity is very small nowadays
> >>>> (around 200 or less)", is it the capacity of 1 core or of the cluster?
> >>>
> >>> I meant one core. So in my case all the littles were busy except for one that
> >>> was mostly idle and never pulled a task from mid, where two tasks were stuck
> >>> on a CPU. And the logs I have added were showing me that env->imbalance
> >>> was in the 150+ range but the task we would pull was in the 350+ range.
> >>
> >> I'm not able to reproduce your problem with v6.5-rc2 and without phantom domain,
> >> which is expected because we share cache and weight is 1 so we use the path
> >>
> >> 		if (busiest->group_weight == 1 || sds->prefer_sibling) {
> >> 			/*
> >> 			 * When prefer sibling, evenly spread running tasks on
> >> 			 * groups.
> >> 			 */
> >> 			env->migration_type = migrate_task;
> >> 			env->imbalance = sibling_imbalance(env, sds, busiest, local);
> >> 		} else {
> >>
> > 
> > I missed the deps on topology. So yes you're right, this needs to be addressed
> > first. I seem to remember Sudeep merged some stuff that will flatten these
> > topologies.
> > 
> > Let me chase this topology thing out first.
> 
> Sudeep's patches align topology cpumasks with cache cpumasks.
> 
> tip/sched/core:
> 
> root@...o:~# cat /sys/devices/system/cpu/cpu*/topology/package_cpus
> 3f
> 3f
> 3f
> 3f
> 3f
> 3f
> 
> v5.9:
> 
> root@...o:~# cat /sys/devices/system/cpu/cpu*/topology/package_cpus
> 39
> 06
> 06
> 39
> 39
> 39
> 
> So Android userspace won't be able to detect uArch boundaries via
> `package_cpus` any longer.
> 
> The phantom domain (DIE) in Android is a legacy decision from within
> Android. The pre-mainline Energy Model was attached to the sched domain
> topology hierarchy. And then IMHO other Android functionality started to
> rely on this. It could be removed regardless of Sudeep's patches in case
> Android would be OK with it.
> 
> The phantom domain is probably set up via DT cpu_map entry:
> 
> cpu-map {
>   cluster0 { <-- enforce phantom domain
>     core0 {
>       cpu = <&CPU0>;
>     };
> ...
>     core3 {
>       cpu = <&CPU1>;
>     };
>   cluster1 {
> ...
> 
> Juno (arch/arm64/boot/dts/arm/juno.dts) also uses cpu-map to enforce
> uarch boundaries on DIE group boundary congruence.
> 
> tip/sched/core:
> 
> # cat /proc/schedstat | awk '{ print $1 " " $2}' | head -5
> ...
> cpu0 0
> domain0 39
> domain1 3f
> 
> v5.9:
> 
> # cat /proc/schedstat | awk '{ print $1 " " $2}' | head -5
> ...
> cpu0 0
> domain0 39
> domain1 3f
> 
> We had a talk at LPC '22 about the influence of the patch-set and the
> phantom domain legacy issue:
> 
> https://lpc.events/event/16/contributions/1342/attachments/962/1883/LPC-2022-Android-MC-Phantom-Domains.pdf
> 
> [...]

Thanks Dietmar!

So I actually moved everything to a single cluster and this indeed solves the
lb() issue. But when I looked at the mainline DTs I saw that they still define
a separate cluster for each uArch, which left me unsure whether I did the
right thing. It also made me wonder whether the fix is to change the DT or to
port Sudeep's/Ionela's patches.
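
To illustrate why the pull never happened with the numbers quoted earlier in
the thread (env->imbalance in the 150+ range, task utilization in the 350+
range), here is a simplified Python model, not kernel code. It assumes that
with migrate_util a candidate task is skipped when its utilization exceeds the
remaining imbalance, and that each detached task reduces the imbalance. The
function name and the concrete values are made up for illustration.

```python
def pick_tasks_migrate_util(task_utils, imbalance):
    """Toy model: return the utilizations of the tasks that could be detached.

    Assumption: a task is skipped when its utilization is larger than the
    remaining imbalance, and each detached task reduces the imbalance.
    """
    pulled = []
    for util in task_utils:
        if util > imbalance:
            continue  # too big for the remaining imbalance, try the next task
        pulled.append(util)
        imbalance -= util
        if imbalance <= 0:
            break  # imbalance fully covered, stop pulling
    return pulled


# Two tasks stuck on one mid CPU, both larger than the computed imbalance:
# no task qualifies, so the mostly idle little CPU stays idle.
print(pick_tasks_migrate_util([350, 360], imbalance=150))

# For contrast, a small task would fit under the same imbalance.
print(pick_tasks_migrate_util([350, 100], imbalance=150))
```

Under this model the first call returns an empty list, which matches the
"impossible migrate_util" situation from the subject line.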

I did some digging and I think the DT, like the ones in mainline by the look of
it, stayed the way it was historically defined.

So IIUC the impact is on systems that predate the simplified EM (which should
have been phased out AFAIK), and on the different presentation of the sysfs
topology, which can potentially break userspace deps, right? I think this is
not a problem either, but these can be famous last words as usual :-)
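
For reference, a minimal sketch of the kind of userspace logic that could
depend on this: it decodes the package_cpus hex masks quoted above into CPU
lists and groups CPUs by distinct mask. The helper names are hypothetical, and
the mask values are copied from Dietmar's Juno output rather than read from
sysfs.

```python
def mask_to_cpus(hex_mask):
    """Return the CPU ids whose bits are set in a hex cpumask string."""
    # sysfs may split wide masks with commas, e.g. "00000000,0000003f".
    value = int(hex_mask.strip().replace(",", ""), 16)
    return [cpu for cpu in range(value.bit_length()) if (value >> cpu) & 1]


def distinct_groups(masks):
    """Return the sorted list of distinct CPU groups described by the masks."""
    return sorted({tuple(mask_to_cpus(m)) for m in masks})


# Values copied from the package_cpus output quoted in this thread.
v5_9 = ["39", "06", "06", "39", "39", "39"]
tip = ["3f", "3f", "3f", "3f", "3f", "3f"]

print(distinct_groups(v5_9))  # two groups: the uArch boundary is visible
print(distinct_groups(tip))   # one group: the boundary is gone
```

With the v5.9 masks this gives the groups (0, 3, 4, 5) and (1, 2), while the
tip/sched/core masks collapse into a single group of CPUs 0-5, so anything
inferring uArch boundaries this way would stop working.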


Thanks

--
Qais Yousef