linux-kernel - [PATCH] sched: topology: make cache topology separate from cpu topology

Message-ID: <SL2PR06MB3082FA912CBF7F0DCF758AC8BDE69@SL2PR06MB3082.apcprd06.prod.outlook.com>
Date:   Thu, 7 Apr 2022 02:31:22 +0000
From:   王擎 <wangqing@...o.com>
To:     Vincent Guittot <vincent.guittot@...aro.org>
CC:     Catalin Marinas <catalin.marinas@....com>,
        Will Deacon <will@...nel.org>,
        Sudeep Holla <sudeep.holla@....com>,
        Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
        "Rafael J. Wysocki" <rafael@...nel.org>,
        Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Juri Lelli <juri.lelli@...hat.com>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
        Daniel Bristot de Oliveira <bristot@...hat.com>,
        "linux-arm-kernel@...ts.infradead.org" 
        <linux-arm-kernel@...ts.infradead.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: [PATCH] sched: topology: make cache topology separate from cpu
 topology


>> >> >> >> >> >On Thu, 10 Mar 2022 at 13:59, Qing Wang <wangqing@...o.com> wrote:
>> >> >> >> >> >>
>> >> >> >> >> >> From: Wang Qing <wangqing@...o.com>
>> >> >> >> >> >>
>> >> >> >> >> >> On some architectures (e.g. ARM64), caches are implemented as below:
>> >> >> >> >> >> cluster:               ****** cluster 0 ******    ****** cluster 1 ******
>> >> >> >> >> >> core:                  0     1     2     3        4     5     6     7
>> >> >> >> >> (add cache level 1)       c0    c1    c2    c3       c4    c5    c6    c7
>> >> >> >> >> >> cache(Level n):        **cache0**  **cache1**     **cache2**  **cache3**
>> >> >> >> >> (add cache level 3)       ************* shared level 3 cache **************
>> >> >> >> >> >> sd_llc_id(current):    0     0     0     0        4     4     4     4
>> >> >> >> >> >> sd_llc_id(should be):  0     0     2     2        4     4     6     6
>> >> >> >> >> >>
>> >> >> >> >> Here, n is always 2 on ARM64, but other values are also possible.
>> >> >> >> >> core[0,1] form a complex (ARMv9) that shares an L2 cache; core[2,3] is the same.
>> >> >> >> >>
>> >> >> >> >> >> Caches and CPUs have different topologies. This causes cpus_share_cache()
>> >> >> >> >> >> to return the wrong value, which affects CPU load balancing.
>> >> >> >> >> >>
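
To make the mismatch concrete, here is a minimal userspace sketch in Python.
It assumes, as the thread implies, that cpus_share_cache() reduces to
comparing the two CPUs' sd_llc_id values. The two lists are copied from the
diagram above; the list names are hypothetical, and this is a model, not
kernel code.

```python
# Userspace model only; the kernel keeps sd_llc_id as a per-CPU variable.
# Values are copied from the diagram: index = CPU number, value = sd_llc_id.
SD_LLC_ID_CURRENT = [0, 0, 0, 0, 4, 4, 4, 4]  # LLC domain follows MC [0-3] [4-7]
SD_LLC_ID_WANTED = [0, 0, 2, 2, 4, 4, 6, 6]   # LLC domain follows the L2 pairs


def cpus_share_cache(sd_llc_id, this_cpu, that_cpu):
    """Model of the check: two CPUs share cache iff their sd_llc_id match."""
    return sd_llc_id[this_cpu] == sd_llc_id[that_cpu]


if __name__ == "__main__":
    # Compare the answer under both tables for a few CPU pairs.
    for a, b in [(0, 1), (1, 2), (3, 4)]:
        print(a, b,
              cpus_share_cache(SD_LLC_ID_CURRENT, a, b),
              cpus_share_cache(SD_LLC_ID_WANTED, a, b))
```

For CPUs 1 and 2 the current table answers "shared" although they sit behind
different L2 caches, which is the wrong value the patch description refers to.
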
>> >> >> >> >> >What does your current scheduler topology  look like?
>> >> >> >> >> >
>> >> >> >> >> >For CPU 0 to 3, do you have the below ?
>> >> >> >> >> >DIE [0     -     3] [4-7]
>> >> >> >> >> >MC  [0] [1] [2] [3]
>> >> >> >> >>
>> >> >> >> >> The current scheduler topology is consistent with the CPU topology:
>> >> >> >> >> DIE  [0-7]
>> >> >> >> >> MC  [0-3] [4-7]  (SD_SHARE_PKG_RESOURCES)
>> >> >> >> >> Most Android phones have this topology.
>> >> >> >> >> >
>> >> >> >> >> >But you would like something like below for cpu 0-1 instead ?
>> >> >> >> >> >DIE [0     -     3] [4-7]
>> >> >> >> >> >CLS [0 - 1] [2 - 3]
>> >> >> >> >> >MC  [0] [1]
>> >> >> >> >> >
>> >> >> >> >> >with SD_SHARE_PKG_RESOURCES only set to MC level ?
>> >> >> >> >>
>> >> >> >> >> We don't change the current scheduler topology, but the
>> >> >> >> >> cache topology should be separated like below:
>> >> >> >> >
>> >> >> >> >The scheduler topology is not only the cpu topology but a mix of cpu and
>> >> >> >> >cache/memory topology
>> >> >> >> >
>> >> >> >> >> [0-7]                          (shared level 3 cache )
>> >> >> >> >> [0-1] [2-3][4-5][6-7]   (shared level 2 cache )
>> >> >> >> >
>> >> >> >> >So you don't bother with the intermediate cluster level, which is even simpler.
>> >> >> >> >You have to modify the generic arch topology so that cpu_coregroup_mask
>> >> >> >> >returns the correct cpu mask directly.
>> >> >> >> >
>> >> >> >> >You will notice an llc_sibling field that is currently used by ACPI but
>> >> >> >> >not DT to return the llc cpu mask
>> >> >> >> >
>> >> >> >> cpu_topology[].llc_sibling describes the last level cache of the whole system,
>> >> >> >> not the one within the sched_domain.
>> >> >> >>
>> >> >> >> In the above cache topology, llc_sibling is 0xff ([0-7]); it describes
>> >> >> >
>> >> >> >If llc_sibling was 0xff ([0-7]) on your system, you would have only one level:
>> >> >> >MC[0-7]
>> >> >>
>> >> >> Sorry, but I don't get it: why does llc_sibling being 0xff ([0-7]) mean MC[0-7]?
>> >> >> On our system (Android), llc_sibling is indeed 0xff ([0-7]), because all CPUs
>> >> >> share the LLC (L3), but we still have two levels:
>> >> >> DIE [0-7]
>> >> >> MC [0-3][4-7]
>> >> >> That makes sense: [0-3] are little cores and [4-7] are big cores, so we only
>> >> >> up-migrate on misfit. We won't change it.
>> >> >>
>> >> >> >
>> >> >> >> the L3 cache sibling, but sd_llc_id describes the maximum shared cache
>> >> >> >> in sd, which should be [0-1] instead of [0-3].
>> >> >> >
>> >> >> >sd_llc_id describes the last sched_domain with SD_SHARE_PKG_RESOURCES.
>> >> >> >If you want llc to be [0-3] make sure that the
>> >> >> >sched_domain_topology_level array returns the correct cpumask with
>> >> >> >this flag
>> >> >>
>> >> >> Actually, we want sd_llc to be [0-1] [2-3], but if the MC domain doesn't have
>> >> >
>> >> >sd_llc_id refers to a scheduler domain but your patch breaks this so
>> >> >if you want a llc that reflects this topo:  [0-1] [2-3] you must
>> >> >provide a sched_domain level with this topo
>> >>
>> >> Maybe we should add a shared-cache level(SC), like what CLS does:
>> >>
>> >> DIE  [0-7] (shared level 3 cache, SD_SHARE_PKG_RESOURCES)
>> >> MC  [0-3] [4-7]  (not SD_SHARE_PKG_RESOURCES)
>> >> CLS  (if necessary)
>> >> SC    [0-1][2-3][4-5][6-7] (shared level 2 cache, SD_SHARE_PKG_RESOURCES)
>> >> SMT (if necessary)
>> >>
>> >> SC means a couple of CPUs which are placed closely by sharing
>> >> mid-level caches, but not enough to be a cluster.
>> >
>> >what you name SC above looks the same as CLS which should not be mixed
>> >with Arm cluster terminology
>>
>> Do you mean a cluster is equal to a shared cache rather than containing one? SC just
>> means shared cache; it does not form a cluster, and a CLS can contain many SCs.
>
>CLS in the scheduler topology is not strictly tied to the "Arm
>cluster" but it's the generic name to describe an intermediate group
>of CPUs with common properties. CLS is also used by some intel
>platforms as an example. What I mean is that you can use the scheduler
>CLS level to describe what you call an Arm SC level.

It won't work, because cluster_sibling is assigned according to cluster_id,
which is strictly tied to the "Arm cluster".
And if CLS is already used to describe the cluster sd, how do we describe a
shared-cache sd, such as a complex, which shares its own cache within a cluster?
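
A small userspace sketch in Python of that objection, assuming sibling masks
are built by grouping CPUs with equal ids. The CLUSTER_ID values follow the
two-cluster diagram earlier in the thread. L2_ID is a purely hypothetical
per-CPU id for the shared L2 and does not exist in the kernel; it only shows
what a separate shared-cache grouping would need.

```python
# Hypothetical per-CPU ids for the 8-CPU example from the thread.
CLUSTER_ID = [0, 0, 0, 0, 1, 1, 1, 1]  # two Arm clusters of four CPUs
L2_ID = [0, 0, 1, 1, 2, 2, 3, 3]       # assumed id of the L2 each CPU sits behind


def sibling_mask(ids, cpu):
    """Bitmask of all CPUs whose id equals the id of the given CPU."""
    return sum(1 << other for other, i in enumerate(ids) if i == ids[cpu])


if __name__ == "__main__":
    print([hex(sibling_mask(CLUSTER_ID, c)) for c in range(8)])
    print([hex(sibling_mask(L2_ID, c)) for c in range(8)])
```

A mask derived from cluster_id can only ever yield [0-3] and [4-7] here, so
the [0-1] [2-3] pairs need a second source of grouping information.
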
>
>>
>> If as you said, SC looks the same as CLS, should we rename CLS to SC to
>> avoid confusion?
>
>CLS is a generic scheduler name and I don't think that we need to
>rename it to an Arm-specific label

I still think an SC level should be added within CLS, because CLS may
already be in use to describe the cluster sd. Please consider it.

Thanks,
Wang

>
>Thanks,
>Vincent
>
>>
>> Thanks,
>> Wang