linux-kernel - RE: [PATCH] sched: topology: make cache topology separate from cpu topology

Message-ID: <SL2PR06MB3082ED191BF8892367A3465EBD109@SL2PR06MB3082.apcprd06.prod.outlook.com>
Date:   Tue, 15 Mar 2022 01:58:30 +0000
From:   王擎 <wangqing@...o.com>
To:     Vincent Guittot <vincent.guittot@...aro.org>
CC:     Catalin Marinas <Catalin.Marinas@....com>,
        Will Deacon <will@...nel.org>,
        Sudeep Holla <sudeep.holla@....com>,
        Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
        "Rafael J. Wysocki" <rafael@...nel.org>,
        Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Juri Lelli <juri.lelli@...hat.com>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
        Daniel Bristot de Oliveira <bristot@...hat.com>,
        "linux-arm-kernel@...ts.infradead.org" 
        <linux-arm-kernel@...ts.infradead.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: RE: [PATCH] sched: topology: make cache topology separate from cpu
 topology


>> >> >> >On Thu, 10 Mar 2022 at 13:59, Qing Wang <wangqing@...o.com> wrote:
>> >> >> >>
>> >> >> >> From: Wang Qing <wangqing@...o.com>
>> >> >> >>
>> >> >> >> On some architectures (e.g. ARM64), caches are laid out as below:
>> >> >> >> cluster:                ****** cluster 0 *****    ****** cluster 1 *****
>> >> >> >> core:                   0     1     2     3       4     5     6     7
>> >> >> (add cache level 1)        c0    c1    c2    c3      c4    c5    c6    c7
>> >> >> >> cache(Leveln):          **cache0**  **cache1**    **cache2**  **cache3**
>> >> >> (add cache level 3)        ************* shared level 3 cache *************
>> >> >> >> sd_llc_id(current):     0     0     0     0       4     4     4     4
>> >> >> >> sd_llc_id(should be):   0     0     2     2       4     4     6     6
>> >> >> >>
>> >> >> Here, n is always 2 on ARM64, but other values are also possible.
>> >> >> core[0,1] form a complex (ARMv9) which shares an L2 cache; core[2,3] is the same.
>> >> >>
>> >> >> >> Caches and cpus have different topologies. This causes cpus_share_cache()
>> >> >> >> to return the wrong value, which affects CPU load balancing.
>> >> >> >>
>> >> >> >What does your current scheduler topology  look like?
>> >> >> >
>> >> >> >For CPU 0 to 3, do you have the below ?
>> >> >> >DIE [0     -     3] [4-7]
>> >> >> >MC  [0] [1] [2] [3]
>> >> >>
>> >> >> The current scheduler topology is consistent with the CPU topology:
>> >> >> DIE  [0-7]
>> >> >> MC  [0-3] [4-7]  (SD_SHARE_PKG_RESOURCES)
>> >> >> Most Android phones have this topology.
>> >> >> >
>> >> >> >But you would like something like below for cpu 0-1 instead ?
>> >> >> >DIE [0     -     3] [4-7]
>> >> >> >CLS [0 - 1] [2 - 3]
>> >> >> >MC  [0] [1]
>> >> >> >
>> >> >> >with SD_SHARE_PKG_RESOURCES only set to MC level ?
>> >> >>
>> >> >> We don't change the current scheduler topology, but the
>> >> >> cache topology should be separated like below:
>> >> >
>> >> >The scheduler topology is not only the cpu topology but a mix of cpu and
>> >> >cache/memory topology
>> >> >
>> >> >> [0-7]                          (shared level 3 cache )
>> >> >> [0-1] [2-3][4-5][6-7]   (shared level 2 cache )
>> >> >
>> >> >So you don't bother with the intermediate cluster level, which is even simpler.
>> >> >You have to modify the generic arch topology so that cpu_coregroup_mask
>> >> >returns the correct cpu mask directly.
>> >> >
>> >> >You will notice a llc_sibling field that is currently used by acpi but
>> >> >not DT to return llc cpu mask
>> >> >
>> >> cpu_topology[].llc_sibling describes the last level cache of the whole system,
>> >> not of the sched_domain.
>> >>
>> >> in the above cache topology, llc_sibling is 0xff([0-7]) , it describes
>> >
>> >If llc_sibling was 0xff([0-7]) on your system, you would have only one level:
>> >MC[0-7]
>>
>> Sorry, but I don't get it: why does llc_sibling being 0xff([0-7]) mean MC[0-7]?
>> In our system (Android), llc_sibling is indeed 0xff([0-7]), because all CPUs
>> share the LLC (L3), but we also have two levels:
>> DIE [0-7]
>> MC [0-3][4-7]
>> This makes sense: [0-3] are little cores, [4-7] are big cores, so we only
>> up-migrate on misfit. We won't change it.
>>
>> >
>> >> the L3 cache sibling, but sd_llc_id describes the maximum shared cache
>> >> in sd, which should be [0-1] instead of [0-3].
>> >
>> >sd_llc_id describes the last sched_domain with SD_SHARE_PKG_RESOURCES.
>> >If you want llc to be [0-3] make sure that the
>> >sched_domain_topology_level array returns the correct cpumask with
>> >this flag
>>
>> Actually, we want sd_llc to be [0-1] [2-3], but if the MC domain doesn't have the
>
>sd_llc_id refers to a scheduler domain but your patch breaks this so
>if you want a llc that reflects this topo:  [0-1] [2-3] you must
>provide a sched_domain level with this topo

Maybe we should add a shared-cache level (SC), similar to what CLS does:

DIE  [0-7] (shared level 3 cache, SD_SHARE_PKG_RESOURCES)
MC  [0-3] [4-7]  (not SD_SHARE_PKG_RESOURCES)
CLS  (if necessary)
SC    [0-1][2-3][4-5][6-7] (shared level 2 cache, SD_SHARE_PKG_RESOURCES)
SMT (if necessary)

SC means a group of CPUs that are placed close together and share a
mid-level cache, but are not numerous enough to form a cluster.
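
To make the effect of the proposed levels concrete, here is a minimal
userspace sketch (Python, not kernel code). It assumes sd_llc is picked by
walking a CPU's domains from the lowest level upward and stopping at the
first level without SD_SHARE_PKG_RESOURCES, and that sd_llc_id is the first
CPU of that span. The names CURRENT, PROPOSED, span_of and the list layout
are hypothetical and only model the topologies discussed in this thread.

```python
from typing import List, Tuple

# One level: (name, CPU groups at that level, has SD_SHARE_PKG_RESOURCES).
# Levels are listed lowest first. This is a model, not a kernel structure.
Level = Tuple[str, List[range], bool]

# Topology as it is today on the system described in this thread.
CURRENT: List[Level] = [
    ("MC", [range(0, 4), range(4, 8)], True),
    ("DIE", [range(0, 8)], False),
]

# Topology with the proposed SC level added (CLS and SMT omitted).
PROPOSED: List[Level] = [
    ("SC", [range(0, 2), range(2, 4), range(4, 6), range(6, 8)], True),
    ("MC", [range(0, 4), range(4, 8)], False),
    ("DIE", [range(0, 8)], True),
]


def span_of(cpu: int, groups: List[range]) -> range:
    """Return the group of CPUs that contains `cpu` at one level."""
    for group in groups:
        if cpu in group:
            return group
    raise ValueError(f"cpu {cpu} is not covered by this level")


def sd_llc_id(cpu: int, levels: List[Level]) -> int:
    """First CPU of the highest contiguous level carrying the flag.

    Assumption: the walk stops at the first level without the flag, so a
    flagged level above an unflagged one is never reached.
    """
    llc = None
    for _name, groups, shares_cache in levels:
        if not shares_cache:
            break
        llc = span_of(cpu, groups)
    # With no flagged level at all, the CPU is its own LLC domain.
    return llc[0] if llc is not None else cpu


def cpus_share_cache(a: int, b: int, levels: List[Level]) -> bool:
    """Model of the check: two CPUs share cache if their LLC ids match."""
    return sd_llc_id(a, levels) == sd_llc_id(b, levels)


if __name__ == "__main__":
    print([sd_llc_id(c, CURRENT) for c in range(8)])
    print([sd_llc_id(c, PROPOSED) for c in range(8)])
```

Under these assumptions the current topology gives ids 0 0 0 0 4 4 4 4 and
the proposed one gives 0 0 2 2 4 4 6 6, matching the diagram above. Note
that in this model the flag on DIE does not influence sd_llc, because the
walk already stops at the unflagged MC level.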
>
>side question, why don't you want llc to be the L3 one ?

Yes, we should set SD_SHARE_PKG_RESOURCES on DIE, but we also want to
represent the mid-level caches to improve throughput.

Thanks,
Wang
>
>> SD_SHARE_PKG_RESOURCES flag, the sd_llc will be [0][1][2][3]. It's not true.
>
>The only entry point for describing the scheduler domain is the
>sched_domain_topology_level array which provides some cpumask and some
>associated flags. By default, SD_SHARE_PKG_RESOURCES is set for
>scheduler MC level which implies that the cpus shared their cache. If
>this is not the case for your system, you should either remove this
>flag or update the cpumask to reflect which CPUs really share their
>caches.
>
>>
>> So we must separate sd_llc from sd topology, or the demand cannot be
>> met in any case under the existing mechanism.
>
>There is a default array with DIE, MC, CLS and SMT levels with
>SD_SHARE_PKG_RESOURCES set up to MC which is considered to be the LLC
>but a different array than the default one can be provided with
>set_sched_topology()
>
>Thanks
>Vincent
>
>>
>> Thanks,
>> Wang
>>