Message-ID: <beda044e-0b91-4359-c6bf-5e34c285fc5c@arm.com>
Date:   Tue, 17 May 2022 11:14:44 +0200
From:   Dietmar Eggemann <dietmar.eggemann@....com>
To:     Sudeep Holla <sudeep.holla@....com>
Cc:     Qing Wang <wangqing@...o.com>,
        Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
        "Rafael J . Wysocki" <rafael@...nel.org>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Morten Rasmussen <morten.rasmussen@....com>,
        linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH] arch_topology: Use llc_id instead of package_id

On 16/05/2022 12:35, Sudeep Holla wrote:
> On Fri, May 13, 2022 at 02:04:29PM +0200, Dietmar Eggemann wrote:
>> On 13/05/2022 13:03, Sudeep Holla wrote:
>>> On Fri, May 13, 2022 at 12:42:00PM +0200, Dietmar Eggemann wrote:
>>>> On 13/05/2022 11:03, Sudeep Holla wrote:
>>>>> On Fri, May 13, 2022 at 10:34:00AM +0200, Dietmar Eggemann wrote:

[...]

>> cat /sys/kernel/debug/sched/domains/cpu0/domain*/name
>> CLS
>> MC
>> ... I skip the NUMA levels
>>
>> # cat /proc/schedstat | awk '{print $1 " " $2 }' | grep ^[cd] | head -5
>> cpu0 0
>> domain0 00000000,00000000,0000000f <--  4 CPUs <-- cluster_id
>> domain1 00000000,00000000,00ffffff <-- 24 CPUs
>>
>> If you use cluster_id for the 1st-level cluster then you cover MC's 24 CPUs
> 
> OK, the way I am looking at this from the CPU topology perspective seems to
> be different from what you are expecting here w.r.t. sched_domains.
> 
> IMO, these cpumasks on their own must represent the real CPU topology; how
> they are used via cpu_*_mask by the scheduler domains is a different matter.

I'm afraid that if we want to change the mapping of scheduler (topology)
flags (like SD_SHARE_PKG_RESOURCES) via the `*_flags()` callbacks of
default_topology[], we would have to consider all the ACPI corner cases
(see cpu_coregroup_mask()) as well.

See (1) further down.
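
For reference, this is roughly where those flags come from (paraphrased
from kernel/sched/topology.c and include/linux/sched/topology.h; the
exact contents depend on config options and kernel version):

static inline int cpu_core_flags(void)
{
        return SD_SHARE_PKG_RESOURCES;  /* attached to the MC level */
}

static struct sched_domain_topology_level default_topology[] = {
#ifdef CONFIG_SCHED_SMT
        { cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
#endif
#ifdef CONFIG_SCHED_CLUSTER
        { cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) },
#endif
#ifdef CONFIG_SCHED_MC
        { cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
#endif
        { cpu_cpu_mask, SD_INIT_NAME(DIE) },
        { NULL, },
};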

[...]

> On Juno with SCHED_CLUSTER and cluster masks set, I see CLS and MC
> domains.

Right, but that's not really correct from the scheduler's POV.

Juno has LLCs covering cpumasks [0,3-5] and [1-2], not [0-5]. The most
important piece of information is the highest sched domain carrying the
SD_SHARE_PKG_RESOURCES flag, which is the MC level (cpu_core_flags() in
default_topology[]). So the scheduler would think that [0-5] are sharing
an LLC.

You have LLC at:

cat /sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list
                                       ^^^^^^
0,3-5

but the scheduler sees the highest SD_SHARE_PKG_RESOURCES on MC:

cat /sys/kernel/debug/sched/domains/cpu0/domain1/flags
                                         ^^^^^^^
... SD_SHARE_PKG_RESOURCES ...

[...]

>> For one level (MC) yes, but not for two (MC and CLS). And cluster_id was
>> introduced for the 2nd level.
>>
> 
> That sounds wrong and not what the ACPI PPTT code says. My series just
> aligns with what is done with ACPI PPTT, IIUC. I need to check again whether
> my understanding differs from what is being done. But the Kunpeng920 example
> aligns with my understanding.

(1) IMHO, as long as we don't consider cache (llc) information in DT, we
can't have the same coverage as ACPI.
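
To illustrate: llc_sibling is built from llc_id in update_siblings_masks(),
and llc_id is currently only set from ACPI (PPTT); with DT it stays at -1,
so there is no llc information to clamp to. Simplified excerpt from
drivers/base/arch_topology.c (details vary by kernel version):

void update_siblings_masks(unsigned int cpuid)
{
        struct cpu_topology *cpu_topo, *cpuid_topo = &cpu_topology[cpuid];
        int cpu;

        for_each_online_cpu(cpu) {
                cpu_topo = &cpu_topology[cpu];

                /* CPUs with the same llc_id end up in each other's llc_sibling */
                if (cpuid_topo->llc_id == cpu_topo->llc_id) {
                        cpumask_set_cpu(cpu, &cpuid_topo->llc_sibling);
                        cpumask_set_cpu(cpuid, &cpu_topo->llc_sibling);
                }

                /* ... package/core/cluster/thread sibling masks ... */
        }
}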

Let's take an ACPI !CONFIG_NUMA Kunpeng920 as an example.

# cat /sys/kernel/debug/sched/domains/cpu0/domain*/name
CLS
MC
DIE

# cat /sys/kernel/debug/sched/domains/cpu0/domain*/flags
SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE
  SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
  ^^^^^^^^^^^^^^^^^^^^^^
SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE
  SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
  ^^^^^^^^^^^^^^^^^^^^^^
SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE
  SD_PREFER_SIBLING

cat /proc/schedstat | awk '{print $1 " " $2 }' | grep ^[cd] | head -4
cpu0 0
domain0 00000000,00000000,0000000f
domain1 00000000,00000000,00ffffff <-- (2)
domain2 ffffffff,ffffffff,ffffffff

cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
0
cat /sys/devices/system/cpu/cpu0/topology/core_cpus_list
0
cat /sys/devices/system/cpu/cpu0/topology/cluster_cpus_list
0-3
cat /sys/devices/system/cpu/cpu0/topology/core_siblings_list
0-47
cat /sys/devices/system/cpu/cpu0/topology/package_cpus_list
0-47

The MC mask 00ffffff is not derived from any topology mask but from the
llc (index3) mask:

cat /sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list
                                       ^^^^^^
0-23 <-- (2)
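
The clamping to the llc span is done by cpu_coregroup_mask() in
drivers/base/arch_topology.c; a simplified sketch of the relevant part
(the exact code differs between kernel versions):

const struct cpumask *cpu_coregroup_mask(int cpu)
{
        const cpumask_t *core_mask = cpumask_of_node(cpu_to_node(cpu));

        /* find the smaller of NUMA, core or LLC siblings */
        if (cpumask_subset(&cpu_topology[cpu].core_sibling, core_mask))
                core_mask = &cpu_topology[cpu].core_sibling;

        /* with llc info (ACPI PPTT) the MC span is narrowed to the LLC */
        if (cpu_topology[cpu].llc_id != -1) {
                if (cpumask_subset(&cpu_topology[cpu].llc_sibling, core_mask))
                        core_mask = &cpu_topology[cpu].llc_sibling;
        }

        return core_mask;
}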


Coming back to the original request (the support of Armv9 L2 complexes
in Android) from Qing on systems like QC SM8450:

      .---------------.
CPU   |0 1 2 3 4 5 6 7|
      +---------------+
uarch |l l l l m m m b| (so called tri-gear: little, medium, big)
      +---------------+
  L2  |   |   | | | | |
      +---------------+
  L3  |<--         -->|
      +---------------+
      |<-- cluster -->|
      +---------------+
      |<--   DSU   -->|
      '---------------'

This still wouldn't be possible. We know that the Phantom Domain (grouping
by uarch) is not supported in mainline but is heavily used in Android
(legacy deps).

If we say we only support L2 sharing (asymmetric here, since only CPU0-3
have it!) and we don't support Phantom, then your approach will work for
such systems.
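
Just to make concrete what "supporting L2 sharing" would mean in scheduler
terms: an extra topology level with its own mask function, installed via
set_sched_topology(). Purely illustrative sketch; cpu_complex_mask() is
hypothetical and would have to return the L2 shared_cpu_map for @cpu:

/*
 * Hypothetical helper: would return the CPUs sharing an L2 with @cpu
 * (e.g. derived from cacheinfo shared_cpu_map). Placeholder body only.
 */
static const struct cpumask *cpu_complex_mask(int cpu)
{
        return &cpu_topology[cpu].cluster_sibling;      /* placeholder */
}

static struct sched_domain_topology_level l2_complex_topology[] = {
        { cpu_complex_mask, cpu_cluster_flags, SD_INIT_NAME(CPLX) },
        { cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
        { cpu_cpu_mask, SD_INIT_NAME(DIE) },
        { NULL, },
};

/* would be installed early during boot: set_sched_topology(l2_complex_topology); */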
