linux-kernel - Re: [RFC 0/2] Add RISC-V cpu topology

Message-ID: <abcae019e6ae749b17ecb0c721fd3177@mailhost.ics.forth.gr>
Date:   Wed, 07 Nov 2018 04:31:34 +0200
From:   Nick Kossifidis <mick@....forth.gr>
To:     Mark Rutland <mark.rutland@....com>,
        Sudeep Holla <sudeep.holla@....com>
Cc:     Nick Kossifidis <mick@....forth.gr>,
        Atish Patra <atish.patra@....com>,
        linux-riscv@...ts.infradead.org, devicetree@...r.kernel.org,
        Damien.LeMoal@....com, alankao@...estech.com, zong@...estech.com,
        anup@...infault.org, palmer@...ive.com,
        linux-kernel@...r.kernel.org, hch@...radead.org,
        robh+dt@...nel.org, tglx@...utronix.de
Subject: Re: [RFC 0/2] Add RISC-V cpu topology

On 2018-11-06 18:20, Mark Rutland wrote:
> On Tue, Nov 06, 2018 at 05:26:01PM +0200, Nick Kossifidis wrote:
>> On 2018-11-06 16:13, Sudeep Holla wrote:
>> > On Fri, Nov 02, 2018 at 08:58:39PM +0200, Nick Kossifidis wrote:
>> > > On 2018-11-02 01:04, Atish Patra wrote:
>> > > > This patch series adds the cpu topology for RISC-V. It contains
>> > > > both the DT binding and actual source code. It has been tested on
>> > > > QEMU & Unleashed board.
>> > > >
>> > > > The idea is based on cpu-map in ARM with changes related to how
>> > > > we define SMT systems. The reason for adopting a similar approach
>> > > > to ARM as I feel it provides a very clear way of defining the
>> > > > topology compared to parsing cache nodes to figure out which cpus
>> > > > share the same package or core.  I am open to any other idea to
>> > > > implement cpu-topology as well.
>> > >
>> > > I was also about to start a discussion about CPU topology on RISC-V
>> > > after the last swtools group meeting. The goal is to provide the
>> > > scheduler with hints on how to distribute tasks more efficiently
>> > > between harts, by populating the scheduling domain topology levels
>> > > (https://elixir.bootlin.com/linux/v4.19/ident/sched_domain_topology_level).
>> > > What we want to do is define cpu groups and assign them to
>> > > scheduling domains with the appropriate SD_ flags
>> > > (https://github.com/torvalds/linux/blob/master/include/linux/sched/topology.h#L16).
>> >
>> > OK are we defining a CPU topology binding for Linux scheduler ?
>> > NACK for all the approaches that assumes any knowledge of OS scheduler.
>> 
>> Is there any standard regarding CPU topology on the device tree spec ?
>> As far as I know there is none. We are talking about a Linux-specific
>> Device Tree binding so I don't see why defining a binding for the
>> Linux scheduler is out of scope.
> 
> Speaking as a DT binding maintainer, please avoid OS-specific DT
> bindings wherever possible.
> 
> While DT bindings live in the kernel tree, they are not intended to be
> Linux-specific, and other OSs (e.g. FreeBSD, zephyr) are aiming to
> support the same bindings.
> 
> In general, targeting a specific OS is a bad idea, because the
> implementation details of that OS change over time, or the bindings end
> up being tailored to one specific use-case. Exposing details to the OS
> such that the OS can make decisions at runtime is typically better.
> 
>> Do you have cpu-map on other OSes as well ?
> 
> There is nothing OS-specific about cpu-map, and it may be of use to
> other OSs.
> 
>> > > So the cores that belong to a scheduling domain may share:
>> > > CPU capacity (SD_SHARE_CPUCAPACITY / SD_ASYM_CPUCAPACITY)
>> > > Package resources -e.g. caches, units etc- (SD_SHARE_PKG_RESOURCES)
>> > > Power domain (SD_SHARE_POWERDOMAIN)
>> > >
>> >
>> > Too Linux kernel/scheduler specific to be part of $subject
>> 
>> All lists on the cc list are Linux specific, again I don't see your
>> point here are we talking about defining a standard CPU topology
>> scheme for the device tree spec or a Linux-specific CPU topology
>> binding such as cpu-map ?
> 
> The cpu-map binding is not intended to be Linux specific, and avoids
> Linux-specific terminology.
> 
> While the cpu-map binding documentation is in the Linux source tree, the
> binding itself is not intended to be Linux-specific, and it deliberately
> avoids Linux implementation details.
> 
>> Even on this case your point is not valid, the information of two
>> harts sharing a common power domain or having the same or not
>> capacity/max frequency (or maybe capabilities/extensions in the
>> future), is not Linux specific. I just used the Linux specific macros
>> used by the Linux scheduler to point out the code path.  Even on other
>> OSes we still need a way to include this information on the CPU
>> topology, and currently cpu-map doesn't. Also the Linux implementation
>> of cpu-map ignores multiple levels of shared resources, we only get
>> one level for SMT and one level for MC last time I checked.
> 
> Given clusters can be nested, as in the very first example, I don't see
> what prevents multiple levels of shared resources.
> 
> Can you please give an example of the topology you're considering? Does
> that share some resources across clusters at some level?
> 
> We are certainly open to improving the cpu-map binding.
> 
> Thanks,
> Mark.

Mark and Sudeep, thanks a lot for your feedback. You convinced me that
having a device tree binding for the scheduler is not the correct
approach. The scheduler is not a device after all, and I agree that the
device tree shouldn't become an OS configuration file.

Regarding multiple levels of shared resources, my point is that since
cpu-map doesn't contain any information about what is shared among the
cluster/core members, it's not easy to do any further translation. Last
time I checked the arm code that uses cpu-map, it only defined one domain
for SMT and one for MC, and everything else was ignored. No matter how
many clusters have been defined, anything above the core level is treated
the same (and then I guess you started talking about adding "packages" on
the representation side).

The reason I proposed a binding for the scheduler directly is not only
that it's simpler and closer to what really happens in the code; it also
makes more sense to me than the combination of cpu-map with all the
related mappings, e.g. for numa, caches, power domains etc.

However, you are right that we could augment cpu-map to support what I'm
describing and clean things up. Since you are open to improving it, here
is a proposal that I hope you find interesting:

First, let's get rid of the <thread> nodes; they don't make sense:

thread0 {
  cpu = <&CPU0>;
};

A thread node can't have more than one cpu entry, and any properties
should be on the cpu node itself, so it can't add any information.
We could just have an array of cpu nodes on the <core> node, which is
much cleaner:

core0 {
  members = <&CPU0>, <&CPU1>;
};

Then let's allow the cluster and core nodes to accept attributes that are
common to the cpus they contain. Right now this is considered invalid.

For power domains we have a generic binding described in
Documentation/devicetree/bindings/power/power_domain.txt
which basically says that we need to put a
power-domains = <power domain specifiers> attribute on each of the cpu
nodes.

The same happens with the capacity binding specified for arm in
Documentation/devicetree/bindings/arm/cpu-capacity.txt
which says we should add capacity-dmips-mhz on each of the cpu nodes.

The same also happens with the generic numa binding in
Documentation/devicetree/bindings/numa.txt
which says we should add numa-node-id on each of the cpu nodes.

We could allow these attributes to exist on cluster and core nodes as
well, so that we can represent their properties better. It shouldn't be
a big deal and it can be done in a backwards-compatible way (if we don't
find them on the cpu node, climb up the topology hierarchy until we
either find them or run out of nodes). All I'm saying is that I prefer
this:

cpus {
  cpu@0 {
   ...
  };
  cpu@1 {
   ...
  };
  cpu@2 {
   ...
  };
  cpu@3 {
   ...
  };
};


cluster0 {
  cluster0 {
   core0 {
    power-domains = <&pdc 0>;
    numa-node-id = <0>;
    capacity-dmips-mhz = <578>;
    members = <&cpu0>, <&cpu1>;
   };
  };
  cluster1 {
   capacity-dmips-mhz = <1024>;
   core0 {
    power-domains = <&pdc 1>;
    numa-node-id = <1>;
    members = <&cpu2>;
   };
   core1 {
    power-domains = <&pdc 2>;
    numa-node-id = <2>;
    members = <&cpu3>;
   };
  };
};

over this:

cpus {
  cpu@0 {
   ...
   power-domains = <&pdc 0>;
   capacity-dmips-mhz = <578>;
   numa-node-id = <0>;
   ...
  };
  cpu@1 {
   ...
   power-domains = <&pdc 0>;
   capacity-dmips-mhz = <578>;
   numa-node-id = <0>;
   ...
  };
  cpu@2 {
   ...
   power-domains = <&pdc 1>;
   capacity-dmips-mhz = <1024>;
   numa-node-id = <1>;
   ...
  };
  cpu@3 {
   ...
   power-domains = <&pdc 2>;
   capacity-dmips-mhz = <1024>;
   numa-node-id = <2>;
   ...
  };
};


cluster0 {
  cluster0 {
   core0 {
    members = <&cpu0>, <&cpu1>;
   };
  };
  cluster1 {
   core0 {
    members = <&cpu2>;
   };
  };
  cluster2 {
   core0 {
    members = <&cpu3>;
   };
  };
};
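
The backwards-compatible lookup described above (use the property on the
cpu node if present, otherwise climb the topology until it is found or
the root is passed) could be sketched roughly as follows. This is only an
illustration in Python, not kernel code: the Node class and the
lookup_inherited() helper are hypothetical names, and the tree is a
hand-built stand-in for a parsed device tree.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class Node:
    """Hypothetical stand-in for a topology (cluster/core) node."""
    name: str
    props: Dict[str, Any] = field(default_factory=dict)
    parent: Optional["Node"] = None


def lookup_inherited(cpu_props: Dict[str, Any],
                     topo_node: Optional[Node],
                     prop: str) -> Any:
    """Return prop from the cpu node if it is there, otherwise from the
    nearest enclosing topology node, otherwise None."""
    if prop in cpu_props:       # existing bindings: the cpu node wins
        return cpu_props[prop]
    node = topo_node
    while node is not None:     # climb core -> cluster -> parent cluster
        if prop in node.props:
            return node.props[prop]
        node = node.parent
    return None


# Mirrors the preferred layout: cluster1 carries the capacity,
# core0 carries the numa node id.
root = Node("cluster0")
cluster1 = Node("cluster1", {"capacity-dmips-mhz": 1024}, parent=root)
core0 = Node("core0", {"numa-node-id": 1}, parent=cluster1)

capacity = lookup_inherited({}, core0, "capacity-dmips-mhz")
numa_id = lookup_inherited({}, core0, "numa-node-id")
override = lookup_inherited({"capacity-dmips-mhz": 578}, core0,
                            "capacity-dmips-mhz")
missing = lookup_inherited({}, core0, "power-domains")
```

Checking the cpu node first is what keeps existing device trees working:
a tree that still sets the property per cpu never reaches the climb.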


When it comes to shared resources, the standard dt mappings we have are
for caches and are part of the devicetree specification (coming from the
PowerPC ePAPR standard, I think). The example below comes from the HiFive
Unleashed device tree (U540Config.dts), which follows the spec:

cpus {
  cpu@1 {
   ...
   next-level-cache = <&L24 &L0>;
   ...
  };
  cpu@2 {
   ...
   next-level-cache = <&L24 &L0>;
   ...
  };
  cpu@3 {
   ...
   next-level-cache = <&L24 &L0>;
   ...
  };
  cpu@4 {
   ...
   next-level-cache = <&L24 &L0>;
   ...
  };
};

L2: soc {
  L0: cache-controller@...0000 {
   cache-block-size = <64>;
   cache-level = <2>;
   cache-sets = <2048>;
   cache-size = <2097152>;
   cache-unified;
   compatible = "sifive,ccache0", "cache";
   ...
  };
};

Note that the cache-controller node that's common to the 4 cores can
exist anywhere BUT the cluster node! However, it's a property of the
cluster. A quick search through the tree got me r8a77980.dtsi, which
defines the cache on the cpus node, and I'm sure there are other similar
cases. Wouldn't this be better?

cluster0 {
  core0 {
   cache-controller@...0000 {
    cache-block-size = <64>;
    cache-level = <2>;
    cache-sets = <2048>;
    cache-size = <2097152>;
    cache-unified;
    compatible = "sifive,ccache0", "cache";
    ...
   };
   members = <&cpu0>, <&cpu1>, <&cpu2>, <&cpu3>;
  };
};

We could even remove next-level-cache from the cpu nodes and infer it
from the topology (search the topology upwards until we reach a node
that's "cache"-compatible); again, we can make this backwards-compatible.
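
A rough sketch of that upward search, again in Python and only as an
illustration: Node, is_cache() and find_next_level_cache() are
hypothetical names, and the cache node is written without its unit
address because the address is elided in the listing above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass(eq=False)
class Node:
    """Hypothetical stand-in for a node of the topology tree."""
    name: str
    props: Dict[str, Any] = field(default_factory=dict)
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = None

    def add(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child


def is_cache(node: Node) -> bool:
    """A node counts as a cache if "cache" is in its compatible list."""
    return "cache" in node.props.get("compatible", [])


def find_next_level_cache(topo_node: Optional[Node]) -> Optional[Node]:
    """Walk from a core/cluster node towards the root and return the
    first cache node attached to any level on the way, or None."""
    node = topo_node
    while node is not None:
        for child in node.children:
            if is_cache(child):
                return child
        node = node.parent
    return None


cluster0 = Node("cluster0")
l2 = cluster0.add(Node("cache-controller", {
    "compatible": ["sifive,ccache0", "cache"],
    "cache-level": 2,
}))
core0 = cluster0.add(Node("core0", {"members": ["cpu0"]}))

found = find_next_level_cache(core0)
```

For backwards compatibility a real parser would presumably honour an
explicit next-level-cache on the cpu node first and only fall back to
this walk when the property is absent.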


Finally, from the examples above I'd like to stress that the distinction
between a cluster and a core doesn't make much sense, and it also makes
the representation more complicated. To begin with, what would you call
the setup on HiFive Unleashed? A cluster of 4 cores that share the same
L2 cache? One core with 4 harts that share the same L2 cache? We could
represent it like this instead:

cluster0 {
  cache-controller@...0000 {
   cache-block-size = <64>;
   cache-level = <2>;
   cache-sets = <2048>;
   cache-size = <2097152>;
   cache-unified;
   compatible = "sifive,ccache0", "cache";
   ...
  };
  core0 {
   members = <&cpu0>;
  };
  core1 {
   members = <&cpu1>;
  };
  core2 {
   members = <&cpu2>;
  };
  core3 {
   members = <&cpu3>;
  };
};

We could e.g. keep only cluster nodes and allow them to contain either an
array of harts or other cluster sub-nodes, plus optionally a set of
attributes common to the members/sub-nodes of the cluster. This way the
first example becomes:

cluster0 {
  cluster0 {
   power-domains = <&pdc 0>;
   numa-node-id = <0>;
   capacity-dmips-mhz = <578>;
   members = <&cpu0>, <&cpu1>;
  };
  cluster1 {
   capacity-dmips-mhz = <1024>;
   cluster0 {
    power-domains = <&pdc 1>;
    numa-node-id = <1>;
    members = <&cpu2>;
   };
   cluster1 {
    power-domains = <&pdc 2>;
    numa-node-id = <2>;
    members = <&cpu3>;
   };
  };
};

and in the second example:

cluster0 {
  cache-controller@...0000 {
   cache-block-size = <64>;
   cache-level = <2>;
   cache-sets = <2048>;
   cache-size = <2097152>;
   cache-unified;
   compatible = "sifive,ccache0", "cache";
   ...
  };
  members = <&cpu0>, <&cpu1>, <&cpu2>, <&cpu3>;
};
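
To illustrate why a cluster-only hierarchy is easy to consume, here is a
small Python sketch that turns nested clusters into hart groups per
nesting level, which is roughly the information a topology parser would
need for multiple levels of sharing. The dict layout and the helper names
harts_of() and groups_by_depth() are assumptions made for this example
only; the "clusters" key stands in for cluster sub-nodes.

```python
from typing import Any, Dict, List

Cluster = Dict[str, Any]


def harts_of(cluster: Cluster) -> List[str]:
    """All harts below a cluster: its own members plus, recursively,
    the harts of its sub-clusters."""
    harts = list(cluster.get("members", []))
    for sub in cluster.get("clusters", []):
        harts.extend(harts_of(sub))
    return harts


def groups_by_depth(cluster: Cluster, depth: int = 0,
                    out: Dict[int, List[List[str]]] = None
                    ) -> Dict[int, List[List[str]]]:
    """Map each nesting depth to the hart groups found at that depth."""
    if out is None:
        out = {}
    out.setdefault(depth, []).append(harts_of(cluster))
    for sub in cluster.get("clusters", []):
        groups_by_depth(sub, depth + 1, out)
    return out


# The first cluster-only example, written as nested dicts.
topology = {
    "clusters": [
        {"capacity-dmips-mhz": 578, "numa-node-id": 0,
         "members": ["cpu0", "cpu1"]},
        {"capacity-dmips-mhz": 1024,
         "clusters": [
             {"numa-node-id": 1, "members": ["cpu2"]},
             {"numa-node-id": 2, "members": ["cpu3"]},
         ]},
    ],
}

groups = groups_by_depth(topology)
```

Each depth then corresponds to one level of sharing, and the attributes
stored on a cluster say what its harts have in common.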


Thank you for your time !

Regards,
Nick