Message-ID: <52B87149.4010801@arm.com>
Date: Mon, 23 Dec 2013 18:22:17 +0100
From: Dietmar Eggemann <dietmar.eggemann@....com>
To: Vincent Guittot <vincent.guittot@...aro.org>,
"peterz@...radead.org" <peterz@...radead.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
CC: "mingo@...nel.org" <mingo@...nel.org>,
"pjt@...gle.com" <pjt@...gle.com>,
Morten Rasmussen <Morten.Rasmussen@....com>,
"cmetcalf@...era.com" <cmetcalf@...era.com>,
"tony.luck@...el.com" <tony.luck@...el.com>,
"alex.shi@...aro.org" <alex.shi@...aro.org>,
"preeti@...ux.vnet.ibm.com" <preeti@...ux.vnet.ibm.com>,
"linaro-kernel@...ts.linaro.org" <linaro-kernel@...ts.linaro.org>,
"rjw@...k.pl" <rjw@...k.pl>,
"paulmck@...ux.vnet.ibm.com" <paulmck@...ux.vnet.ibm.com>,
"corbet@....net" <corbet@....net>,
"tglx@...utronix.de" <tglx@...utronix.de>,
"len.brown@...el.com" <len.brown@...el.com>,
"arjan@...ux.intel.com" <arjan@...ux.intel.com>,
"amit.kucheria@...aro.org" <amit.kucheria@...aro.org>,
"james.hogan@...tec.com" <james.hogan@...tec.com>,
"schwidefsky@...ibm.com" <schwidefsky@...ibm.com>,
"heiko.carstens@...ibm.com" <heiko.carstens@...ibm.com>
Subject: Re: [RFC] sched: CPU topology try
Hi Vincent,
On 18/12/13 14:13, Vincent Guittot wrote:
> This patch applies on top of the two patches [1][2] that have been proposed by
> Peter for creating a new way to initialize sched_domain. It includes some minor
> compilation fixes and a trial of using this new method on ARM platform.
> [1] https://lkml.org/lkml/2013/11/5/239
> [2] https://lkml.org/lkml/2013/11/5/449
I came up with a similar implementation proposal for an arch-specific
interface for scheduler domain set-up a couple of days ago:
[1] https://lkml.org/lkml/2013/12/13/182
I had the following requirements in mind:
1) The arch should not be able to fine-tune individual scheduler
behaviour, i.e. get rid of the arch-specific SD_FOO_INIT macros.
2) Unify the set-up code for conventional and NUMA scheduler domains.
3) The arch is able to specify additional scheduler domain levels, other
than SMT, MC, BOOK, and CPU.
4) Allow additional topology-related data (e.g. energy information) to
be provided to the scheduler.
Moreover, I think now that:
5) Something like the existing default set-up via default_topology[] is
needed to avoid code duplication for archs not interested in (3) or (4).
I can see the following similarities with your implementation:
1) Move the cpu_foo_mask functions from scheduler to topology. I even
put cpu_smt_mask() and cpu_cpu_mask() into include/linux/topology.h.
2) Use the existing func ptr sched_domain_mask_f to pass the per-cpu cpu
mask from the topology shim-layer to the scheduler.
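To make point 2) concrete, here is a minimal userspace sketch of a table of mask functions walked from the smallest span to the largest. It is only a model: cpumask_model_t, mask_fn, the model_* functions and model_spans() are hypothetical names, and the 16-CPU layout (two threads per core, eight CPUs per cluster) is assumed from the example system in the quoted mail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NR_CPUS 16

/* Userspace stand-in for struct cpumask: one bit per CPU. */
typedef uint32_t cpumask_model_t;

/* Same shape as sched_domain_mask_f, but on the model type. */
typedef cpumask_model_t (*mask_fn)(int cpu);

/* Two hardware threads per core: CPUs 2n and 2n+1 are siblings. */
static cpumask_model_t model_smt_mask(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	return (cpumask_model_t)0x3u << (cpu & ~1);
}

/* Eight CPUs per cluster: 0-7 and 8-15. */
static cpumask_model_t model_mc_mask(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	return (cpumask_model_t)0xffu << (cpu & ~7);
}

/* All CPUs of the single node in this model. */
static cpumask_model_t model_cpu_mask(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	return 0xffffu;
}

struct topo_level_model {
	mask_fn mask;
	const char *name;
};

/* NULL-terminated, smallest span first. */
static const struct topo_level_model model_topology[] = {
	{ model_smt_mask, "SMT" },
	{ model_mc_mask,  "MC"  },
	{ model_cpu_mask, "CPU" },
	{ NULL, NULL },
};

/* Collect the span of every level for one CPU; returns the level count. */
static int model_spans(int cpu, cpumask_model_t *spans, int max)
{
	const struct topo_level_model *tl;
	int n = 0;

	for (tl = model_topology; tl->mask && n < max; tl++)
		spans[n++] = tl->mask(cpu);
	return n;
}
```

The consumer of the table never needs to know how a mask is computed; it only calls through the pointer, which is what keeps the topology side and the scheduler side separable.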
>
> Based on the results of these tests, my feelings about this new way to init
> the sched_domain are mixed.
>
> The good point is that I have been able to create the same sched_domain
> topologies as before and even more complex ones (where a subset of the cores
> in a cluster share their powergating capabilities). I have described various
> topology results below.
>
> I use a system that is made of a dual cluster of quad cores with hyperthreading
> for my examples.
>
> If one cluster (0-7) can powergate its cores independently but not the other
> cluster (8-15) we have the following topology, which is equal to what I had
> previously:
>
> CPU0:
> domain 0: span 0-1 level: SMT
> flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
> groups: 0 1
> domain 1: span 0-7 level: MC
> flags: SD_SHARE_PKG_RESOURCES
> groups: 0-1 2-3 4-5 6-7
> domain 2: span 0-15 level: CPU
> flags:
> groups: 0-7 8-15
>
> CPU8:
> domain 0: span 8-9 level: SMT
> flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
> groups: 8 9
> domain 1: span 8-15 level: MC
> flags: SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
> groups: 8-9 10-11 12-13 14-15
> domain 2: span 0-15 level: CPU
> flags:
> groups: 8-15 0-7
>
> We can even describe some more complex topologies if a subset (2-7) of the
> cluster can't powergate independently:
>
> CPU0:
> domain 0: span 0-1 level: SMT
> flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
> groups: 0 1
> domain 1: span 0-7 level: MC
> flags: SD_SHARE_PKG_RESOURCES
> groups: 0-1 2-7
> domain 2: span 0-15 level: CPU
> flags:
> groups: 0-7 8-15
>
> CPU2:
> domain 0: span 2-3 level: SMT
> flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
> groups: 0 1
> domain 1: span 2-7 level: MC
> flags: SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
> groups: 2-7 4-5 6-7
> domain 2: span 0-7 level: MC
> flags: SD_SHARE_PKG_RESOURCES
> groups: 2-7 0-1
> domain 3: span 0-15 level: CPU
> flags:
> groups: 0-7 8-15
>
> In this case, we have an additional sched_domain MC level for this subset (2-7)
> of cores so we can trigger some load balance in this subset before doing that
> on the complete cluster (which is the last level of cache in my example)
I think the weakest point right now is the condition in sd_init() where
we convert the topology flags into scheduler behaviour. Not only do we
introduce a very tight coupling between topology flags and scheduler
domain level, we also need to follow a certain order in the
initialization. This bit needs more thinking.
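A simplified model of what I mean, not the actual sd_init() code: the flag values are taken from the quoted include/linux/sched.h hunk, while enum sd_behaviour_model, SD_MODEL_TOPOLOGY_FLAGS and sd_behaviour() are hypothetical names used only for illustration. The real function sets several tunables per branch; the model collapses them into one behaviour class.

```c
#include <assert.h>

/* Flag values as listed in the quoted include/linux/sched.h hunk. */
#define SD_SHARE_CPUPOWER      0x0080
#define SD_SHARE_POWERDOMAIN   0x0100
#define SD_SHARE_PKG_RESOURCES 0x0200

#define SD_MODEL_TOPOLOGY_FLAGS \
	(SD_SHARE_CPUPOWER | SD_SHARE_POWERDOMAIN | SD_SHARE_PKG_RESOURCES)

/* Hypothetical behaviour classes standing in for sets of tunables. */
enum sd_behaviour_model {
	SD_BEHAVE_SMT,	/* threads sharing one core */
	SD_BEHAVE_MC,	/* cores sharing package resources */
	SD_BEHAVE_CPU,	/* everything else */
};

/*
 * Model of an if/else-if chain keyed on topology flags. The first
 * matching branch wins, so the order of the tests is part of the
 * contract between the topology table and the scheduler.
 */
static enum sd_behaviour_model sd_behaviour(int sd_flags)
{
	assert((sd_flags & ~SD_MODEL_TOPOLOGY_FLAGS) == 0);

	if (sd_flags & SD_SHARE_CPUPOWER)
		return SD_BEHAVE_SMT;
	else if (sd_flags & SD_SHARE_PKG_RESOURCES)
		return SD_BEHAVE_MC;
	return SD_BEHAVE_CPU;
}
```

In this model the two MC rows of arm_topology[] (with and without SD_SHARE_POWERDOMAIN) end up with the same behaviour, and a level carrying only SD_SHARE_POWERDOMAIN silently falls through to the default branch. Adding a new flag means deciding where in the chain it belongs.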
>
> We can add more levels that will describe other dependency/independency like
> the frequency scaling dependency and as a result the final sched_domain
> topology will have additional levels (if they have not been removed during
> the degenerate sequence)
>
> My concern is about the configuration of the table that is used to create the
> sched_domain. Some levels are "duplicated" with different flags configuration,
> which makes the table not easily readable, and we must also take care of the
> order because a parent has to gather all cpus of its children. So we must
> choose which capabilities will be a subset of the other one. The order is
> almost straightforward when we describe 1 or 2 kinds of capabilities
> (package resource sharing and power sharing) but it can become complex if we
> want to add more.
I'm not sure that the idea of creating a dedicated sched_domain level for
every topology flag representing a specific functionality will scale.
From the perspective of energy-aware scheduling we need e.g. energy
costs (P and C state), which can only be passed to the scheduler
via an additional sub-struct and an additional function arch_sd_energy(),
as depicted in Morten's email:
[2] https://lkml.org/lkml/2013/11/14/102
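Purely as an illustration of why a flag is not enough here, a rough sketch of what such a sub-struct and hook could look like. Everything in it is an assumption: struct sd_energy_model, its field names, the signature of arch_sd_energy_model() and the cost numbers are placeholders and are not taken from Morten's proposal.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sub-struct: field names and units are assumptions. */
struct sd_energy_model {
	unsigned int nr_p_states;
	const unsigned int *p_state_cost;	/* one entry per P-state */
	unsigned int nr_c_states;
	const unsigned int *c_state_cost;	/* one entry per C-state */
};

/* Placeholder numbers, not measurements of any real platform. */
static const unsigned int model_p_cost[] = { 100, 180, 300 };
static const unsigned int model_c_cost[] = { 10, 2 };

static const struct sd_energy_model model_core_energy = {
	.nr_p_states = 3,
	.p_state_cost = model_p_cost,
	.nr_c_states = 2,
	.c_state_cost = model_c_cost,
};

/*
 * Stand-in for an arch hook in the spirit of arch_sd_energy(); the
 * signature is an assumption. Only level 0 carries data in this model,
 * every other level reports "no energy information".
 */
static const struct sd_energy_model *arch_sd_energy_model(int level)
{
	assert(level >= 0);
	return level == 0 ? &model_core_energy : NULL;
}
```

Per-level tables like these cannot be encoded as one more SD_* bit, which is why a separate data path from the arch to the scheduler seems necessary.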
>
> Regards
> Vincent
>
> Signed-off-by: Vincent Guittot <vincent.guittot@...aro.org>
>
> ---
> arch/arm/include/asm/topology.h | 4 ++
> arch/arm/kernel/topology.c | 99 ++++++++++++++++++++++++++++++++++++++-
> include/linux/sched.h | 7 +++
> kernel/sched/core.c | 17 +++----
> 4 files changed, 116 insertions(+), 11 deletions(-)
>
> diff --git a/arch/arm/include/asm/topology.h b/arch/arm/include/asm/topology.h
> index 58b8b84..5102847 100644
> --- a/arch/arm/include/asm/topology.h
> +++ b/arch/arm/include/asm/topology.h
> @@ -5,12 +5,16 @@
>
> #include <linux/cpumask.h>
>
> +#define CPU_CORE_GATE 0x1
> +#define CPU_CLUSTER_GATE 0x2
> +
> struct cputopo_arm {
> int thread_id;
> int core_id;
> int socket_id;
> cpumask_t thread_sibling;
> cpumask_t core_sibling;
> + int flags;
> };
>
> extern struct cputopo_arm cpu_topology[NR_CPUS];
> diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
> index 85a8737..8a2aec6 100644
> --- a/arch/arm/kernel/topology.c
> +++ b/arch/arm/kernel/topology.c
> @@ -24,6 +24,7 @@
>
> #include <asm/cputype.h>
> #include <asm/topology.h>
> +#include <asm/smp_plat.h>
>
> /*
> * cpu power scale management
> @@ -79,6 +80,51 @@ unsigned long *__cpu_capacity;
>
> unsigned long middle_capacity = 1;
>
> +static int __init get_dt_power_topology(struct device_node *topo)
> +{
> + const u32 *reg;
> + int len, power = 0;
> + int flag = CPU_CORE_GATE;
> +
> + for (; topo; topo = of_get_next_parent(topo)) {
> + reg = of_get_property(topo, "power-gate", &len);
> + if (reg && len == 4 && be32_to_cpup(reg))
> + power |= flag;
> + flag <<= 1;
> + }
> +
> + return power;
> +}
> +
> +#define for_each_subnode_with_property(dn, pn, prop_name) \
> + for (dn = of_find_node_with_property(pn, prop_name); dn; \
> + dn = of_find_node_with_property(dn, prop_name))
> +
> +static void __init init_dt_power_topology(void)
> +{
> + struct device_node *cn, *topo;
> +
> + /* Get power domain topology information */
> + cn = of_find_node_by_path("/cpus/cpu-map");
> + if (!cn) {
> + pr_warn("Missing cpu-map node, bailing out\n");
> + return;
> + }
> +
> + for_each_subnode_with_property(topo, cn, "cpu") {
> + struct device_node *cpu;
> +
> + cpu = of_parse_phandle(topo, "cpu", 0);
> + if (cpu) {
> + u32 hwid;
> +
> + of_property_read_u32(cpu, "reg", &hwid);
> + cpu_topology[get_logical_index(hwid)].flags = get_dt_power_topology(topo);
> +
> + }
> + }
> +}
> +
> /*
> * Iterate all CPUs' descriptor in DT and compute the efficiency
> * (as per table_efficiency). Also calculate a middle efficiency
> @@ -151,6 +197,8 @@ static void __init parse_dt_topology(void)
> middle_capacity = ((max_capacity / 3)
> >> (SCHED_POWER_SHIFT-1)) + 1;
>
> + /* Retrieve power topology information from DT */
> + init_dt_power_topology();
> }
>
> /*
> @@ -266,6 +314,52 @@ void store_cpu_topology(unsigned int cpuid)
> cpu_topology[cpuid].socket_id, mpidr);
> }
>
> +#ifdef CONFIG_SCHED_SMT
> +static const struct cpumask *cpu_smt_mask(int cpu)
> +{
> + return topology_thread_cpumask(cpu);
> +}
> +#endif
> +
> +const struct cpumask *cpu_corepower_mask(int cpu)
> +{
> + if (cpu_topology[cpu].flags & CPU_CORE_GATE)
> + return &cpu_topology[cpu].thread_sibling;
> + else
> + return &cpu_topology[cpu].core_sibling;
> +}
> +
> +static const struct cpumask *cpu_cpupower_mask(int cpu)
> +{
> + if (cpu_topology[cpu].flags & CPU_CLUSTER_GATE)
> + return &cpu_topology[cpu].core_sibling;
> + else
> + return cpumask_of_node(cpu_to_node(cpu));
> +}
> +
> +static const struct cpumask *cpu_cpu_mask(int cpu)
> +{
> + return cpumask_of_node(cpu_to_node(cpu));
> +}
> +
> +static struct sched_domain_topology_level arm_topology[] = {
> +#ifdef CONFIG_SCHED_SMT
> + { cpu_smt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN },
> +#endif
> +#ifdef CONFIG_SCHED_MC
> + { cpu_corepower_mask, SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN },
> + { cpu_coregroup_mask, SD_SHARE_PKG_RESOURCES},
> +#endif
> + { cpu_cpupower_mask, SD_SHARE_POWERDOMAIN },
> + { cpu_cpu_mask, },
> + { NULL, },
> +};
> +
> +static int __init arm_sched_topology(void)
> +{
> + sched_domain_topology = arm_topology;
The return statement is missing here; arm_sched_topology() is declared to return int.
> +}
> +
> /*
> * init_cpu_topology is called at boot when only one cpu is running
> * which prevent simultaneous write access to cpu_topology array
> @@ -274,6 +368,9 @@ void __init init_cpu_topology(void)
> {
> unsigned int cpu;
>
> + /* set scheduler topology descriptor */
> + arm_sched_topology();
> +
> /* init core mask and power*/
> for_each_possible_cpu(cpu) {
> struct cputopo_arm *cpu_topo = &(cpu_topology[cpu]);
> @@ -283,7 +380,7 @@ void __init init_cpu_topology(void)
> cpu_topo->socket_id = -1;
> cpumask_clear(&cpu_topo->core_sibling);
> cpumask_clear(&cpu_topo->thread_sibling);
> -
> + cpu_topo->flags = 0;
> set_power_scale(cpu, SCHED_POWER_SCALE);
> }
> smp_wmb();
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 075a325..8cbaebf 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -772,6 +772,7 @@ enum cpu_idle_type {
> #define SD_BALANCE_WAKE 0x0010 /* Balance on wakeup */
> #define SD_WAKE_AFFINE 0x0020 /* Wake task to waking CPU */
> #define SD_SHARE_CPUPOWER 0x0080 /* Domain members share cpu power */
> +#define SD_SHARE_POWERDOMAIN 0x0100 /* Domain members share power domain */
> #define SD_SHARE_PKG_RESOURCES 0x0200 /* Domain members share cpu pkg resources */
> #define SD_SERIALIZE 0x0400 /* Only a single load balancing instance */
> #define SD_ASYM_PACKING 0x0800 /* Place busy groups earlier in the domain */
> @@ -893,6 +894,12 @@ typedef const struct cpumask *(*sched_domain_mask_f)(int cpu);
>
> #define SDTL_OVERLAP 0x01
>
> +struct sd_data {
> + struct sched_domain **__percpu sd;
> + struct sched_group **__percpu sg;
> + struct sched_group_power **__percpu sgp;
> +};
> +
> struct sched_domain_topology_level {
> sched_domain_mask_f mask;
> int sd_flags;
By exporting struct sched_domain_topology_level and struct sd_data in
include/linux/sched.h we're exposing a lot of internal scheduler data.
That's why I came up with a new struct arch_sched_domain_info_t which only
contains the cpu mask func ptr and the integer for the topology flags.
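Roughly, the split could look like the sketch below. It is not the code from my proposal: struct arch_sd_info_model, struct sched_level_model, model_mask() and build_levels() are hypothetical names, and struct cpumask is left opaque so that the snippet builds outside the kernel.

```c
#include <assert.h>
#include <stddef.h>

struct cpumask;	/* opaque in this sketch */
typedef const struct cpumask *(*sched_domain_mask_f)(int cpu);

/* What the arch sees: only a mask function and topology flags. */
struct arch_sd_info_model {
	sched_domain_mask_f mask;
	int flags;
};

/* What the scheduler keeps private (stand-in for per-level data). */
struct sched_level_model {
	sched_domain_mask_f mask;
	int sd_flags;
	void *private_data;
};

/* Placeholder mask function so that a table can be built for testing. */
static const struct cpumask *model_mask(int cpu)
{
	(void)cpu;
	return NULL;
}

/*
 * Copy a NULL-terminated arch table into scheduler-owned levels.
 * Returns the number of levels, or -1 if 'out' is too small.
 */
static int build_levels(const struct arch_sd_info_model *info,
			struct sched_level_model *out, int max)
{
	int n = 0;

	assert(info && out);
	for (; info->mask; info++) {
		if (n == max)
			return -1;
		out[n].mask = info->mask;
		out[n].sd_flags = info->flags;
		out[n].private_data = NULL;
		n++;
	}
	return n;
}
```

With this shape the arch header only needs the small struct, and the layout of the scheduler's per-level data can change without touching arch code.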
Best Regards,
-- Dietmar
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 73658da..8dc2a50 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4680,7 +4680,8 @@ static int sd_degenerate(struct sched_domain *sd)
> SD_BALANCE_FORK |
> SD_BALANCE_EXEC |
> SD_SHARE_CPUPOWER |
> - SD_SHARE_PKG_RESOURCES)) {
> + SD_SHARE_PKG_RESOURCES |
> + SD_SHARE_POWERDOMAIN)) {
> if (sd->groups != sd->groups->next)
> return 0;
> }
> @@ -4711,7 +4712,8 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent)
> SD_BALANCE_EXEC |
> SD_SHARE_CPUPOWER |
> SD_SHARE_PKG_RESOURCES |
> - SD_PREFER_SIBLING);
> + SD_PREFER_SIBLING |
> + SD_SHARE_POWERDOMAIN);
> if (nr_node_ids == 1)
> pflags &= ~SD_SERIALIZE;
> }
> @@ -4978,12 +4980,6 @@ static const struct cpumask *cpu_cpu_mask(int cpu)
> return cpumask_of_node(cpu_to_node(cpu));
> }
>
> -struct sd_data {
> - struct sched_domain **__percpu sd;
> - struct sched_group **__percpu sg;
> - struct sched_group_power **__percpu sgp;
> -};
> -
> struct s_data {
> struct sched_domain ** __percpu sd;
> struct root_domain *rd;
> @@ -5345,7 +5341,8 @@ static struct cpumask ***sched_domains_numa_masks;
> (SD_SHARE_CPUPOWER | \
> SD_SHARE_PKG_RESOURCES | \
> SD_NUMA | \
> - SD_ASYM_PACKING)
> + SD_ASYM_PACKING | \
> + SD_SHARE_POWERDOMAIN)
>
> static struct sched_domain *
> sd_init(struct sched_domain_topology_level *tl, int cpu)
> @@ -5464,7 +5461,7 @@ static struct sched_domain_topology_level default_topology[] = {
> { NULL, },
> };
>
> -static struct sched_domain_topology_level *sched_domain_topology = default_topology;
> +struct sched_domain_topology_level *sched_domain_topology = default_topology;
>
> #define for_each_sd_topology(tl) \
> for (tl = sched_domain_topology; tl->mask; tl++)
> --
> 1.7.9.5
>
>