Message-ID: <20221112193522.g4hhpdlywndvik7r@airbuntu>
Date: Sat, 12 Nov 2022 19:35:22 +0000
From: Qais Yousef <qyousef@...alina.io>
To: Dietmar Eggemann <dietmar.eggemann@....com>
Cc: Ingo Molnar <mingo@...nel.org>,
"Peter Zijlstra (Intel)" <peterz@...radead.org>,
Vincent Guittot <vincent.guittot@...aro.org>,
linux-kernel@...r.kernel.org, Xuewen Yan <xuewen.yan94@...il.com>,
Lukasz Luba <lukasz.luba@....com>, Wei Wang <wvw@...gle.com>,
Jonathan JMChen <Jonathan.JMChen@...iatek.com>,
Hank <han.lin@...iatek.com>
Subject: Re: [PATCH v2 8/9] sched/fair: Detect capacity inversion
On 11/09/22 11:42, Dietmar Eggemann wrote:
[...]
> > + /*
> > + * Detect if the performance domain is in capacity inversion state.
> > + *
> > + * Capacity inversion happens when another perf domain with equal or
> > + * lower capacity_orig_of() ends up having higher capacity than this
> > + * domain after subtracting thermal pressure.
> > + *
> > + * We only take into account thermal pressure in this detection as it's
> > + * the only metric that actually results in *real* reduction of
> > + * capacity due to performance points (OPPs) being dropped/become
> > + * unreachable due to thermal throttling.
> > + *
> > + * We assume:
> > + * * That all cpus in a perf domain have the same capacity_orig
> > + * (same uArch).
> > + * * Thermal pressure will impact all cpus in this perf domain
> > + * equally.
> > + */
> > + if (static_branch_unlikely(&sched_asym_cpucapacity)) {
>
> This should be sched_energy_enabled(). Performance Domains (PDs) are an
> EAS thing.
Bummer. I had a version that used cpumasks only, but I thought using pds was
cleaner and would avoid unnecessary extra traversal. I missed that pds are
conditional on sched_energy_enabled().
This is not good news for CAS.
>
> > + unsigned long inv_cap = capacity_orig - thermal_load_avg(rq);
>
> rcu_read_lock()
>
> > + struct perf_domain *pd = rcu_dereference(rq->rd->pd);
>
> rcu_read_unlock()
Shouldn't we continue to hold it while traversing the pd too?
>
> It's called from build_sched_domains() too. I assume
> static_branch_unlikely(&sched_asym_cpucapacity) hides this issue so far.
>
> > +
> > + rq->cpu_capacity_inverted = 0;
> > +
> > + for (; pd; pd = pd->next) {
> > + struct cpumask *pd_span = perf_domain_span(pd);
> > + unsigned long pd_cap_orig, pd_cap;
> > +
> > + cpu = cpumask_any(pd_span);
> > + pd_cap_orig = arch_scale_cpu_capacity(cpu);
> > +
> > + if (capacity_orig < pd_cap_orig)
> > + continue;
> > +
> > + /*
> > + * handle the case of multiple perf domains have the
> > + * same capacity_orig but one of them is under higher
>
> Like I said above, I'm not aware of such an EAS system.
I did argue against that. But Vincent's PoV was that we shouldn't make
assumptions, and should handle the case where each big core is in its own domain.
>
> > + * thermal pressure. We record it as capacity
> > + * inversion.
> > + */
> > + if (capacity_orig == pd_cap_orig) {
> > + pd_cap = pd_cap_orig - thermal_load_avg(cpu_rq(cpu));
> > +
> > + if (pd_cap > inv_cap) {
> > + rq->cpu_capacity_inverted = inv_cap;
> > + break;
> > + }
>
> In case `capacity_orig == pd_cap_orig` and cpumask_test_cpu(cpu_of(rq),
> pd_span) the code can set rq->cpu_capacity_inverted = inv_cap
> erroneously since thermal_load_avg(rq) can return different values for
> inv_cap and pd_cap.
Good catch!
>
> So even on a classical big little system, this condition can set
> rq->cpu_capacity_inverted for a CPU in the little or big cluster.
>
> thermal_load_avg(rq) would have to stay constant for all CPUs within the
> PD to avoid this.
>
> This is one example of the `thermal pressure` is per PD (or Frequency
> Domain) in Thermal but per-CPU in the task scheduler.
Only compile tested so far. Does the patch below address all your points? I
should get hardware soon to run some tests and send the patch. I might rewrite
it to avoid using pds; it seems cleaner this way, but we miss CAS support.
Thoughts?
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 89dadaafc1ec..b01854984994 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8856,16 +8856,24 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
* * Thermal pressure will impact all cpus in this perf domain
* equally.
*/
- if (static_branch_unlikely(&sched_asym_cpucapacity)) {
- unsigned long inv_cap = capacity_orig - thermal_load_avg(rq);
- struct perf_domain *pd = rcu_dereference(rq->rd->pd);
+ if (sched_energy_enabled()) {
+ struct perf_domain *pd;
+ unsigned long inv_cap;
+
+ rcu_read_lock();
+ inv_cap = capacity_orig - thermal_load_avg(rq);
+ pd = rcu_dereference(rq->rd->pd);
rq->cpu_capacity_inverted = 0;
for (; pd; pd = pd->next) {
struct cpumask *pd_span = perf_domain_span(pd);
unsigned long pd_cap_orig, pd_cap;
+ /* We can't be inverted against our own pd */
+ if (cpumask_test_cpu(cpu_of(rq), pd_span))
+ continue;
+
cpu = cpumask_any(pd_span);
pd_cap_orig = arch_scale_cpu_capacity(cpu);
@@ -8890,6 +8898,8 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
break;
}
}
+
+ rcu_read_unlock();
}
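For readers following the logic outside the kernel tree, here is a minimal
userspace sketch of the inversion check discussed above. It is a model under
stated assumptions, not the patch itself: struct model_pd,
model_capacity_inverted() and model_example() are hypothetical names, thermal
pressure is stored once per domain (so the "inverted against our own pd" case
cannot arise from differing per-CPU values), and domains with lower original
capacity are compared after subtracting their own thermal pressure, which is an
assumption since that branch is not visible in the hunks quoted here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct perf_domain: one entry per domain. */
struct model_pd {
	unsigned long cap_orig;	/* models arch_scale_cpu_capacity() */
	unsigned long thermal;	/* models thermal_load_avg(), per domain */
	struct model_pd *next;
};

/* Capacity left after thermal pressure, clamped at 0 to avoid underflow. */
static unsigned long model_cap(const struct model_pd *pd)
{
	return pd->thermal >= pd->cap_orig ? 0 : pd->cap_orig - pd->thermal;
}

/*
 * Return the thermally reduced capacity of @self if some other domain with
 * equal or lower original capacity currently offers more capacity than
 * @self does, or 0 if @self is not in capacity inversion.
 */
static unsigned long model_capacity_inverted(const struct model_pd *head,
					     const struct model_pd *self)
{
	const struct model_pd *pd;
	unsigned long inv_cap;

	assert(self != NULL);
	inv_cap = model_cap(self);

	for (pd = head; pd; pd = pd->next) {
		/* We can't be inverted against our own pd */
		if (pd == self)
			continue;

		/* A domain that is bigger to begin with is not an inversion */
		if (self->cap_orig < pd->cap_orig)
			continue;

		if (model_cap(pd) > inv_cap)
			return inv_cap;
	}

	return 0;
}

/* Example: little (400), mid (800), big (1024) throttled by 500. */
static unsigned long model_example(unsigned int idx)
{
	struct model_pd pds[3] = {
		{ .cap_orig = 400,  .thermal = 0 },
		{ .cap_orig = 800,  .thermal = 0 },
		{ .cap_orig = 1024, .thermal = 500 },
	};

	assert(idx < 3);
	pds[0].next = &pds[1];
	pds[1].next = &pds[2];

	return model_capacity_inverted(&pds[0], &pds[idx]);
}
```

In this example the throttled big domain is left with 524, which is below the
mid domain's 800, so model_example(2) reports inversion while the little and
mid domains do not.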
Thanks!
--
Qais Yousef