linux-kernel - Re: [PATCH v2 8/9] sched/fair: Detect capacity inversion

Message-ID: <20221112193522.g4hhpdlywndvik7r@airbuntu>
Date:   Sat, 12 Nov 2022 19:35:22 +0000
From:   Qais Yousef <qyousef@...alina.io>
To:     Dietmar Eggemann <dietmar.eggemann@....com>
Cc:     Ingo Molnar <mingo@...nel.org>,
        "Peter Zijlstra (Intel)" <peterz@...radead.org>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        linux-kernel@...r.kernel.org, Xuewen Yan <xuewen.yan94@...il.com>,
        Lukasz Luba <lukasz.luba@....com>, Wei Wang <wvw@...gle.com>,
        Jonathan JMChen <Jonathan.JMChen@...iatek.com>,
        Hank <han.lin@...iatek.com>
Subject: Re: [PATCH v2 8/9] sched/fair: Detect capacity inversion

On 11/09/22 11:42, Dietmar Eggemann wrote:

[...]

> > +	/*
> > +	 * Detect if the performance domain is in capacity inversion state.
> > +	 *
> > +	 * Capacity inversion happens when another perf domain with equal or
> > +	 * lower capacity_orig_of() ends up having higher capacity than this
> > +	 * domain after subtracting thermal pressure.
> > +	 *
> > +	 * We only take into account thermal pressure in this detection as it's
> > +	 * the only metric that actually results in *real* reduction of
> > +	 * capacity due to performance points (OPPs) being dropped/become
> > +	 * unreachable due to thermal throttling.
> > +	 *
> > +	 * We assume:
> > +	 *   * That all cpus in a perf domain have the same capacity_orig
> > +	 *     (same uArch).
> > +	 *   * Thermal pressure will impact all cpus in this perf domain
> > +	 *     equally.
> > +	 */
> > +	if (static_branch_unlikely(&sched_asym_cpucapacity)) {
> 
> This should be sched_energy_enabled(). Performance Domains (PDs) are an
> EAS thing.

Bummer. I had a version that used cpumasks only, but I thought using pds would
be cleaner and would avoid unnecessary extra traversal. I missed that pds are
conditional on sched_energy_enabled().

This is not good news for CAS.

> 
> > +		unsigned long inv_cap = capacity_orig - thermal_load_avg(rq);
> 
> rcu_read_lock()
> 
> > +		struct perf_domain *pd = rcu_dereference(rq->rd->pd);
> 
> rcu_read_unlock()

Shouldn't we continue to hold it while traversing the pd too?

> 
> It's called from build_sched_domains() too. I assume
> static_branch_unlikely(&sched_asym_cpucapacity) hides this issue so far.
> 
> > +
> > +		rq->cpu_capacity_inverted = 0;
> > +
> > +		for (; pd; pd = pd->next) {
> > +			struct cpumask *pd_span = perf_domain_span(pd);
> > +			unsigned long pd_cap_orig, pd_cap;
> > +
> > +			cpu = cpumask_any(pd_span);
> > +			pd_cap_orig = arch_scale_cpu_capacity(cpu);
> > +
> > +			if (capacity_orig < pd_cap_orig)
> > +				continue;
> > +
> > +			/*
> > +			 * handle the case of multiple perf domains have the
> > +			 * same capacity_orig but one of them is under higher
> 
> Like I said above, I'm not aware of such an EAS system.

I did argue against that. But Vincent's PoV was that we shouldn't make
assumptions and should handle the case where each big core is in its own domain.

> 
> > +			 * thermal pressure. We record it as capacity
> > +			 * inversion.
> > +			 */
> > +			if (capacity_orig == pd_cap_orig) {
> > +				pd_cap = pd_cap_orig - thermal_load_avg(cpu_rq(cpu));
> > +
> > +				if (pd_cap > inv_cap) {
> > +					rq->cpu_capacity_inverted = inv_cap;
> > +					break;
> > +				}
> 
> In case `capacity_orig == pd_cap_orig` and cpumask_test_cpu(cpu_of(rq),
> pd_span) the code can set rq->cpu_capacity_inverted = inv_cap
> erroneously since thermal_load_avg(rq) can return different values for
> inv_cap and pd_cap.

Good catch!

> 
> So even on a classical big little system, this condition can set
> rq->cpu_capacity_inverted for a CPU in the little or big cluster.
> 
> thermal_load_avg(rq) would have to stay constant for all CPUs within the
> PD to avoid this.
> 
> This is one example of the `thermal pressure` is per PD (or Frequency
> Domain) in Thermal but per-CPU in the task scheduler.

Only compile tested so far; does this patch address all your points? I should
get hardware soon to run some tests and send the patch. I might rewrite it to
avoid using pds; it seems cleaner this way, but we would miss CAS support.

Thoughts?


diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 89dadaafc1ec..b01854984994 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8856,16 +8856,24 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
         *   * Thermal pressure will impact all cpus in this perf domain
         *     equally.
         */
-       if (static_branch_unlikely(&sched_asym_cpucapacity)) {
-               unsigned long inv_cap = capacity_orig - thermal_load_avg(rq);
-               struct perf_domain *pd = rcu_dereference(rq->rd->pd);
+       if (sched_energy_enabled()) {
+               struct perf_domain *pd;
+               unsigned long inv_cap;
+
+               rcu_read_lock();

+               inv_cap = capacity_orig - thermal_load_avg(rq);
+               pd = rcu_dereference(rq->rd->pd);
                rq->cpu_capacity_inverted = 0;

                for (; pd; pd = pd->next) {
                        struct cpumask *pd_span = perf_domain_span(pd);
                        unsigned long pd_cap_orig, pd_cap;

+                       /* We can't be inverted against our own pd */
+                       if (cpumask_test_cpu(cpu_of(rq), pd_span))
+                               continue;
+
                        cpu = cpumask_any(pd_span);
                        pd_cap_orig = arch_scale_cpu_capacity(cpu);

@@ -8890,6 +8898,8 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
                                break;
                        }
                }
+
+               rcu_read_unlock();
        }

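To make the own-pd false positive discussed above concrete, here is a minimal
userspace sketch of the detection loop. It is not kernel code: struct model_pd,
model_inverted_cap() and model_big_little() are hypothetical names invented for
illustration, the cpumask is reduced to a bitmask, and RCU is left out
entirely. The comparison for lower-capacity domains follows the wording of the
code comment (capacity after subtracting thermal pressure), since that branch
is not fully quoted in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a perf domain; NOT the kernel's struct perf_domain. */
struct model_pd {
	unsigned long cap_orig;	/* models arch_scale_cpu_capacity() of a CPU in the pd */
	unsigned long thermal;	/* models thermal_load_avg() of the CPU picked from the pd */
	unsigned int span;	/* bitmask of CPU ids, models perf_domain_span() */
	const struct model_pd *next;
};

/*
 * Returns the inverted capacity (capacity_orig - thermal) if another domain
 * with equal or lower original capacity ends up with more capacity after
 * thermal pressure, or 0 if no inversion is detected.
 *
 * skip_own_pd models the "We can't be inverted against our own pd" check.
 */
static unsigned long model_inverted_cap(int cpu, unsigned long capacity_orig,
					unsigned long thermal,
					const struct model_pd *pd,
					bool skip_own_pd)
{
	unsigned long inv_cap;

	assert(thermal <= capacity_orig);
	inv_cap = capacity_orig - thermal;

	for (; pd; pd = pd->next) {
		unsigned long pd_cap;

		/* The fix: never compare a CPU against its own domain. */
		if (skip_own_pd && (pd->span & (1u << cpu)))
			continue;

		/* Only equal or lower capacity domains can invert us. */
		if (capacity_orig < pd->cap_orig)
			continue;

		pd_cap = pd->cap_orig - pd->thermal;
		if (pd_cap > inv_cap)
			return inv_cap;
	}

	return 0;
}

/*
 * Assumed topology: CPUs 0-3 little (capacity 512, no thermal pressure),
 * CPUs 4-7 big (capacity 1024). The rq under test is CPU 4; sibling_thermal
 * is the value read from whichever big CPU gets sampled for the big pd.
 */
static unsigned long model_big_little(unsigned long rq_thermal,
				      unsigned long sibling_thermal,
				      bool skip_own_pd)
{
	const struct model_pd little = {
		.cap_orig = 512, .thermal = 0, .span = 0x0f, .next = NULL,
	};
	const struct model_pd big = {
		.cap_orig = 1024, .thermal = sibling_thermal, .span = 0xf0,
		.next = &little,
	};

	return model_inverted_cap(4, 1024, rq_thermal, &big, skip_own_pd);
}
```

With rq_thermal = 100 and sibling_thermal = 80, the unpatched behaviour
(skip_own_pd = false) reports CPU 4 as inverted against its own big pd, purely
because the two per-CPU thermal readings differ. With the own-pd check enabled
the same inputs report no inversion, while a heavily throttled big CPU
(rq_thermal = 600) is still detected as inverted against the little pd.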

Thanks!

--
Qais Yousef