linux-kernel - Re: [RFC PATCH 3/3] sched/fair: Traverse cpufreq policies to detect capacity inversion

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20221220115130.lhhakj36kn3opqtz@airbuntu>
Date:   Tue, 20 Dec 2022 11:51:30 +0000
From:   Qais Yousef <qyousef@...alina.io>
To:     Lukasz Luba <lukasz.luba@....com>
Cc:     Ingo Molnar <mingo@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        "Rafael J. Wysocki" <rafael@...nel.org>,
        Viresh Kumar <viresh.kumar@...aro.org>,
        linux-pm@...r.kernel.org,
        Vincent Guittot <vincent.guittot@...aro.org>,
        linux-kernel@...r.kernel.org, Wei Wang <wvw@...gle.com>,
        Xuewen Yan <xuewen.yan94@...il.com>,
        Hank <han.lin@...iatek.com>,
        Jonathan JMChen <Jonathan.JMChen@...iatek.com>
Subject: Re: [RFC PATCH 3/3] sched/fair: Traverse cpufreq policies to detect
 capacity inversion

On 12/13/22 17:42, Lukasz Luba wrote:
> Hi Qais,
> 
> I thought I could help with this issue.

Thanks Lukasz!

> 
> On 12/12/22 18:43, Qais Yousef wrote:
> > On 12/09/22 17:47, Vincent Guittot wrote:
> > 
> > [...]
> > 
> > > > > > > This patch loops on all cpufreq policy in sched softirq, how can this
> > > > > > > be sane ? and not only in eas mode but also in the default asymmetric
> > > > > > 
> > > > > > Hmm I'm still puzzled. Why it's not sane to do it here but it's okay to do it
> > > > > > in the wake up path in feec()?
> > > > > 
> > > > > feec() should be considered as an exception not as the default rule.
> > > > > Thing like above which loops for_each on external subsystem should be
> > > > > prevented and the fact that feec loops all PDs doesn't means that we
> > > > > can put that everywhere else
> > > > 
> > > > Fair enough. But really understanding the root cause behind this limitation
> > > > will be very helpful. I don't have the same appreciation of why this is
> > > > a problem, and shedding more light will help me to think more about it in the
> > > > future.
> > > > 
> > > 
> > > Take the example of 1k cores with per cpu policy. Do you really think a
> > > for_each_cpufreq_policy would be reasonable ?
> > 
> > Hmm I don't think such an HMP system makes sense to ever exist.
> > 
> > That system has to be a multi-socket system and I doubt inversion detection is
> > something of value.
> > 
> > Point taken anyway. Let's find another way to do this.
> > 
> 
> Another way might be to use the 'update' code path, which sets this
> information source, for the thermal pressure. That code path isn't as
> hot as this in the task scheduler. Furthermore, we would also
> have time and handle properly CPU hotplug callbacks there.
> 
> So something like this, I have in mind:
> 
> ------------------------------8<-----------------------------
> diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> index e7d6e6657ffa..7f372a93e21b 100644
> --- a/drivers/base/arch_topology.c
> +++ b/drivers/base/arch_topology.c
> @@ -16,6 +16,7 @@
>  #include <linux/sched/topology.h>
>  #include <linux/cpuset.h>
>  #include <linux/cpumask.h>
> +#include <linux/mutex.h>
>  #include <linux/init.h>
>  #include <linux/rcupdate.h>
>  #include <linux/sched.h>
> @@ -153,6 +154,33 @@ void topology_set_freq_scale(const struct cpumask
> *cpus, unsigned long cur_freq,
>  DEFINE_PER_CPU(unsigned long, cpu_scale) = SCHED_CAPACITY_SCALE;
>  DEFINE_PER_CPU(unsigned long, cpu_scale) = SCHED_CAPACITY_SCALE;
>  EXPORT_PER_CPU_SYMBOL_GPL(cpu_scale);
> 
> +static struct cpumask highest_capacity_mask;
> 
> +static struct cpumask highest_capacity_mask;
> +static unsigned int max_possible_capacity;
> +static DEFINE_MUTEX(max_capacity_lock);
> +
> +static void max_capacity_update(const struct cpumask *cpus,
> +                               unsigned long capacity)
> +{
> +       mutex_lock(&max_capacity_lock);
> +
> +       if (max_possible_capacity < capacity) {
> +               max_possible_capacity = capacity;
> +
> +               cpumask_clear(&highest_capacity_mask);
> +
> +               cpumask_or(&highest_capacity_mask,
> +                          &highest_capacity_mask, cpus);
> +       }
> +
> +       mutex_unlock(&max_capacity_lock);
> +}
> +
> +bool topology_test_max_cpu_capacity(unsigned int cpu)
> +{
> +       return cpumask_test_cpu(cpu, &highest_capacity_mask);
> +}
> +EXPORT_SYMBOL_GPL(topology_test_max_cpu_capacity);
> +
>  void topology_set_cpu_scale(unsigned int cpu, unsigned long capacity)
>  {
>         per_cpu(cpu_scale, cpu) = capacity;
> @@ -203,6 +231,8 @@ void topology_update_thermal_pressure(const struct
> cpumask *cpus,
> 
>         for_each_cpu(cpu, cpus)
>                 WRITE_ONCE(per_cpu(thermal_pressure, cpu), th_pressure);
> +
> +       max_capacity_update(cpus, capacity);
>  }
>  EXPORT_SYMBOL_GPL(topology_update_thermal_pressure);
> 
> 
> --------------------------->8--------------------------------
> 
> We could use the RCU if there is a potential to read racy date
> while the updater modifies the mask in the meantime. Mutex is to
> serialize the thermal writers which might be kicked for two
> policies at the same time.
> 
> If you like I can develop and test such code in the arch_topology.c

As we discussed offline, Vincent is keen on decoupling the util_fits_cpu()
logic from HMP - which means I need to reword this differently.

Let's keep this in the back burner in case we need to revisit it again.

Appreciate the proposal!!


Many thanks

--
Qais Yousef