Message-ID: <19bd3f60-63ea-4ccc-b5a2-6507276c8f0d@arm.com>
Date: Tue, 13 Dec 2022 17:42:40 +0000
From: Lukasz Luba <lukasz.luba@....com>
To: Qais Yousef <qyousef@...alina.io>
Cc: Ingo Molnar <mingo@...nel.org>,
Peter Zijlstra <peterz@...radead.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
"Rafael J. Wysocki" <rafael@...nel.org>,
Viresh Kumar <viresh.kumar@...aro.org>,
linux-pm@...r.kernel.org,
Vincent Guittot <vincent.guittot@...aro.org>,
linux-kernel@...r.kernel.org, Wei Wang <wvw@...gle.com>,
Xuewen Yan <xuewen.yan94@...il.com>,
Hank <han.lin@...iatek.com>,
Jonathan JMChen <Jonathan.JMChen@...iatek.com>
Subject: Re: [RFC PATCH 3/3] sched/fair: Traverse cpufreq policies to detect
capacity inversion
Hi Qais,

I thought I could help with this issue.

On 12/12/22 18:43, Qais Yousef wrote:
> On 12/09/22 17:47, Vincent Guittot wrote:
>
> [...]
>
>>>>>> This patch loops over all cpufreq policies in the sched softirq; how can this
>>>>>> be sane? And not only in EAS mode but also in the default asymmetric
>>>>>
>>>>> Hmm, I'm still puzzled. Why is it not sane to do it here, but okay to do it
>>>>> in the wakeup path in feec()?
>>>>
>>>> feec() should be considered an exception, not the default rule.
>>>> Things like the above, which loop for_each over an external subsystem,
>>>> should be prevented, and the fact that feec() loops over all PDs doesn't
>>>> mean that we can put that everywhere else.
>>>
>>> Fair enough. But really understanding the root cause behind this limitation
>>> will be very helpful. I don't have the same appreciation of why this is
>>> a problem, and shedding more light will help me to think more about it in the
>>> future.
>>>
>>
>> Take the example of 1k cores with per-CPU policies. Do you really think a
>> for_each_cpufreq_policy would be reasonable?
>
> Hmm, I don't think it makes sense for such an HMP system to ever exist.
>
> That system would have to be a multi-socket system, and I doubt inversion
> detection is of value there.
>
> Point taken anyway. Let's find another way to do this.
>
Another way might be to use the 'update' code path for thermal pressure,
which sets this information source. That code path isn't as hot as this
one in the task scheduler. Furthermore, we would also have time there,
and could properly handle CPU hotplug callbacks.

So I have something like this in mind:
------------------------------8<-----------------------------
diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index e7d6e6657ffa..7f372a93e21b 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -16,6 +16,7 @@
 #include <linux/sched/topology.h>
 #include <linux/cpuset.h>
 #include <linux/cpumask.h>
+#include <linux/mutex.h>
 #include <linux/init.h>
 #include <linux/rcupdate.h>
 #include <linux/sched.h>
@@ -153,6 +154,33 @@ void topology_set_freq_scale(const struct cpumask *cpus, unsigned long cur_freq,
 DEFINE_PER_CPU(unsigned long, cpu_scale) = SCHED_CAPACITY_SCALE;
 EXPORT_PER_CPU_SYMBOL_GPL(cpu_scale);
 
+static struct cpumask highest_capacity_mask;
+static unsigned int max_possible_capacity;
+static DEFINE_MUTEX(max_capacity_lock);
+
+static void max_capacity_update(const struct cpumask *cpus,
+				unsigned long capacity)
+{
+	mutex_lock(&max_capacity_lock);
+
+	if (max_possible_capacity < capacity) {
+		max_possible_capacity = capacity;
+
+		cpumask_clear(&highest_capacity_mask);
+
+		cpumask_or(&highest_capacity_mask,
+			   &highest_capacity_mask, cpus);
+	}
+
+	mutex_unlock(&max_capacity_lock);
+}
+
+bool topology_test_max_cpu_capacity(unsigned int cpu)
+{
+	return cpumask_test_cpu(cpu, &highest_capacity_mask);
+}
+EXPORT_SYMBOL_GPL(topology_test_max_cpu_capacity);
+
 void topology_set_cpu_scale(unsigned int cpu, unsigned long capacity)
 {
 	per_cpu(cpu_scale, cpu) = capacity;
@@ -203,6 +231,8 @@ void topology_update_thermal_pressure(const struct cpumask *cpus,
 	for_each_cpu(cpu, cpus)
 		WRITE_ONCE(per_cpu(thermal_pressure, cpu), th_pressure);
+
+	max_capacity_update(cpus, capacity);
 }
 EXPORT_SYMBOL_GPL(topology_update_thermal_pressure);
--------------------------->8--------------------------------
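For the scheduler side, I imagine the consumer could then stop walking
cpufreq policies and simply ask the topology code. Something along these
lines (untested, hand-wavy sketch; cpu_capacity_inverted() and ref_cap
are just illustrative names, the real hook is up to you):

static inline bool cpu_capacity_inverted(int cpu, unsigned long ref_cap)
{
	unsigned long cap;

	/* Only CPUs from the highest-capacity group can be inverted here. */
	if (!topology_test_max_cpu_capacity(cpu))
		return false;

	/* Thermally capped capacity of @cpu. */
	cap = arch_scale_cpu_capacity(cpu) - arch_scale_thermal_pressure(cpu);

	/* Inverted when it drops below e.g. the next-lower group's capacity. */
	return cap < ref_cap;
}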
We could use RCU if there is a potential for reading racy data while the
updater modifies the mask in the meantime. The mutex is there to
serialize the thermal writers, which might be kicked for two policies at
the same time.
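If we went the RCU route, I would probably double-buffer the mask so
readers never observe a partially updated cpumask. A rough sketch (again
untested, the two-buffer names are made up):

static struct cpumask highest_capacity_masks[2];
static struct cpumask __rcu *highest_capacity_ptr;
static int highest_capacity_idx;

bool topology_test_max_cpu_capacity(unsigned int cpu)
{
	struct cpumask *mask;
	bool ret;

	rcu_read_lock();
	mask = rcu_dereference(highest_capacity_ptr);
	ret = mask ? cpumask_test_cpu(cpu, mask) : false;
	rcu_read_unlock();

	return ret;
}

static void max_capacity_update(const struct cpumask *cpus,
				unsigned long capacity)
{
	struct cpumask *next;

	mutex_lock(&max_capacity_lock);

	if (max_possible_capacity < capacity) {
		max_possible_capacity = capacity;

		/* Build the new mask in the spare buffer... */
		next = &highest_capacity_masks[highest_capacity_idx ^ 1];
		cpumask_copy(next, cpus);

		/* ...then publish it and flip the active buffer. */
		rcu_assign_pointer(highest_capacity_ptr, next);
		highest_capacity_idx ^= 1;

		/* Let old readers drain before the spare can be reused. */
		synchronize_rcu();
	}

	mutex_unlock(&max_capacity_lock);
}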
If you like, I can develop and test such code in arch_topology.c.

Regards,
Lukasz