linux-kernel - [RFC][PATCH v021 7/9] PM: EM: Register perf domains with ho :active

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <3278518.5fSG56mABF@rjwysocki.net>
Date: Fri, 29 Nov 2024 17:15:45 +0100
From: "Rafael J. Wysocki" <rjw@...ysocki.net>
To: Linux PM <linux-pm@...r.kernel.org>
Cc: LKML <linux-kernel@...r.kernel.org>, Lukasz Luba <lukasz.luba@....com>,
 Peter Zijlstra <peterz@...radead.org>,
 Srinivas Pandruvada <srinivas.pandruvada@...ux.intel.com>,
 Dietmar Eggemann <dietmar.eggemann@....com>,
 Morten Rasmussen <morten.rasmussen@....com>,
 Vincent Guittot <vincent.guittot@...aro.org>,
 Ricardo Neri <ricardo.neri-calderon@...ux.intel.com>,
 Pierre Gondois <pierre.gondois@....com>
Subject:
 [RFC][PATCH v021 7/9] PM: EM: Register perf domains with ho :active_power()
 callbacks

From: Rafael J. Wysocki <rafael.j.wysocki@...el.com>

Allow em_dev_register_perf_domain() to register a cost-only perf domain
with a one-element states table if the :active_power() callback is not
provided.

Subsequently, this will be used by the intel_pstate driver to register
such perf domains for CPUs on hybrid systems.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@...el.com>
---

v0.1 -> v0.2:
     * Do not add one more local variable to em_dev_register_perf_domain()
       and make it skip the capacity check if the :active_power() callback
       is NULL.  Add a comment explaining why it is correct to do so.

For the time being, intel_pstate will only need to be able to say
something like "this group of CPUs is preferred to that other group
of CPUs for energy-efficiency reasons regardless of the required
performance level" (and there may be multiple groups).

For this purpose, it only needs a one-element states table in a perf
domain for each of the groups and it only needs to set the cost value
for each of them to reflect the preference.

However, it may need to put CPUs of different capacity into the same
group because of favored cores and this doesn't mean that one
group will contain CPUs of different mincro-architectures.  Favored
cores essentially follow the same power-performance curve as the
other CPUs of the same type up to a certain point, but they can get
beyond that point on the curve which the other CPUs of the same type
cannot do.  Thus it would make sense to use the same states table
for favored cores and the other CPUs of the same type even if there
were multiple states in it, and em_pd_get_efficient_state() could
easily take that into account by only using the states where
"performance" is at most equal to the current CPU capacity.

Unfortunately, this doesn't match the em_create_perf_table() code
design and in particular em_init_performance() which scales the
"performance" values the the current capacity of the "domain leader"
CPU represented by "dev".

IMV, it should instead scale them to the maximum possible capacity
of the highest-capacity CPU in the domain (without requiring equal
capacity at all).

That would also help to handle the cases when CPU capacity changes,
at least on Intel platforms, if it's necessary to worry about this some
day.  Namely, when CPU capacity is reduced, say by the platform firmware
(for thermal or power distribution reasons, for example), this effectively
means chopping off the top-most section of the power-performance curve,
but the rest of it remains intact, so the same states table as before
can be used, only the current CPU capacity needs to be changed to
prevent the top-most performance states from being taken into
consideration (and analogously when CPU capacity increases).

Turbo works pretty much in the same way: When it's enabled, the
power-performance curve is effectively extended by adding some states to
it and when turbo is disabled, the top-most part of the power-performance
curve effectively gets chopped off.

This observation may help to avoid having to update the energy model for
a chip.

---
 kernel/power/energy_model.c |   23 ++++++++++++++++++++---
 1 file changed, 20 insertions(+), 3 deletions(-)

Index: linux-pm/kernel/power/energy_model.c
===================================================================
--- linux-pm.orig/kernel/power/energy_model.c
+++ linux-pm/kernel/power/energy_model.c
@@ -426,9 +426,11 @@ static int em_create_pd(struct device *d
 	if (!em_table)
 		goto free_pd;
 
-	ret = em_create_perf_table(dev, pd, em_table->state, cb, flags);
-	if (ret)
-		goto free_pd_table;
+	if (cb->active_power) {
+		ret = em_create_perf_table(dev, pd, em_table->state, cb, flags);
+		if (ret)
+			goto free_pd_table;
+	}
 
 	ret = em_compute_costs(dev, em_table->state, cb, nr_states, flags);
 	if (ret)
@@ -566,6 +568,9 @@ int em_dev_register_perf_domain(struct d
 	if (!dev || !nr_states || !cb)
 		return -EINVAL;
 
+	if (!cb->active_power && (!cb->get_cost || nr_states > 1 || microwatts))
+		return -EINVAL;
+
 	/*
 	 * Use a mutex to serialize the registration of performance domains and
 	 * let the driver-defined callback functions sleep.
@@ -594,7 +599,19 @@ int em_dev_register_perf_domain(struct d
 			 * All CPUs of a domain must have the same
 			 * micro-architecture since they all share the same
 			 * table.
+			 *
+			 * If the :active_power() callback is present, the
+			 * performance values for different states in the table
+			 * computed by em_create_perf_table() depend on the CPU
+			 * capacity which therefore must be the same for all
+			 * CPUs in the domain.  However, if the :active_power()
+			 * callback is not NULL, em_create_perf_table() doesn't
+			 * run and there is no requirement regarding CPU
+			 * capacity any more.
 			 */
+			if (!cb->active_power)
+				continue;
+
 			cap = arch_scale_cpu_capacity(cpu);
 			if (prev_cap && prev_cap != cap) {
 				dev_err(dev, "EM: CPUs of %*pbl must have the same capacity\n",