linux-kernel - Re: [PATCH] PM: EM: Fix late boot with holes in CPU topology


Message-ID: <CAJZ5v0idnFDYviDBusv8hvFD+yH71kL=Q_ARpn5cUBbAg838RQ@mail.gmail.com>
Date: Mon, 1 Sep 2025 18:58:31 +0200
From: "Rafael J. Wysocki" <rafael@...nel.org>
To: Christian Loehle <christian.loehle@....com>
Cc: rafael@...nel.org, lukasz.luba@....com, linux-pm@...r.kernel.org, 
	linux-kernel@...r.kernel.org, dietmar.eggemann@....com, 
	kenneth.crudup@...il.com, stable@...r.kernel.org
Subject: Re: [PATCH] PM: EM: Fix late boot with holes in CPU topology

On Sun, Aug 31, 2025 at 11:44 PM Christian Loehle
<christian.loehle@....com> wrote:
>
> commit e3f1164fc9ee ("PM: EM: Support late CPUs booting and capacity
> adjustment") added a mechanism to handle CPUs that come up late by
> retrying when any of the `cpufreq_cpu_get()` calls fails.
>
> However, if there are holes in the CPU topology (offline CPUs, e.g.
> nosmt), the first missing CPU causes the loop to break, preventing
> subsequent online CPUs from being updated.
> Instead of aborting on the first missing CPU policy, loop through all
> and retry if any were missing.
>
> Fixes: e3f1164fc9ee ("PM: EM: Support late CPUs booting and capacity adjustment")
> Suggested-by: Kenneth Crudup <kenneth.crudup@...il.com>
> Reported-by: Kenneth Crudup <kenneth.crudup@...il.com>
> Closes: https://lore.kernel.org/linux-pm/40212796-734c-4140-8a85-854f72b8144d@panix.com/
> Cc: stable@...r.kernel.org
> Signed-off-by: Christian Loehle <christian.loehle@....com>
> ---
>  kernel/power/energy_model.c | 13 ++++++++-----
>  1 file changed, 8 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
> index ea7995a25780..b63c2afc1379 100644
> --- a/kernel/power/energy_model.c
> +++ b/kernel/power/energy_model.c
> @@ -778,7 +778,7 @@ void em_adjust_cpu_capacity(unsigned int cpu)
>  static void em_check_capacity_update(void)
>  {
>         cpumask_var_t cpu_done_mask;
> -       int cpu;
> +       int cpu, failed_cpus = 0;
>
>         if (!zalloc_cpumask_var(&cpu_done_mask, GFP_KERNEL)) {
>                 pr_warn("no free memory\n");
> @@ -796,10 +796,8 @@ static void em_check_capacity_update(void)
>
>                 policy = cpufreq_cpu_get(cpu);
>                 if (!policy) {
> -                       pr_debug("Accessing cpu%d policy failed\n", cpu);

I'm still not sure why you want to stop printing this message.  Knowing
which policies had to be retried is useful, whereas printing only the
number of them isn't particularly useful.  And this is pr_debug(), so it
is user-selectable anyway.

So I'm inclined to retain the line above and drop the new pr_debug() below.

Please let me know if this is a problem.

> -                       schedule_delayed_work(&em_update_work,
> -                                             msecs_to_jiffies(1000));
> -                       break;
> +                       failed_cpus++;
> +                       continue;
>                 }
>                 cpufreq_cpu_put(policy);
>
> @@ -814,6 +812,11 @@ static void em_check_capacity_update(void)
>                 em_adjust_new_capacity(cpu, dev, pd);
>         }
>
> +       if (failed_cpus) {
> +               pr_debug("Accessing %d policies failed, retrying\n", failed_cpus);
> +               schedule_delayed_work(&em_update_work, msecs_to_jiffies(1000));
> +       }
> +
>         free_cpumask_var(cpu_done_mask);
>  }
>
> --
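
For illustration only, the behavioural difference discussed in this
thread can be sketched as a small userspace C model.  This is not the
kernel code: policy_available(), update_break_on_first() and
update_skip_missing() are hypothetical names, the has_policy array stands
in for cpufreq_cpu_get() succeeding or failing, and fprintf() stands in
for pr_debug().  The second function keeps the per-CPU message, as
suggested in the reply above, and uses a flag rather than a counter to
decide whether a retry is needed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for cpufreq_cpu_get(): true if the CPU has a policy. */
static bool policy_available(const bool *has_policy, int cpu)
{
	return has_policy[cpu];
}

/*
 * Model of the old loop: the first CPU without a policy ends the walk,
 * so later CPUs are never updated.  Returns true if a retry is needed.
 */
static bool update_break_on_first(const bool *has_policy, bool *updated, int nr)
{
	assert(has_policy && updated);

	for (int cpu = 0; cpu < nr; cpu++) {
		if (!policy_available(has_policy, cpu)) {
			fprintf(stderr, "Accessing cpu%d policy failed\n", cpu);
			return true;	/* models schedule + break */
		}
		updated[cpu] = true;	/* models the capacity update */
	}
	return false;
}

/*
 * Model of the fixed loop: a CPU without a policy is skipped, the walk
 * continues, and a single retry is requested after the loop.
 */
static bool update_skip_missing(const bool *has_policy, bool *updated, int nr)
{
	bool retry = false;

	assert(has_policy && updated);

	for (int cpu = 0; cpu < nr; cpu++) {
		if (!policy_available(has_policy, cpu)) {
			fprintf(stderr, "Accessing cpu%d policy failed\n", cpu);
			retry = true;
			continue;
		}
		updated[cpu] = true;
	}
	return retry;	/* caller would schedule the delayed work once */
}
```

With a policy missing only for CPU 1 out of four CPUs, the first variant
updates CPU 0 alone, while the second updates CPUs 0, 2 and 3; both
report that a retry is needed.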