Message-ID: <CAJZ5v0iBCKmp-Hs25chq_-z7-VB_+MqTPVmowACJkTz7KOUtEg@mail.gmail.com>
Date: Mon, 25 Nov 2024 15:39:30 +0100
From: "Rafael J. Wysocki" <rafael@...nel.org>
To: Patryk Wlazlyn <patryk.wlazlyn@...ux.intel.com>
Cc: x86@...nel.org, linux-kernel@...r.kernel.org, linux-pm@...r.kernel.org,
rafael.j.wysocki@...el.com, len.brown@...el.com,
artem.bityutskiy@...ux.intel.com, dave.hansen@...ux.intel.com,
peterz@...radead.org, tglx@...utronix.de, gautham.shenoy@....com
Subject: Re: [RFC PATCH v4 4/8] x86/smp: Allow calling mwait_play_dead with
arbitrary hint
On Mon, Nov 25, 2024 at 2:21 PM Patryk Wlazlyn
<patryk.wlazlyn@...ux.intel.com> wrote:
>
> The MWAIT instruction needs different hints on different CPUs to reach
> the deepest idle states. The current hint calculation* in
> mwait_play_dead() works in practice on current hardware, but it fails
> on recent hardware, Intel's Sierra Forest, and possibly on some future
> platforms. On those newer CPUs, power efficiency suffers when a CPU is
> put offline.
>
> * The current calculation is the for loop inspecting edx in
> mwait_play_dead()
>
> The current implementation, which looks up the MWAIT hint for the
> deepest cstate in mwait_play_dead(), works by inspecting CPUID leaf
> 0x5 and calculating the hint from the number of reported substates.
> This approach depends on the hints for those substates being
> contiguous in the range [0, NUM_SUBSTATES-1]. This contiguity is not
> documented and is not met on recent Intel platforms.
>
> For example, Intel's Sierra Forest reports two cstates with two
> substates each in CPUID leaf 0x5:
>
> Name* target cstate target subcstate (mwait hint)
> ===========================================================
> C1 0x00 0x00
> C1E 0x00 0x01
>
> -- 0x10 ----
>
> C6S 0x20 0x22
> C6P 0x20 0x23
>
> -- 0x30 ----
>
> /* No more (sub)states all the way down to the end. */
> ===========================================================
>
> * Names of the cstates are not included in CPUID leaf 0x5; they are
> taken from the product-specific documentation.
>
> Notice that hints 0x20 and 0x21 are skipped entirely for the target
> cstate 0x20 (C6), which is what breaks the current CPUID leaf 0x5
> algorithm.
>
> Allow cpuidle code to provide the MWAIT play_dead loop with a known
> MWAIT hint for the deepest idle state on a given platform, skipping
> the CPUID-based calculation.
>
> Signed-off-by: Patryk Wlazlyn <patryk.wlazlyn@...ux.intel.com>
I'm going to risk saying that the changelog doesn't match the code
changes in the patch any more.
The code changes are actually relatively straightforward: The bottom
half of mwait_play_dead() is split off, so it can be called from
multiple places.
The other places from which to call it are cpuidle drivers
implementing ->enter_dead() callbacks that may want to use MWAIT as
the idle state entry method. The ACPI processor_idle driver and
intel_idle will be updated by subsequent patches to do so.
The reason for it is mostly consistency: If the cpuidle driver uses a
specific idle state for things like suspend-to-idle, it is better to
let it decide what idle state to use for "play_dead" because it may
know better.
Another reason is what mwait_play_dead() does to determine the MWAIT
argument (referred to as a "hint"), but IMO it belongs in a changelog
of a later patch because this one doesn't actually do anything about
it. In fact, it is not expected to change the behavior of the code.
> ---
> arch/x86/include/asm/smp.h | 3 ++
> arch/x86/kernel/smpboot.c | 86 ++++++++++++++++++++------------------
> 2 files changed, 49 insertions(+), 40 deletions(-)
>
> diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
> index ca073f40698f..d12fab4a83c5 100644
> --- a/arch/x86/include/asm/smp.h
> +++ b/arch/x86/include/asm/smp.h
> @@ -114,6 +114,7 @@ void wbinvd_on_cpu(int cpu);
> int wbinvd_on_all_cpus(void);
>
> void smp_kick_mwait_play_dead(void);
> +void mwait_play_dead_with_hint(unsigned long hint);
>
> void native_smp_send_reschedule(int cpu);
> void native_send_call_func_ipi(const struct cpumask *mask);
> @@ -164,6 +165,8 @@ static inline struct cpumask *cpu_llc_shared_mask(int cpu)
> {
> return (struct cpumask *)cpumask_of(0);
> }
> +
> +static inline void mwait_play_dead_with_hint(unsigned long eax_hint) { }
> #endif /* CONFIG_SMP */
>
> #ifdef CONFIG_DEBUG_NMI_SELFTEST
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index b5a8f0891135..d0464c7a0af5 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -1272,13 +1272,57 @@ void play_dead_common(void)
> local_irq_disable();
> }
>
> +void __noreturn mwait_play_dead_with_hint(unsigned long eax_hint)
> +{
> + struct mwait_cpu_dead *md = this_cpu_ptr(&mwait_cpu_dead);
> +
> + /* Set up state for the kexec() hack below */
> + md->status = CPUDEAD_MWAIT_WAIT;
> + md->control = CPUDEAD_MWAIT_WAIT;
> +
> + wbinvd();
> +
> + while (1) {
> + /*
> + * The CLFLUSH is a workaround for erratum AAI65 for
> + * the Xeon 7400 series. It's not clear it is actually
> + * needed, but it should be harmless in either case.
> + * The WBINVD is insufficient due to the spurious-wakeup
> + * case where we return around the loop.
> + */
> + mb();
> + clflush(md);
> + mb();
> + __monitor(md, 0, 0);
> + mb();
> + __mwait(eax_hint, 0);
> +
> + if (READ_ONCE(md->control) == CPUDEAD_MWAIT_KEXEC_HLT) {
> + /*
> + * Kexec is about to happen. Don't go back into mwait() as
> + * the kexec kernel might overwrite text and data including
> + * page tables and stack. So mwait() would resume when the
> + * monitor cache line is written to and then the CPU goes
> + * south due to overwritten text, page tables and stack.
> + *
> + * Note: This does _NOT_ protect against a stray MCE, NMI,
> + * SMI. They will resume execution at the instruction
> + * following the HLT instruction and run into the problem
> + * which this is trying to prevent.
> + */
> + WRITE_ONCE(md->status, CPUDEAD_MWAIT_KEXEC_HLT);
> + while(1)
> + native_halt();
> + }
> + }
> +}
> +
> /*
> * We need to flush the caches before going to sleep, lest we have
> * dirty data in our caches when we come back up.
> */
> static inline void mwait_play_dead(void)
> {
> - struct mwait_cpu_dead *md = this_cpu_ptr(&mwait_cpu_dead);
> unsigned int eax, ebx, ecx, edx;
> unsigned int highest_cstate = 0;
> unsigned int highest_subcstate = 0;
> @@ -1316,45 +1360,7 @@ static inline void mwait_play_dead(void)
> (highest_subcstate - 1);
> }
>
> - /* Set up state for the kexec() hack below */
> - md->status = CPUDEAD_MWAIT_WAIT;
> - md->control = CPUDEAD_MWAIT_WAIT;
> -
> - wbinvd();
> -
> - while (1) {
> - /*
> - * The CLFLUSH is a workaround for erratum AAI65 for
> - * the Xeon 7400 series. It's not clear it is actually
> - * needed, but it should be harmless in either case.
> - * The WBINVD is insufficient due to the spurious-wakeup
> - * case where we return around the loop.
> - */
> - mb();
> - clflush(md);
> - mb();
> - __monitor(md, 0, 0);
> - mb();
> - __mwait(eax, 0);
> -
> - if (READ_ONCE(md->control) == CPUDEAD_MWAIT_KEXEC_HLT) {
> - /*
> - * Kexec is about to happen. Don't go back into mwait() as
> - * the kexec kernel might overwrite text and data including
> - * page tables and stack. So mwait() would resume when the
> - * monitor cache line is written to and then the CPU goes
> - * south due to overwritten text, page tables and stack.
> - *
> - * Note: This does _NOT_ protect against a stray MCE, NMI,
> - * SMI. They will resume execution at the instruction
> - * following the HLT instruction and run into the problem
> - * which this is trying to prevent.
> - */
> - WRITE_ONCE(md->status, CPUDEAD_MWAIT_KEXEC_HLT);
> - while(1)
> - native_halt();
> - }
> - }
> + mwait_play_dead_with_hint(eax);
> }
>
> /*
> --
> 2.47.0
>
>