linux-kernel - Re: [PATCH v2 2/3] x86/smp: Allow forcing the mwait hint for play dead loop

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAJZ5v0j+Qc+5PtPdsDy5B0iAGWOxYbKdUOkVmL_jPNVO8fNK=g@mail.gmail.com>
Date: Wed, 30 Oct 2024 20:53:53 +0100
From: "Rafael J. Wysocki" <rafael@...nel.org>
To: Dave Hansen <dave.hansen@...el.com>
Cc: Artem Bityutskiy <artem.bityutskiy@...ux.intel.com>, 
	Patryk Wlazlyn <patryk.wlazlyn@...ux.intel.com>, x86@...nel.org, 
	linux-kernel@...r.kernel.org, linux-pm@...r.kernel.org, 
	rafael.j.wysocki@...el.com, len.brown@...el.com, dave.hansen@...ux.intel.com
Subject: Re: [PATCH v2 2/3] x86/smp: Allow forcing the mwait hint for play
 dead loop

On Wed, Oct 30, 2024 at 8:32 PM Dave Hansen <dave.hansen@...el.com> wrote:
>
> On 10/30/24 02:58, Artem Bityutskiy wrote:
> > On Tue, 2024-10-29 at 11:30 -0700, Dave Hansen wrote:
> > 1. Could we at least set up a few rules here?  Like, say what the hints
> > are, what values can they have?
> >
> > The hints are 8-bit values, lower 4 bits define "sub-state", higher 4 bits
> > define the state.
> >
> > The state value (higher 4 bits) correspond to the state enumerated by CPUID leaf
> > 5 (Value 0 is C0, value 1 is C1, etc). The sub-state value is an opaque number.
> >
> > The hint is provided to the mwait instruction via EAX.
>
> OK, so can you distill that down to something succinct and get it in a
> comment above the new function, please?
>
> > 2. Where do they come from?
> >
> > Hardware C-states are defined by the specific platform (e.g., C1, C1E, CC6,
> > PC6). Then they are "mapped" to the SDM C-states (C0, C1, C2, etc). The specific
> > platform defines the hint values.
> >
> > Intel typically provides the hint values in the EDS (External Design
> > Specification) document. It is typically non-public.
> >
> > Intel also discloses the hint values for open-source projects like Linux, and
> > then Intel engineers submit them to the intel_idle driver.
> >
> > Some of the hints may also be found via ACPI _CST table.
>
> What about the mwait_play_dead() loop that calculates the hint?  Doesn't
> that derive the hint from CPUID?
>
> > 3. Can this get called more than once?
> >
> > It is not supposed to. The idea is that if a driver like intel_idle is used, it
> > can call 'set_mwait_play_dead_hint()' and provide the most optimal hint number
> > for the offline code.
>
> There are two important nuggets in there:
>
> First, an idle driver can but is not required to set the hint.  This
> would be good comment material.
>
> Second, only one thing is supposed to set the hint.  This is a good
> thing to WARN() about.
>
> > 4. Does it _need_ to be set?
> >
> > No. It is more of an optimization. But it is an important optimization which may
> > result in saving a lot of money in a datacenter.
> >
> > Typically using a "wrong" hint value is non-fatal, at least I did not see it
> > being fatal so far. The CPU will map it to some hardware C-state request, but
> > which one - depends on the "wrong" value and the CPU. It just may be sub-
> > optimal.
>
> OK, so this tells me we *don't* want some kind of:
>
>         WARN_ON(play_dead_mwait_hint == PLAY_DEAD_MWAIT_HINT_UNSET);
>
> warning.
>
> > 5. What's the behavior when it is not set?
> >
> > The offline code will fall-back to the generic non-architectural algorithm,
> > which provides correct results for all server platforms I dealt with since 2017.
> > It should provide the correct hint for most client platforms, as far as I am
> > aware.
> >
> > Sierra Forest Xeon is the first platform where the generic algorithm provides a
> > sub-optimal value 0x21. It is not fatal, just sub-optimal.
>
> What is the non-architectural algorithm?  Which Linux code are you
> referring to?
>
> > Note: I am working with Intel firmware team on having the FW "re-mapping" hint
> > 0x21 to hint 0x23, so that "unaware" Linux kernel also ends up with requesting
> > the deepest C-state for an offline CPU.
>
> That would be great as well.  Thanks for doing that!
>
> > 6. Who is responsible for calling this?
> >
> > The idea for now is that the intel_idle driver calls it.
> >
> > But in theory, in the future, any driver/platform code may call it if it "knows"
> > what's the most optimal hint, I suppose. I do not have a good example though.
>
> So let's look at how this works:
>
> void native_play_dead(void)
> {
> ...
>         mwait_play_dead();
>         if (cpuidle_play_dead())
>                 hlt_play_dead();
> }
>
> This _existing_ code has three different ways of playing dead (in this
> order of preference):
>
>  1. mwait
>  2. cpuidle
>  3. hlt
>
> It has (at least) two different mechanisms for telling which of these to
> call:
>
>   1. mwait has a bunch of built-in logic that will ensure the CPU
>      should use for playing dead.  If not, it does nothing and returns.
>   2. cpuidle_play_dead() (only used by acpi_idle_driver as far as I can
>      tell) will return an error if it does not support playing dead
>
> If 1 and 2 fail, then hlt_play_dead() gets called.
>
> But the really fun part of this is that idle driver is *called* here.

Currently, cpuidle_play_dead() is for the cases when MWAIT is not supported.

> The driver that is also responsible for overriding the mwait hint.

So no, intel_idle is not called there because it only uses MWAIT.

> So this series opts to have the boot code plumb the hint back into a
> basically undocumented global variable while also assuming that the
> system is *going* to use mwait.  It then does *nothing* with the
> callback just adjacent to the code it wants to modify.
>
> Seems rather spaghetti-like to me.
>
> To make it worse, go look at da6fa7ef67f0 ("x86/smpboot: Don't use
> mwait_play_dead() on AMD systems").  It hacks AMD-specific code in
> mwait_play_dead() just to force the cpuidle code to get called.
>
> What if we did this?  First, introduce a helper:
>
>         bool mwait_play_dead_with_hint(u32 hint)
>
> and then restructure native_play_dead() to look like this:
>
> static mwait_play_dead_generic(void)
> {
>         u32 hint = get_deepest_mwait_hint();
>
>         return mwait_play_dead_with_hint(hint);
> }
>
> void native_play_dead(void)
> {
>         bool used;
>
>         used = cpuidle_play_dead();
>         if (used)
>                 return;
>
>         used = mwait_play_dead_generic();
>         if (used)
>                 return;
>
>         hlt_play_dead();
> }
>
> If the cpuidle drivers want to use mwait with a different hint, they
> override the *EXISTING* drv->states[].enter_dead() functionality and
> call mwait_play_dead_with_hint() with their new hint.  Then they don't
> need to pass anything _over_ to the mwait code.
>
> Wouldn't something like that makes this all much more straightforward?

Well, except for one detail which is this beautiful thing in mwait_play_dead():

if (READ_ONCE(md->control) == CPUDEAD_MWAIT_KEXEC_HLT) {
    /*
     * Kexec is about to happen. Don't go back into mwait() as
     * the kexec kernel might overwrite text and data including
     * page tables and stack. So mwait() would resume when the
     * monitor cache line is written to and then the CPU goes
     * south due to overwritten text, page tables and stack.
     *
     * Note: This does _NOT_ protect against a stray MCE, NMI,
     * SMI. They will resume execution at the instruction
     * following the HLT instruction and run into the problem
     * which this is trying to prevent.
     */
    WRITE_ONCE(md->status, CPUDEAD_MWAIT_KEXEC_HLT);
    while(1)
         native_halt();

clearly referred to as a kexec() hack, which cannot be done in
cpuidle_play_dead() because the cpuidle driver doesn't know how to get
to md->control.

And even if it did, it is kind of not its business to deal with this stuff.