[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <000fd68e-2b24-4eb3-b2d7-e4856b403212@intel.com>
Date: Wed, 30 Oct 2024 12:32:34 -0700
From: Dave Hansen <dave.hansen@...el.com>
To: Artem Bityutskiy <artem.bityutskiy@...ux.intel.com>,
Patryk Wlazlyn <patryk.wlazlyn@...ux.intel.com>, x86@...nel.org
Cc: linux-kernel@...r.kernel.org, linux-pm@...r.kernel.org,
rafael.j.wysocki@...el.com, len.brown@...el.com, dave.hansen@...ux.intel.com
Subject: Re: [PATCH v2 2/3] x86/smp: Allow forcing the mwait hint for play
dead loop
On 10/30/24 02:58, Artem Bityutskiy wrote:
> On Tue, 2024-10-29 at 11:30 -0700, Dave Hansen wrote:
> 1. Could we at least set up a few rules here? Like, say what the hints
> are, what values can they have?
>
> The hints are 8-bit values, lower 4 bits define "sub-state", higher 4 bits
> define the state.
>
> The state value (higher 4 bits) correspond to the state enumerated by CPUID leaf
> 5 (Value 0 is C0, value 1 is C1, etc). The sub-state value is an opaque number.
>
> The hint is provided to the mwait instruction via EAX.
OK, so can you distill that down to something succinct and get it in a
comment above the new function, please?
> 2. Where do they come from?
>
> Hardware C-states are defined by the specific platform (e.g., C1, C1E, CC6,
> PC6). Then they are "mapped" to the SDM C-states (C0, C1, C2, etc). The specific
> platform defines the hint values.
>
> Intel typically provides the hint values in the EDS (External Design
> Specification) document. It is typically non-public.
>
> Intel also discloses the hint values for open-source projects like Linux, and
> then Intel engineers submit them to the intel_idle driver.
>
> Some of the hints may also be found via ACPI _CST table.
What about the mwait_play_dead() loop that calculates the hint? Doesn't
that derive the hint from CPUID?
> 3. Can this get called more than once?
>
> It is not supposed to. The idea is that if a driver like intel_idle is used, it
> can call 'set_mwait_play_dead_hint()' and provide the most optimal hint number
> for the offline code.
There are two important nuggets in there:
First, an idle driver can but is not required to set the hint. This
would be good comment material.
Second, only one thing is supposed to set the hint. This is a good
thing to WARN() about.
> 4. Does it _need_ to be set?
>
> No. It is more of an optimization. But it is an important optimization which may
> result in saving a lot of money in a datacenter.
>
> Typically using a "wrong" hint value is non-fatal, at least I did not see it
> being fatal so far. The CPU will map it to some hardware C-state request, but
> which one - depends on the "wrong" value and the CPU. It just may be sub-
> optimal.
OK, so this tells me we *don't* want some kind of:
WARN_ON(play_dead_mwait_hint == PLAY_DEAD_MWAIT_HINT_UNSET);
warning.
> 5. What's the behavior when it is not set?
>
> The offline code will fall-back to the generic non-architectural algorithm,
> which provides correct results for all server platforms I dealt with since 2017.
> It should provide the correct hint for most client platforms, as far as I am
> aware.
>
> Sierra Forest Xeon is the first platform where the generic algorithm provides a
> sub-optimal value 0x21. It is not fatal, just sub-optimal.
What is the non-architectural algorithm? Which Linux code are you
referring to?
> Note: I am working with Intel firmware team on having the FW "re-mapping" hint
> 0x21 to hint 0x23, so that "unaware" Linux kernel also ends up with requesting
> the deepest C-state for an offline CPU.
That would be great as well. Thanks for doing that!
> 6. Who is responsible for calling this?
>
> The idea for now is that the intel_idle driver calls it.
>
> But in theory, in the future, any driver/platform code may call it if it "knows"
> what's the most optimal hint, I suppose. I do not have a good example though.
So let's look at how this works:
void native_play_dead(void)
{
...
mwait_play_dead();
if (cpuidle_play_dead())
hlt_play_dead();
}
This _existing_ code has three different ways of playing dead (in this
order of preference):
1. mwait
2. cpuidle
3. hlt
It has (at least) two different mechanisms for telling which of these to
call:
1. mwait has a bunch of built-in logic that will ensure the CPU
should use for playing dead. If not, it does nothing and returns.
2. cpuidle_play_dead() (only used by acpi_idle_driver as far as I can
tell) will return an error if it does not support playing dead
If 1 and 2 fail, then hlt_play_dead() gets called.
But the really fun part of this is that idle driver is *called* here.
The driver that is also responsible for overriding the mwait hint.
So this series opts to have the boot code plumb the hint back into a
basically undocumented global variable while also assuming that the
system is *going* to use mwait. It then does *nothing* with the
callback just adjacent to the code it wants to modify.
Seems rather spaghetti-like to me.
To make it worse, go look at da6fa7ef67f0 ("x86/smpboot: Don't use
mwait_play_dead() on AMD systems"). It hacks AMD-specific code in
mwait_play_dead() just to force the cpuidle code to get called.
What if we did this? First, introduce a helper:
bool mwait_play_dead_with_hint(u32 hint)
and then restructure native_play_dead() to look like this:
static mwait_play_dead_generic(void)
{
u32 hint = get_deepest_mwait_hint();
return mwait_play_dead_with_hint(hint);
}
void native_play_dead(void)
{
bool used;
used = cpuidle_play_dead();
if (used)
return;
used = mwait_play_dead_generic();
if (used)
return;
hlt_play_dead();
}
If the cpuidle drivers want to use mwait with a different hint, they
override the *EXISTING* drv->states[].enter_dead() functionality and
call mwait_play_dead_with_hint() with their new hint. Then they don't
need to pass anything _over_ to the mwait code.
Wouldn't something like that makes this all much more straightforward?
Powered by blists - more mailing lists