Message-ID: <35946efe3b8b8b686ba4ea0ed5c9f15c50ca6ef8.camel@linux.intel.com>
Date: Wed, 30 Oct 2024 11:58:12 +0200
From: Artem Bityutskiy <artem.bityutskiy@...ux.intel.com>
To: Dave Hansen <dave.hansen@...el.com>, Patryk Wlazlyn
<patryk.wlazlyn@...ux.intel.com>, x86@...nel.org
Cc: linux-kernel@...r.kernel.org, linux-pm@...r.kernel.org,
rafael.j.wysocki@...el.com, len.brown@...el.com, dave.hansen@...ux.intel.com
Subject: Re: [PATCH v2 2/3] x86/smp: Allow forcing the mwait hint for play
dead loop
On Tue, 2024-10-29 at 11:30 -0700, Dave Hansen wrote:
> On 10/29/24 03:15, Patryk Wlazlyn wrote:
> > +void smp_set_mwait_play_dead_hint(unsigned int hint)
> > +{
> > + WRITE_ONCE(play_dead_mwait_hint, hint);
> > +}
>
> This all feels a bit hacky and unstructured to me.
>
> Could we at least set up a few rules here? Like, say what the hints
> are, what values can they have? Where do they come from? Can this get
> called more than once? Does it _need_ to be set? What's the behavior
> when it is not set? Who is responsible for calling this?
>
> What good does the smp_ prefix do? I don't think _callers_ care whether
> this is getting optimized out or not.
>
The goal of 'get_deepest_mwait_hint()' is to find the mwait hint of the deepest
available C-state, in order to request it for the offline CPU. On Intel CPUs,
the C-states and their mwait hint values are platform-specific.
There is no architectural way to enumerate mwait hints on Intel CPUs. In the
idle path (as opposed to the CPU offline path), idle drivers (if enabled)
enumerate and request C-states using either ACPI mechanisms or a compiled-in,
per-platform custom C-states table, provided by Intel for specific platforms.
In the CPU offline path, only the deepest C-state hint is needed. Historically,
it was determined using a simple algorithm, which happened to provide the
correct result on most Intel platforms. This algorithm is based on scanning
CPUID leaf 5 EDX bits and building the hint value from the C-state and sub-state
numbers.
Generally speaking, mwait hints are opaque numbers, and the algorithm is not
architectural. While it produces the correct result for most Intel CPUs, it
produces a sub-optimal result for some. For example, on the Intel Sierra Forest
Xeon CPU the algorithm produces hint 0x21, while the actual deepest C-state hint
is 0x23. If hint 0x21 is used, the offline CPU does not enter the deepest
available C-state. While this is not fatal, the CPU ends up saving less energy
than it could.
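To make the generic algorithm concrete, here is a minimal user-space sketch of
the EDX scan. It is modeled on my reading of the loop in mwait_play_dead() and
is not the kernel code itself. The helper name 'legacy_deepest_hint()' and the
sample EDX value in the usage note are made up for illustration; the sample is
not the real CPUID output of any platform.

```c
#include <stdint.h>

#define MWAIT_SUBSTATE_SIZE 4    /* each C-state gets one 4-bit field in EDX */
#define MWAIT_SUBSTATE_MASK 0xfu

/*
 * Hypothetical helper: derive an mwait hint from a CPUID leaf 5 EDX value.
 * Each 4-bit field in EDX holds the number of sub-states of one C-state.
 * The lowest field (C0) is skipped, then the highest non-empty field wins.
 */
static inline uint32_t legacy_deepest_hint(uint32_t edx)
{
	uint32_t highest_cstate = 0;
	uint32_t highest_subcstate = 0;
	unsigned int i;

	edx >>= MWAIT_SUBSTATE_SIZE;	/* skip the C0 field */
	for (i = 0; i < 7 && edx; i++, edx >>= MWAIT_SUBSTATE_SIZE) {
		if (edx & MWAIT_SUBSTATE_MASK) {
			highest_cstate = i;
			highest_subcstate = edx & MWAIT_SUBSTATE_MASK;
		}
	}

	/* Nothing enumerated beyond C0: fall back to hint 0 in this sketch. */
	if (!highest_subcstate)
		return 0;

	/* Upper 4 bits: state index. Lower 4 bits: last sub-state index. */
	return (highest_cstate << MWAIT_SUBSTATE_SIZE) | (highest_subcstate - 1);
}
```

With the made-up input 0x2220 (three fields above C0, two sub-states each), the
helper returns 0x21. The scan can only count sub-states, so it has no way to
arrive at a hint like 0x23 when the enumerated count is 2.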
The 'set_mwait_play_dead_hint()' function provides a mechanism for defining the
mwait hint for the offline CPU, and can be used for platforms where the generic
non-architectural algorithm provides a sub-optimal result.
Q&A.
1. Could we at least set up a few rules here? Like, say what the hints
are, what values can they have?
The hints are 8-bit values: the lower 4 bits define the "sub-state", the upper
4 bits define the state.
The state value (upper 4 bits) corresponds to the state enumerated by CPUID leaf
5 (value 0 is C0, value 1 is C1, etc). The sub-state value is an opaque number.
The hint is provided to the mwait instruction via EAX.
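The bit layout described above can be sketched with a few helpers. The macro
and function names below are hypothetical and exist only for this illustration;
they are not taken from the patch.

```c
#include <stdint.h>

/* Hypothetical accessors for the two 4-bit fields of an mwait hint. */
#define HINT_STATE(hint)	(((hint) >> 4) & 0xfu)	/* upper 4 bits */
#define HINT_SUBSTATE(hint)	((hint) & 0xfu)		/* lower 4 bits */

/* Pack a state field and a sub-state field into an 8-bit hint. */
static inline uint8_t hint_make(unsigned int state, unsigned int substate)
{
	return (uint8_t)(((state & 0xfu) << 4) | (substate & 0xfu));
}
```

For the two Sierra Forest values mentioned earlier, 0x21 and 0x23 share the
same state field (2) and differ only in the sub-state field (1 versus 3).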
2. Where do they come from?
Hardware C-states are defined by the specific platform (e.g., C1, C1E, CC6,
PC6). Then they are "mapped" to the SDM C-states (C0, C1, C2, etc). The specific
platform defines the hint values.
Intel typically provides the hint values in the EDS (External Design
Specification) document. It is typically non-public.
Intel also discloses the hint values for open-source projects like Linux, and
then Intel engineers submit them to the intel_idle driver.
Some of the hints may also be found via the ACPI _CST object.
3. Can this get called more than once?
It is not supposed to. The idea is that if a driver like intel_idle is used, it
can call 'set_mwait_play_dead_hint()' and provide the optimal hint value for the
offline code.
4. Does it _need_ to be set?
No. It is more of an optimization. But it is an important optimization which may
result in saving a lot of money in a datacenter.
Typically, using a "wrong" hint value is non-fatal; at least I have not seen it
be fatal so far. The CPU will map it to some hardware C-state request, but which
one depends on the "wrong" value and the CPU. The result may just be
sub-optimal.
5. What's the behavior when it is not set?
The offline code will fall back to the generic non-architectural algorithm,
which provides correct results for all server platforms I have dealt with since
2017. It should provide the correct hint for most client platforms, as far as I
am aware.
Sierra Forest Xeon is the first platform where the generic algorithm provides a
sub-optimal value 0x21. It is not fatal, just sub-optimal.
Note: I am working with the Intel firmware team on having the firmware "re-map"
hint 0x21 to hint 0x23, so that an "unaware" Linux kernel also ends up
requesting the deepest C-state for an offline CPU.
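The "set once, otherwise fall back" behavior from answers 3 to 5 can be modeled
in user space roughly as follows. This is a sketch under assumptions, not the
patch: the sentinel 'PLAY_DEAD_HINT_UNSET' and the helper
'pick_play_dead_hint()' are hypothetical, and a plain store stands in for the
kernel's WRITE_ONCE().

```c
#include <stdint.h>

/* Hypothetical sentinel meaning "no driver has provided a hint". */
#define PLAY_DEAD_HINT_UNSET	UINT32_MAX

static uint32_t play_dead_mwait_hint = PLAY_DEAD_HINT_UNSET;

/* Called by a driver that knows the platform's deepest C-state hint. */
static inline void set_mwait_play_dead_hint(uint32_t hint)
{
	play_dead_mwait_hint = hint;	/* the kernel patch uses WRITE_ONCE() */
}

/*
 * Hypothetical helper for the offline path: prefer the driver-provided hint,
 * otherwise use the value computed by the generic CPUID-based algorithm.
 */
static inline uint32_t pick_play_dead_hint(uint32_t generic_hint)
{
	if (play_dead_mwait_hint != PLAY_DEAD_HINT_UNSET)
		return play_dead_mwait_hint;
	return generic_hint;
}
```

Before any call to the setter, pick_play_dead_hint(0x21) returns 0x21. After
set_mwait_play_dead_hint(0x23), the same call returns 0x23.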
6. Who is responsible for calling this?
The idea for now is that the intel_idle driver calls it.
But in theory, in the future, any driver/platform code may call it if it "knows"
the optimal hint, I suppose. I do not have a good example though.
Artem.