[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <9272406507bca4eb284cd75759d8d9479b937da9.camel@linux.intel.com>
Date: Wed, 06 Nov 2024 10:14:30 +0200
From: Artem Bityutskiy <artem.bityutskiy@...ux.intel.com>
To: Dave Hansen <dave.hansen@...el.com>, Patryk Wlazlyn
<patryk.wlazlyn@...ux.intel.com>, x86@...nel.org
Cc: linux-kernel@...r.kernel.org, linux-pm@...r.kernel.org,
rafael.j.wysocki@...el.com, len.brown@...el.com, dave.hansen@...ux.intel.com
Subject: Re: [PATCH v2 2/3] x86/smp: Allow forcing the mwait hint for play
dead loop
Hi Dave, community,
I'll cover general questions, and let Patryk cover the specific core-related
comments.
On Wed, 2024-10-30 at 12:32 -0700, Dave Hansen wrote:
> On 10/30/24 02:58, Artem Bityutskiy wrote:
> > On Tue, 2024-10-29 at 11:30 -0700, Dave Hansen wrote:
> > 1. Could we at least set up a few rules here? Like, say what the hints
> > are, what values can they have?
> >
> > The hints are 8-bit values, lower 4 bits define "sub-state", higher 4 bits
> > define the state.
> >
> > The state value (higher 4 bits) correspond to the state enumerated by CPUID
> > leaf
> > 5 (Value 0 is C0, value 1 is C1, etc). The sub-state value is an opaque
> > number.
> >
> > The hint is provided to the mwait instruction via EAX.
>
> OK, so can you distill that down to something succinct and get it in a
> comment above the new function, please?
Yes, that was my idea to write a long e-mail to cover your and other reviewer's
questions, and let Patryk turn this into nice comments at the right places.
> > 2. Where do they come from?
> >
> > Hardware C-states are defined by the specific platform (e.g., C1, C1E, CC6,
> > PC6). Then they are "mapped" to the SDM C-states (C0, C1, C2, etc). The
> > specific
> > platform defines the hint values.
> >
> > Intel typically provides the hint values in the EDS (External Design
> > Specification) document. It is typically non-public.
> >
> > Intel also discloses the hint values for open-source projects like Linux,
> > and
> > then Intel engineers submit them to the intel_idle driver.
> >
> > Some of the hints may also be found via ACPI _CST table.
>
> What about the mwait_play_dead() loop that calculates the hint? Doesn't
> that derive the hint from CPUID?
The mwait_play_dead() hint calculation algorithm is the root of the problem. It
calculates a sub-optimal hint for some platforms, such as Sierra Forest Xeon.
So at high level, mwait_play_dead() calculates the hint, saves it in the 'eax'
variable, and then just eeps issuing 'mwait(eax, ecx=0)' with that hint.
So the mwait hint calculation code is between lines [1303] and [1313] (see links
below).
The code basically finds the highest C-state index ('highest_cstate' variable)
and count of sub-states within that C-state ('highest_subcstate' variable), and
forms the mwait hint out of them.
You should also check leaf CPUID Leaf 5 SDM doc for more clarity.
Example #1.
==========
Sapphire Rapids Xeon, where the algorithm calculates the correct hint.
'highest_cstate' ends up being 2 (corresponds C3 in CPUID leaf 5 terms, note
that C0 is ignored), and highest_subcstate ends up being 1 (1 sub-state), so the
mwait hint ends up being 0x20. The math is at line [1311]:
eax = (highest_cstate << MWAIT_SUBSTATE_SIZE) | (highest_subcstate - 1) =
(2 << 4) | (1 - 1) = 0x20
This ends up being the deepest available C-state hint (C6). No problem here.
Code lines.
1303:
https://elixir.bootlin.com/linux/v6.9/source/arch/x86/kernel/smpboot.c#L1303
1313:
https://elixir.bootlin.com/linux/v6.9/source/arch/x86/kernel/smpboot.c#L1313
1311:
https://elixir.bootlin.com/linux/v6.9/source/arch/x86/kernel/smpboot.c#L1311
Exemple #2.
==========
Sierra Forest Xeon, where the algorithm calculates the incorrect hint.
Here is CPUID leaf 5.
$ cpuid -1 -l 5
CPU:
MONITOR/MWAIT (5):
smallest monitor-line size (bytes) = 0x40 (64)
largest monitor-line size (bytes) = 0x40 (64)
enum of Monitor-MWAIT exts supported = true
supports intrs as break-event for MWAIT = true
number of C0 sub C-states using MWAIT = 0x0 (0)
number of C1 sub C-states using MWAIT = 0x2 (2)
number of C2 sub C-states using MWAIT = 0x0 (0)
number of C3 sub C-states using MWAIT = 0x2 (2)
number of C4 sub C-states using MWAIT = 0x0 (0)
number of C5 sub C-states using MWAIT = 0x0 (0)
number of C6 sub C-states using MWAIT = 0x0 (0)
number of C7 sub C-states using MWAIT = 0x0 (0)
Notice that CPUID leaf 5 advertises two C3 sub-states, and with this input the
the algorithm ends up with mwait hint 0x21. This mwait hint is not documented as
supported on Sierra Forest.
The supported Sierra Forest mwait hints are defined in 'intel_idle.c':
https://github.com/torvalds/linux/blob/2e1b3cc9d7f790145a80cb705b168f05dab65df2/drivers/idle/intel_idle.c#L1345
They are 0x0, 0x01, 0x22, 0x23. And 0x23 is the deepest C-state (C6SP).
But offline code ends up issuing 0x21. The platform maps it to a "shallow
version" of core C6, which is not documented and is not supposed to be used.
At any rate, it ends up in a situation when offlining a single CPU will prevent
the platform from ever reaching PC6 state. In PC6 - happens when all CPUs
requested C6SP - all cores are powered off and there are power savings in uncore
(caches, interconnects, memory controller, etc).
Now, as I mentioned, in parallel I am working with Sierra Forest firmware team
to change platform behavior and map hint 0x21 to 0x23 in firmware. As a
workaround.
>
> > 5. What's the behavior when it is not set?
> >
> > The offline code will fall-back to the generic non-architectural algorithm,
> > which provides correct results for all server platforms I dealt with since
> > 2017.
> > It should provide the correct hint for most client platforms, as far as I am
> > aware.
> >
> > Sierra Forest Xeon is the first platform where the generic algorithm
> > provides a
> > sub-optimal value 0x21. It is not fatal, just sub-optimal.
>
> What is the non-architectural algorithm? Which Linux code are you
> referring to?
See above.
The non-architectural part of it is basically this.
The algorithm assumes that in the sub-state part of the mwait hint (4 least
significant bits) always starts with 0 and then next sub-state is always 1, and
there are no gaps. This is not documented in Intel SDM. This happens to work on
majority of Intel platforms, but this is not architectural.
Thank you,
Artem.
Powered by blists - more mailing lists