lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <000fd68e-2b24-4eb3-b2d7-e4856b403212@intel.com>
Date: Wed, 30 Oct 2024 12:32:34 -0700
From: Dave Hansen <dave.hansen@...el.com>
To: Artem Bityutskiy <artem.bityutskiy@...ux.intel.com>,
 Patryk Wlazlyn <patryk.wlazlyn@...ux.intel.com>, x86@...nel.org
Cc: linux-kernel@...r.kernel.org, linux-pm@...r.kernel.org,
 rafael.j.wysocki@...el.com, len.brown@...el.com, dave.hansen@...ux.intel.com
Subject: Re: [PATCH v2 2/3] x86/smp: Allow forcing the mwait hint for play
 dead loop

On 10/30/24 02:58, Artem Bityutskiy wrote:
> On Tue, 2024-10-29 at 11:30 -0700, Dave Hansen wrote:
> 1. Could we at least set up a few rules here?  Like, say what the hints
> are, what values can they have?
> 
> The hints are 8-bit values, lower 4 bits define "sub-state", higher 4 bits
> define the state.
> 
> The state value (higher 4 bits) correspond to the state enumerated by CPUID leaf
> 5 (Value 0 is C0, value 1 is C1, etc). The sub-state value is an opaque number.
> 
> The hint is provided to the mwait instruction via EAX.

OK, so can you distill that down to something succinct and get it in a
comment above the new function, please?

> 2. Where do they come from?
> 
> Hardware C-states are defined by the specific platform (e.g., C1, C1E, CC6,
> PC6). Then they are "mapped" to the SDM C-states (C0, C1, C2, etc). The specific
> platform defines the hint values.
> 
> Intel typically provides the hint values in the EDS (External Design
> Specification) document. It is typically non-public.
> 
> Intel also discloses the hint values for open-source projects like Linux, and
> then Intel engineers submit them to the intel_idle driver.
> 
> Some of the hints may also be found via ACPI _CST table.

What about the mwait_play_dead() loop that calculates the hint?  Doesn't
that derive the hint from CPUID?

> 3. Can this get called more than once?
> 
> It is not supposed to. The idea is that if a driver like intel_idle is used, it
> can call 'set_mwait_play_dead_hint()' and provide the most optimal hint number
> for the offline code.

There are two important nuggets in there:

First, an idle driver can but is not required to set the hint.  This
would be good comment material.

Second, only one thing is supposed to set the hint.  This is a good
thing to WARN() about.

> 4. Does it _need_ to be set?
> 
> No. It is more of an optimization. But it is an important optimization which may
> result in saving a lot of money in a datacenter.
> 
> Typically using a "wrong" hint value is non-fatal, at least I did not see it
> being fatal so far. The CPU will map it to some hardware C-state request, but
> which one - depends on the "wrong" value and the CPU. It just may be sub-
> optimal.

OK, so this tells me we *don't* want some kind of:

	WARN_ON(play_dead_mwait_hint == PLAY_DEAD_MWAIT_HINT_UNSET);

warning.

> 5. What's the behavior when it is not set?
> 
> The offline code will fall-back to the generic non-architectural algorithm,
> which provides correct results for all server platforms I dealt with since 2017.
> It should provide the correct hint for most client platforms, as far as I am
> aware.
> 
> Sierra Forest Xeon is the first platform where the generic algorithm provides a
> sub-optimal value 0x21. It is not fatal, just sub-optimal.

What is the non-architectural algorithm?  Which Linux code are you
referring to?

> Note: I am working with Intel firmware team on having the FW "re-mapping" hint
> 0x21 to hint 0x23, so that "unaware" Linux kernel also ends up with requesting
> the deepest C-state for an offline CPU.

That would be great as well.  Thanks for doing that!

> 6. Who is responsible for calling this?
> 
> The idea for now is that the intel_idle driver calls it.
> 
> But in theory, in the future, any driver/platform code may call it if it "knows"
> what's the most optimal hint, I suppose. I do not have a good example though.

So let's look at how this works:

void native_play_dead(void)
{
...
        mwait_play_dead();
        if (cpuidle_play_dead())
                hlt_play_dead();
}

This _existing_ code has three different ways of playing dead (in this
order of preference):

 1. mwait
 2. cpuidle
 3. hlt

It has (at least) two different mechanisms for telling which of these to
call:

  1. mwait has a bunch of built-in logic that will ensure the CPU
     should use for playing dead.  If not, it does nothing and returns.
  2. cpuidle_play_dead() (only used by acpi_idle_driver as far as I can
     tell) will return an error if it does not support playing dead

If 1 and 2 fail, then hlt_play_dead() gets called.

But the really fun part of this is that idle driver is *called* here.
The driver that is also responsible for overriding the mwait hint.

So this series opts to have the boot code plumb the hint back into a
basically undocumented global variable while also assuming that the
system is *going* to use mwait.  It then does *nothing* with the
callback just adjacent to the code it wants to modify.

Seems rather spaghetti-like to me.

To make it worse, go look at da6fa7ef67f0 ("x86/smpboot: Don't use
mwait_play_dead() on AMD systems").  It hacks AMD-specific code in
mwait_play_dead() just to force the cpuidle code to get called.

What if we did this?  First, introduce a helper:

	bool mwait_play_dead_with_hint(u32 hint)

and then restructure native_play_dead() to look like this:

static mwait_play_dead_generic(void)
{
	u32 hint = get_deepest_mwait_hint();

	return mwait_play_dead_with_hint(hint);
}

void native_play_dead(void)
{
	bool used;

	used = cpuidle_play_dead();
	if (used)
		return;

	used = mwait_play_dead_generic();
	if (used)
		return;

	hlt_play_dead();
}

If the cpuidle drivers want to use mwait with a different hint, they
override the *EXISTING* drv->states[].enter_dead() functionality and
call mwait_play_dead_with_hint() with their new hint.  Then they don't
need to pass anything _over_ to the mwait code.

Wouldn't something like that makes this all much more straightforward?

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ