linux-kernel - Re: [PATCH 0/6] x86/microcode: Support for Intel Staging Feature

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <54308373-7867-4b76-be34-63730953f83c@intel.com>
Date: Wed, 2 Apr 2025 10:14:09 -0700
From: Dave Hansen <dave.hansen@...el.com>
To: Colin Mitchell <colinmitchell@...gle.com>
Cc: bp@...en8.de, chang.seok.bae@...el.com, dave.hansen@...ux.intel.com,
 linux-kernel@...r.kernel.org, mingo@...hat.com, tglx@...utronix.de,
 x86@...nel.org
Subject: Re: [PATCH 0/6] x86/microcode: Support for Intel Staging Feature

On 3/26/25 14:29, Colin Mitchell wrote:
>> On 2/28/25 15:23, Dave Hansen wrote:
>> You seem to be saying that you'd rather be (for instance) insecure
>> running old microcode than have the latency blip from a legacy microcode
>> load.
>> What action would you take if a staging-load fails? Retry again a few
>> times? Go back to the CPU vendor and get a new image? Or just ignore it?
> That's correct, but the latency tradeoff scales with the platform specific
> size of the microcode patch. I'd prefer to have a more deterministic
> update path and believe the potential latency blip would be significant
> enough to justify the option.
> 
> Adding configuration would allow me to handle the error as needed.
> A retry loop would be a first step but I could also look to migrate VMs
> off the machine if the platform specific latency blip would negatively 
> affect sensitive guest VMs. While an ideal solution imo would then
> allow me to force legacy loading, I could also settle with it being done
> through a reboot where early boot would already skip staging.

There's a lot to unpack there.

But, for the purposes of this series, I think what's here is fine for
now. Let's leave staging _purely_ as an opportunistic optimization.

If folks want to make this more configurable like making staging
*mandatory* and disabling legacy loading then we'll look at the patches
(and their justifications) as folks submit them. A good justification
would be something along these lines:

	Legacy microcode loading causes a 5,000ms latency blip. Our
	customers have been complaining to us for years about those
	legacy loading blips. Migrating a VM causes a 1ms latency blip.
	Those 4,999ms mean a lot to the folks running those VMs. As a
	CSP, we would like the flexibility to avoid the gigantic legacy
	microcode loading blips because they are bad and getting worse.

It becomes less compelling if it's something like this:

	Legacy microcode loading causes a 50ms latency blip. Migrating a
	VM causes a 49ms latency blip. That millisecond is super
	important.

... and increasingly less so as it becomes:

	We like knobs and flexibility for $REASONS.

You don't have to have down-to-the-millisecond numbers here. Orders of
magnitude are fine. But if you can't demonstrate (or don't anticipate)
orders of magnitude improvement from the knob, then it's probably not
worth it. It better be a 10x improvement, not 10%.