[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <b0b0a1c0-979d-cc54-38e2-d37522aab351@oracle.com>
Date: Thu, 19 Sep 2019 22:48:00 +0300
From: Mihai Carabas <mihai.carabas@...cle.com>
To: Thomas Gleixner <tglx@...utronix.de>,
Johannes Erdfelt <johannes@...felt.com>
Cc: "Raj, Ashok" <ashok.raj@...el.com>, Borislav Petkov <bp@...en8.de>,
Boris Ostrovsky <boris.ostrovsky@...cle.com>,
"H. Peter Anvin" <hpa@...or.com>, Ingo Molnar <mingo@...hat.com>,
Jon Grimm <Jon.Grimm@....com>, kanth.ghatraju@...cle.com,
konrad.wilk@...cle.com, patrick.colp@...cle.com,
Tom Lendacky <thomas.lendacky@....com>,
x86-ml <x86@...nel.org>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] x86/microcode: Add an option to reload microcode even if
revision is unchanged
La 07.09.2019 00:16, Thomas Gleixner a scris:
> On Fri, 6 Sep 2019, Johannes Erdfelt wrote:
>> On Fri, Sep 06, 2019, Thomas Gleixner <tglx@...utronix.de> wrote:
>>> What your customers are asking for is a receipe for disaster. They can
>>> check the safety of late loading forever, it will not magically become safe
>>> because they do so.
>>>
>>> If you want late loading, then the whole approach needs to be reworked from
>>> ground up. You need to make sure that all CPUs are in a safe state,
>>> i.e. where switching of CPU feature bits of all sorts can be done with the
>>> guarantee that no CPU will return to the wrong code path after coming out
>>> of safe state and that any kernel internal state which depends on the
>>> previous set of CPU feature bits has been mopped up and switched over
>>> before CPUs are released.
>>
>> You say that switching of CPU feature bits is problematic, but adding
>> new features should result only in a warning ("x86/CPU: CPU features
>> have changed after loading microcode, but might not take effect.").
>>
>> Removing a CPU feature bit could be problematic. Other than HLE being
>> removed on Haswell (which the kernel shouldn't use anyway), have there
>> been any other cases?
>>
>> I ask because we have successfully used late microcode loading on tens
>> of thousands of hosts. I'm a bit worried to see that there is a push to
>> remove a feature that we currently rely on.
>
> The point is that you know what's on stake so you can evaluate precisely
> upfront whether that works or not and you have experienced kernel engineers
> on staff who can tell you which kind of ucode change is going to explode in
> your face and which on does not.
>
> So it's the special case of a large cloud company with experts on staff.
>
> Now map that to the average user/sysadmin. If we proliferate this, then the
> inevitable consequence will be that those people read about how great that
> is and how it made your customers happy yadayadayada. Now they go and do
> the same thing and guess what happens? It explodes in their face, they send
> bug reports and someone else will send lousy patches to paper over the
> problem. None of this ends on your desk.
>
> Yes you can surely argue that if you give people a gun then they can shoot
> themself into their foot. But in that case it's a irresponsible argument
> which just put's your interest above the general rule of not offering
> things which are bound to break in all flavours of wreckage especially in
> the hard to diagnose way.
>
> So if we want to do late microcode loading in a sane way then there are
> only a few options and none of them exist today:
>
> 1) Micro-code contains a description of CPUID bits which are going to be
> exposed after the load. Then the kernel can sanity check whether this
> changes anything relevant or not. If there is a relevant change it can
> reject the load and tell the admin that a reboot is required.
>
> 2) Rework CPUID feature handling so that it can reevaluate and reconfigure
> the running system safely. There are a lot of things you need for that:
>
> A) Introduce a safe state for CPUs to reach which guarantees that none
> of the CPUs will return from that state via a code path which
> depends on previous state and might now go the other route with data
> on the stack which only fits the previous configuration.
>
> B) Make all the cpufeature thingies run time switchable. That means
> that you need to keep quite some code around which is currently init
> only. That also means that you have to provide backout code for
> things which set up data corresponding to cpu feature bits and so
> forth.
>
> So #2 might be finished in about 20 years from now with the result that
> some of the code pathes might simply still have a
>
> if (cpufeature_changed())
> panic();
>
> because there are things which you cannot back out. So the only sane
> solution is to panic. Which is not a solution as it would be much more sane
> to prevent late loading upfront and force people to reboot proper.
>
> Now #1 is actually a sensible and feasible solution which can be pulled off
> in a reasonably short time frame, avoids all the bound to be ugly and
> failure laden attempts of fixing late loading completely and provides a
> usable and safe solution for joe user, jack admin and the super experts at
> big-cloud corporate.
>
> That is not requiring any new format of microcode payload, as this can be
> nicely done as a metadata package which comes with the microcode
> payload. So you get the following backwards compatible states:
>
> Kernel metadata result
>
> old don't care refuse late load
>
> new No refuse late load
>
> new Yes decide based on metadata
>
> Thoughts?
Internally, we have fix-up multiple corner cases about the late
microcode loading. We have written some code to handle new features
showing up but we know they are a bunch of hacks (for sure it lacks of
different checks that needs to be done before using the new features). I
am going to take Thomas' suggestion and work on an RFC series.
Thank you,
Mihai Carabas
Powered by blists - more mailing lists