linux-kernel - Re: [PATCH] x86/microcode: Add an option to reload microcode even if revision is unchanged

Message-ID: <b0b0a1c0-979d-cc54-38e2-d37522aab351@oracle.com>
Date:   Thu, 19 Sep 2019 22:48:00 +0300
From:   Mihai Carabas <mihai.carabas@...cle.com>
To:     Thomas Gleixner <tglx@...utronix.de>,
        Johannes Erdfelt <johannes@...felt.com>
Cc:     "Raj, Ashok" <ashok.raj@...el.com>, Borislav Petkov <bp@...en8.de>,
        Boris Ostrovsky <boris.ostrovsky@...cle.com>,
        "H. Peter Anvin" <hpa@...or.com>, Ingo Molnar <mingo@...hat.com>,
        Jon Grimm <Jon.Grimm@....com>, kanth.ghatraju@...cle.com,
        konrad.wilk@...cle.com, patrick.colp@...cle.com,
        Tom Lendacky <thomas.lendacky@....com>,
        x86-ml <x86@...nel.org>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] x86/microcode: Add an option to reload microcode even if
 revision is unchanged

On 07.09.2019 00:16, Thomas Gleixner wrote:
> On Fri, 6 Sep 2019, Johannes Erdfelt wrote:
>> On Fri, Sep 06, 2019, Thomas Gleixner <tglx@...utronix.de> wrote:
>>> What your customers are asking for is a recipe for disaster. They can
>>> check the safety of late loading forever, it will not magically become safe
>>> because they do so.
>>>
>>> If you want late loading, then the whole approach needs to be reworked from
>>> ground up. You need to make sure that all CPUs are in a safe state,
>>> i.e. where switching of CPU feature bits of all sorts can be done with the
>>> guarantee that no CPU will return to the wrong code path after coming out
>>> of safe state and that any kernel internal state which depends on the
>>> previous set of CPU feature bits has been mopped up and switched over
>>> before CPUs are released.
>>
>> You say that switching of CPU feature bits is problematic, but adding
>> new features should result only in a warning ("x86/CPU: CPU features
>> have changed after loading microcode, but might not take effect.").
>>
>> Removing a CPU feature bit could be problematic. Other than HLE being
>> removed on Haswell (which the kernel shouldn't use anyway), have there
>> been any other cases?
>>
>> I ask because we have successfully used late microcode loading on tens
>> of thousands of hosts. I'm a bit worried to see that there is a push to
>> remove a feature that we currently rely on.
> 
> The point is that you know what's at stake so you can evaluate precisely
> upfront whether that works or not and you have experienced kernel engineers
> on staff who can tell you which kind of ucode change is going to explode in
> your face and which one does not.
> 
> So it's the special case of a large cloud company with experts on staff.
> 
> Now map that to the average user/sysadmin. If we proliferate this, then the
> inevitable consequence will be that those people read about how great that
> is and how it made your customers happy yadayadayada. Now they go and do
> the same thing and guess what happens? It explodes in their face, they send
> bug reports and someone else will send lousy patches to paper over the
> problem. None of this ends on your desk.
> 
> Yes you can surely argue that if you give people a gun then they can shoot
> themselves in the foot. But in that case it's an irresponsible argument
> which just puts your interest above the general rule of not offering
> things which are bound to break in all flavours of wreckage especially in
> the hard to diagnose way.
> 
> So if we want to do late microcode loading in a sane way then there are
> only a few options and none of them exist today:
> 
>   1) Micro-code contains a description of CPUID bits which are going to be
>      exposed after the load. Then the kernel can sanity check whether this
>      changes anything relevant or not. If there is a relevant change it can
>      reject the load and tell the admin that a reboot is required.
> 
>   2) Rework CPUID feature handling so that it can reevaluate and reconfigure
>      the running system safely. There are a lot of things you need for that:
> 
>      A) Introduce a safe state for CPUs to reach which guarantees that none
>         of the CPUs will return from that state via a code path which
>         depends on previous state and might now go the other route with data
>         on the stack which only fits the previous configuration.
> 
>      B) Make all the cpufeature thingies run time switchable. That means
>         that you need to keep quite some code around which is currently init
>         only. That also means that you have to provide backout code for
>         things which set up data corresponding to cpu feature bits and so
>         forth.
> 
> So #2 might be finished in about 20 years from now with the result that
> some of the code paths might simply still have a
> 
>       if (cpufeature_changed())
>       	   panic();
> 
> because there are things which you cannot back out. So the only sane
> solution is to panic. Which is not a solution as it would be much more sane
> to prevent late loading upfront and force people to reboot proper.
> 
> Now #1 is actually a sensible and feasible solution which can be pulled off
> in a reasonably short time frame, avoids all the bound to be ugly and
> failure laden attempts of fixing late loading completely and provides a
> usable and safe solution for joe user, jack admin and the super experts at
> big-cloud corporate.
> 
> That is not requiring any new format of microcode payload, as this can be
> nicely done as a metadata package which comes with the microcode
> payload. So you get the following backwards compatible states:
> 
>    Kernel  metadata      result
> 
>    old     don't care    refuse late load
> 
>    new     No            refuse late load
> 
>    new     Yes           decide based on metadata
> 
> Thoughts?


Internally, we have fixed up multiple corner cases in late microcode
loading. We have written some code to handle new features showing up, but
we know it is a bunch of hacks (it certainly lacks various checks that
need to be done before using the new features). I am going to take
Thomas' suggestion and work on an RFC series.
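
To make option #1 and the compatibility table above concrete, here is a
minimal sketch of how such a check could look. It is not existing kernel
code: struct ucode_metadata, late_load_check(), UCODE_MAX_WORDS and the
"relevant" mask are all hypothetical names, and the metadata layout is an
assumption, since no such metadata format exists today.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UCODE_MAX_WORDS 4	/* hypothetical upper bound on feature words */

/* Hypothetical metadata package shipped alongside the microcode payload. */
struct ucode_metadata {
	size_t nwords;				/* valid entries in cpuid_after[] */
	uint32_t cpuid_after[UCODE_MAX_WORDS];	/* CPUID words exposed after load */
};

enum late_load_verdict {
	LATE_LOAD_REFUSE,	/* tell the admin that a reboot is required */
	LATE_LOAD_ALLOW,	/* no relevant CPUID change, late load may proceed */
};

/*
 * current:  CPUID feature words the running kernel was configured with
 * relevant: per-word mask of the bits the kernel actually depends on
 * md:       metadata for the new microcode, or NULL if none was shipped
 */
enum late_load_verdict late_load_check(const uint32_t *current,
				       const uint32_t *relevant,
				       size_t nwords,
				       const struct ucode_metadata *md)
{
	size_t i;

	assert(current && relevant);

	/* New kernel, no metadata: refuse, as in the table above. */
	if (!md)
		return LATE_LOAD_REFUSE;

	/* Metadata that does not match what we track cannot be trusted. */
	if (nwords > UCODE_MAX_WORDS || md->nwords != nwords)
		return LATE_LOAD_REFUSE;

	/* Any flipped bit the kernel cares about (added or removed) is fatal. */
	for (i = 0; i < nwords; i++) {
		if ((current[i] ^ md->cpuid_after[i]) & relevant[i])
			return LATE_LOAD_REFUSE;
	}

	return LATE_LOAD_ALLOW;
}
```

The "old kernel" row of the table needs no code: a kernel that does not
know about the metadata simply never offers late loading. Treating added
and removed bits alike is a deliberately conservative choice for the
sketch; a real implementation might distinguish the two cases.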

Thank you,
Mihai Carabas