Message-Id: <6bd12fe9-9f95-f995-bc21-f292246a59c6@de.ibm.com>
Date: Tue, 30 Jan 2018 16:33:30 +0100
From: Christian Borntraeger <borntraeger@...ibm.com>
To: Christophe de Dinechin <christophe.de.dinechin@...il.com>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>,
David Woodhouse <dwmw2@...radead.org>,
Arjan van de Ven <arjan@...ux.intel.com>,
Eduardo Habkost <ehabkost@...hat.com>,
KarimAllah Ahmed <karahmed@...zon.de>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Andi Kleen <ak@...ux.intel.com>,
Andrea Arcangeli <aarcange@...hat.com>,
Andy Lutomirski <luto@...nel.org>,
Ashok Raj <ashok.raj@...el.com>,
Asit Mallick <asit.k.mallick@...el.com>,
Borislav Petkov <bp@...e.de>,
Dan Williams <dan.j.williams@...el.com>,
Dave Hansen <dave.hansen@...el.com>,
Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
"H . Peter Anvin" <hpa@...or.com>, Ingo Molnar <mingo@...hat.com>,
Janakarajan Natarajan <Janakarajan.Natarajan@....com>,
Joerg Roedel <joro@...tes.org>,
Jun Nakajima <jun.nakajima@...el.com>,
Laura Abbott <labbott@...hat.com>,
Masami Hiramatsu <mhiramat@...nel.org>,
Paolo Bonzini <pbonzini@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Radim Krčmář <rkrcmar@...hat.com>,
Thomas Gleixner <tglx@...utronix.de>,
Tim Chen <tim.c.chen@...ux.intel.com>,
Tom Lendacky <thomas.lendacky@....com>,
KVM list <kvm@...r.kernel.org>,
the arch/x86 maintainers <x86@...nel.org>,
"Dr. David Alan Gilbert" <dgilbert@...hat.com>
Subject: Re: [RFC,05/10] x86/speculation: Add basic IBRS support
infrastructure
On 01/30/2018 03:56 PM, Christophe de Dinechin wrote:
>
>
>> On 30 Jan 2018, at 15:52, Christian Borntraeger <borntraeger@...ibm.com> wrote:
>>
>>
>>
>> On 01/30/2018 03:46 PM, Christophe de Dinechin wrote:
>>>
>>>
>>>> On 30 Jan 2018, at 13:11, Christian Borntraeger <borntraeger@...ibm.com> wrote:
>>>>
>>>>
>>>>
>>>> On 01/30/2018 01:23 AM, Linus Torvalds wrote:
>>>> [...]
>>>>>
>>>>> So I actually have a _different_ question to the virtualization
>>>>> people. This includes the vmware people, but it also obviously
>>>>> includes the Amazon AWS kind of usage.
>>>>>
>>>>> When you're a hypervisor (whether vmware or Amazon), why do you even
>>>>> end up caring about these things so much? You're protected from
>>>>> meltdown thanks to the virtual environment already having separate
>>>>> page tables. And the "big hammer" approach to spectre would seem to
>>>>> be to just make sure the BTB and RSB are flushed at vmexit time - and
>>>>> even then you might decide that you really want to just move it to
>>>>> vmenter time, and only do it if the VM has changed since last time
>>>>> (per CPU).
>>>>>
>>>>> Why do you even _care_ about the guest, and how it acts wrt Skylake?
>>>>> What you should care about is not so much the guests (which do their
>>>>> own thing) but protect guests from each other, no?
>>>>>
>>>>> So I'm a bit mystified by some of this discussion within the context
>>>>> of virtual machines. I think that is separate from any measures that
>>>>> the guest machine may then decide to partake in.
>>>>>
>>>>> If you are ever going to migrate to Skylake, I think you should just
>>>>> always tell the guests that you're running on Skylake. That way the
>>>>> guests will always assume the worst case situation wrt Spectre.
>>>>>
>>>>> Maybe that mystification comes from me missing something.
>>>>
>>>> I can only speak for KVM, but I think the hypervisor issues come from
>>>> the fact that for migration purposes the hypervisor "lies" to the guest
>>>> in regard to what kind of CPU is running. (it has to lie, see below).
>>>>
>>>> This is to avoid random guest crashes by not announcing features. For
>>>> example if you want to migrate back and forth between a system that
>>>> has AVX512 and another one that has not you must tell the guest that
>>>> AVX512 is not available - even if it runs on the capable system.
>>>>
>>>> To protect against new features the hypervisor only announces features
>>>> that it understands.
>>>> So you essentially start a VM in QEMU of a given CPU type that is
>>>> constructed of a base cpu type plus extra features. Before migration,
>>>> it is checked if the target system can run a guest of the given type -
>>>> otherwise migration is rejected.
>>>>
>>>> The management stack also knows things like baselining - basically
>>>> creating the best possible guest CPU given a set of hosts.
>>>>
>>>> The problem now is: say you have both Broadwell and Skylake hosts.
>>>> What kind of CPU type are you telling your guest? If you claim
>>>> Broadwell but run on Skylake, then you prevent the guest from
>>>> protecting itself, because the guest does not know that it should do
>>>> something special. If you say Skylake, the guest might start using
>>>> features that Broadwell does not understand.
>>>
>>> I believe that Linus’ question was whether it makes sense to defer
>>> the entirety of the protection to the host kernel, although I was a bit
>>> confused by his suggestion to always assume Skylake.
>>>
>>> In other words, is it safe enough to rely on the host kernel countermeasure
>>> to protect guest kernels and their applications? In which case having
>>> the guest believe it runs on Broadwell would not be that problematic.
>>>
>>> Aren’t there enough vmexits on the guest kernel context switch
>>> to enforce protection on its behalf? Even if it’s
>>>
>>> a) some old kernel without mitigation code
>>>
>>> or
>>>
>>> b) some new kernel that thinks it runs on an old CPU and disabled mitigation
>>>
>> I think it is not safe to just protect the host. CPU bound workload in the guest
>> will switch a lot between guest user and guest kernel without triggering an
>> exit.
>
> But that’s only if the guest does not take any page faults. Is it possible to run any
> of the known approaches to spectre and meltdown without ever faulting?
Sure. After you have faulted in everything, you can still flush the cache without refaulting.
And if you do need a fault, it will be a GUEST fault, with no hypervisor involvement.
Everything else would be too slow and is pre-NPT.
> If the workload is not faulting, then it’s reading only stuff it’s allowed to, isn’t it?
The point is: the hypervisor will not try to protect guest userspace against guest kernel space
or against other guest userspaces. This is clearly the task of the guest operating system (you are
also not asking the hypervisor to build a guest KPTI if the guest is too old).
The hypervisor's task is to isolate guests against other guests and against the host.
At the same time the hypervisor will try to _enable_ the guest to also protect itself.
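
To make the CPU-model problem from my earlier mail (quoted above) concrete, here is
a rough sketch of baselining and the pre-migration check. It is only an illustration:
the host names, the feature names, and the helpers baseline() and can_migrate() are
made up for this example and are not the QEMU or libvirt interfaces.

```python
# Hypothetical per-host feature sets; names are illustrative, not real CPUID data.
HOST_FEATURES = {
    "broadwell-host": frozenset({"sse4_2", "avx", "avx2"}),
    "skylake-host": frozenset({"sse4_2", "avx", "avx2", "avx512f"}),
}


def baseline(hosts):
    """Return the features common to all given hosts.

    This is the richest guest CPU that every host in the set can run.
    """
    feature_sets = [HOST_FEATURES[h] for h in hosts]
    return frozenset.intersection(*feature_sets)


def can_migrate(guest_features, target_host):
    """Allow migration only if the target provides every feature
    that was announced to the guest."""
    return guest_features <= HOST_FEATURES[target_host]


# Guest CPU for a mixed pool: AVX512 is hidden even while running on Skylake.
guest = baseline(["broadwell-host", "skylake-host"])
print(sorted(guest))

# A baselined guest can move in both directions.
print(can_migrate(guest, "broadwell-host"), can_migrate(guest, "skylake-host"))

# A guest that was told about everything on Skylake is rejected on Broadwell.
print(can_migrate(HOST_FEATURES["skylake-host"], "broadwell-host"))
```

The sketch only covers features. It says nothing about how the guest learns that it
may actually be running on a Skylake and needs the corresponding mitigation, and
that is exactly the gap discussed in this thread.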