[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Wed, 9 Sep 2020 08:49:35 -0500
From: Tom Lendacky <thomas.lendacky@....com>
To: Laszlo Ersek <lersek@...hat.com>, Ard Biesheuvel <ardb@...nel.org>,
Borislav Petkov <bp@...en8.de>, brijesh.singh@....com
Cc: Joerg Roedel <joro@...tes.org>, X86 ML <x86@...nel.org>,
Joerg Roedel <jroedel@...e.de>,
"H. Peter Anvin" <hpa@...or.com>,
Andy Lutomirski <luto@...nel.org>,
Dave Hansen <dave.hansen@...ux.intel.com>,
Peter Zijlstra <peterz@...radead.org>,
Jiri Slaby <jslaby@...e.cz>,
Dan Williams <dan.j.williams@...el.com>,
Juergen Gross <jgross@...e.com>,
Kees Cook <keescook@...omium.org>,
David Rientjes <rientjes@...gle.com>,
Cfir Cohen <cfir@...gle.com>,
Erdem Aktas <erdemaktas@...gle.com>,
Masami Hiramatsu <mhiramat@...nel.org>,
Mike Stunes <mstunes@...are.com>,
Sean Christopherson <sean.j.christopherson@...el.com>,
Martin Radev <martin.b.radev@...il.com>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
kvm@...r.kernel.org, virtualization@...ts.linux-foundation.org
Subject: Re: [PATCH v7 71/72] x86/efi: Add GHCB mappings when SEV-ES is active
On 9/9/20 7:44 AM, Laszlo Ersek wrote:
> On 09/09/20 10:27, Ard Biesheuvel wrote:
>> (adding Laszlo and Brijesh)
>>
>> On Tue, 8 Sep 2020 at 20:46, Borislav Petkov <bp@...en8.de> wrote:
>>>
>>> + Ard so that he can ack the efi bits.
>>>
>>> On Mon, Sep 07, 2020 at 03:16:12PM +0200, Joerg Roedel wrote:
>>>> From: Tom Lendacky <thomas.lendacky@....com>
>>>>
>>>> Calling down to EFI runtime services can result in the firmware
>>>> performing VMGEXIT calls. The firmware is likely to use the GHCB of
>>>> the OS (e.g., for setting EFI variables),
>
> I've had to stare at this for a while.
>
> Because, normally a VMGEXIT is supposed to occur like this:
>
> - guest does something privileged
> - resultant non-automatic exit (NAE) injects a #VC exception
> - exception handler figures out what that privileged thing was
> - exception handler submits request to hypervisor via GHCB contents plus
> VMGEXIT instruction
>
> Point being, the agent that "owns" the exception handler is supposed to
> pre-allocate or otherwise provide the GHCB too, for information passing.
>
> So... what is the particular NAE that occurs during the execution of
> UEFI runtime services (at OS runtime)?
>
> And assuming it occurs, I'm unsure if the exception handler (IDT) at
> that point is owned (temporarily?) by the firmware.
>
> - If the #VC handler comes from the firmware, then I don't know why it
> would use the OS's GHCB.
>
> - If the #VC handler comes from the OS, then I don't understand why the
> commit message says "firmware performing VMGEXIT", given that in this
> case it would be the OS's #VC handler executing VMGEXIT.
>
> So, I think the above commit message implies a VMGEXIT *without* a NAE /
> #VC context. (Because, I fail to interpret the commit message in a NAE /
> #VC context in any way; see above.)
Correct.
>
> OK, so let's see where the firmware performs a VMGEXIT *outside* of an
> exception handler, *while* at OS runtime. There seems to be one, in file
> "OvmfPkg/QemuFlashFvbServicesRuntimeDxe/QemuFlashDxe.c":
Again, correct. Basically this is what is invoked when setting UEFI variables.
>
>> VOID
>> QemuFlashPtrWrite (
>> IN volatile UINT8 *Ptr,
>> IN UINT8 Value
>> )
>> {
>> if (MemEncryptSevEsIsEnabled ()) {
>> MSR_SEV_ES_GHCB_REGISTER Msr;
>> GHCB *Ghcb;
>>
>> Msr.GhcbPhysicalAddress = AsmReadMsr64 (MSR_SEV_ES_GHCB);
>> Ghcb = Msr.Ghcb;
>>
>> //
>> // Writing to flash is emulated by the hypervisor through the use of write
>> // protection. This won't work for an SEV-ES guest because the write won't
>> // be recognized as a true MMIO write, which would result in the required
>> // #VC exception. Instead, use the the VMGEXIT MMIO write support directly
>> // to perform the update.
>> //
>> VmgInit (Ghcb);
>> Ghcb->SharedBuffer[0] = Value;
>> Ghcb->SaveArea.SwScratch = (UINT64) (UINTN) Ghcb->SharedBuffer;
>> VmgExit (Ghcb, SVM_EXIT_MMIO_WRITE, (UINT64) (UINTN) Ptr, 1);
>> VmgDone (Ghcb);
>> } else {
>> *Ptr = Value;
>> }
>> }
>
> This function *does* run at OS runtime (as a part of non-volatile UEFI
> variable writes).
>
> And note that, wherever MSR_SEV_ES_GHCB points to at the moment, is used
> as GHCB.
>
> If the guest kernel allocates its own GHCB and writes the allocation
> address to MSR_SEV_ES_GHCB, then indeed the firmware will use the GHCB
> of the OS.
>
> I reviewed edk2 commit 437eb3f7a8db
> ("OvmfPkg/QemuFlashFvbServicesRuntimeDxe: Bypass flash detection with
> SEV-ES", 2020-08-17), but I admit I never thought of the guest OS
> changing MSR_SEV_ES_GHCB. I'm sorry about that.
>
> As long as this driver is running before OS runtime (i.e., during the
> DXE and BDS phases), MSR_SEV_ES_GHCB is supposed to carry the value we
> set in "OvmfPkg/PlatformPei/AmdSev.c":
>
>> STATIC
>> VOID
>> AmdSevEsInitialize (
>> VOID
>> )
>> {
>> VOID *GhcbBase;
>> PHYSICAL_ADDRESS GhcbBasePa;
>> UINTN GhcbPageCount, PageCount;
>> RETURN_STATUS PcdStatus, DecryptStatus;
>> IA32_DESCRIPTOR Gdtr;
>> VOID *Gdt;
>>
>> if (!MemEncryptSevEsIsEnabled ()) {
>> return;
>> }
>>
>> PcdStatus = PcdSetBoolS (PcdSevEsIsEnabled, TRUE);
>> ASSERT_RETURN_ERROR (PcdStatus);
>>
>> //
>> // Allocate GHCB and per-CPU variable pages.
>> // Since the pages must survive across the UEFI to OS transition
>> // make them reserved.
>> //
>> GhcbPageCount = mMaxCpuCount * 2;
>> GhcbBase = AllocateReservedPages (GhcbPageCount);
>> ASSERT (GhcbBase != NULL);
>>
>> GhcbBasePa = (PHYSICAL_ADDRESS)(UINTN) GhcbBase;
>>
>> //
>> // Each vCPU gets two consecutive pages, the first is the GHCB and the
>> // second is the per-CPU variable page. Loop through the allocation and
>> // only clear the encryption mask for the GHCB pages.
>> //
>> for (PageCount = 0; PageCount < GhcbPageCount; PageCount += 2) {
>> DecryptStatus = MemEncryptSevClearPageEncMask (
>> 0,
>> GhcbBasePa + EFI_PAGES_TO_SIZE (PageCount),
>> 1,
>> TRUE
>> );
>> ASSERT_RETURN_ERROR (DecryptStatus);
>> }
>>
>> ZeroMem (GhcbBase, EFI_PAGES_TO_SIZE (GhcbPageCount));
>>
>> PcdStatus = PcdSet64S (PcdGhcbBase, GhcbBasePa);
>> ASSERT_RETURN_ERROR (PcdStatus);
>> PcdStatus = PcdSet64S (PcdGhcbSize, EFI_PAGES_TO_SIZE (GhcbPageCount));
>> ASSERT_RETURN_ERROR (PcdStatus);
>>
>> DEBUG ((DEBUG_INFO,
>> "SEV-ES is enabled, %lu GHCB pages allocated starting at 0x%p\n",
>> (UINT64)GhcbPageCount, GhcbBase));
>>
>> AsmWriteMsr64 (MSR_SEV_ES_GHCB, GhcbBasePa);
>
> So what is the *actual* problem at OS runtime:
>
> - Is it that MSR_SEV_ES_GHCB still points at this PEI-phase *reserved*
> memory allocation (and so when QemuFlashPtrWrite() tries to access it
> during OS runtime, it doesn't have a runtime mapping for it)?
At this point the GHCB MSR points to the OS GHCB, which isn't mapped by
the page tables supplied by the OS and used by UEFI.
>
> - Or is it that the OS actively changes MSR_SEV_ES_GHCB, pointing to a
> memory area that the OS owns -- and *that* area is what
> QemuFlashPtrWrite() cannot access at OS runtime?
Correct.
>
> The first problem statement does *not* seem to apply, given -- again --
> that the commit message says, "firmware is likely to use the GHCB of the
> OS".
>
> So I think the second problem statement must apply.
>
> (I think the "reserved allocation" above is "reserved" only because we
> want to keep the OS out of it around the ExitBootServices() transition.)
>
> Back to the email:
>
> On 09/09/20 10:27, Ard Biesheuvel wrote:
>> On Tue, 8 Sep 2020 at 20:46, Borislav Petkov <bp@...en8.de> wrote:
>>> On Mon, Sep 07, 2020 at 03:16:12PM +0200, Joerg Roedel wrote:
>>>> so each GHCB in the system needs to be identity
>>>> mapped in the EFI page tables, as unencrypted, to avoid page faults.
>
> Not sure I agree about this, but at least it seems to confirm my
> understanding -- apparently the idea is, for the OS, to satisfy
> QemuFlashPtrWrite() in the firmware, by putting the "expected" mapping
> -- for wherever MSR_SEV_ES_GHCB is going to point to -- in place.
>
>>>>
>>>> Signed-off-by: Tom Lendacky <thomas.lendacky@....com>
>>>> [ jroedel@...e.de: Moved GHCB mapping loop to sev-es.c ]
>>>> Signed-off-by: Joerg Roedel <jroedel@...e.de>
>>
>>
>> This looks like it is papering over a more fundamental issue: any
>> memory region that the firmware itself needs to access during the
>> execution of runtime services needs to be described in the UEFI memory
>> map, with the appropriate annotations so that the OS knows it should
>> include these in the EFI runtime page tables. So why has this been
>> omitted in this case?
>
> So yeah, the issue seems to be that the QemuFlashFvbServicesRuntimeDxe
> driver does not *own* the GHCB that it attempts to use at OS runtime. It
> doesn't know where MSR_SEV_ES_GHCB is going to point.
>
> Is QemuFlashFvbServicesRuntimeDxe permitted to change MSR_SEV_ES_GHCB
> *temporarily* at OS runtime?
>
> Because, in that case:
>
> - QemuFlashFvbServicesRuntimeDxe should allocate a Runtime Services Data
> block for GHCB when it starts up (if SEV-ES is active),
>
> - QemuFlashFvbServicesRuntimeDxe should register a SetVirtualAddressMap
> handler, and use EfiConvertPointer() from UefiRuntimeLib to convert
> the "runtime GHCB" address to virtual address, in that handler,
>
> - QemuFlashPtrWrite() should call EfiAtRuntime() from UefiRuntimeLib,
> and if the latter returns TRUE, then (a) use the runtime-converted
> address for populating the GHCB, and (b) temporarily swap
> MSR_SEV_ES_GHCB with the address of the self-allocated GHCB. (The MSR
> needs a *physical* address, so QemuFlashFvbServicesRuntimeDxe would
> have to remember / retain the original (physical) allocation address
> too.)
>
> If QemuFlashFvbServicesRuntimeDxe is not permitted to change
> MSR_SEV_ES_GHCB even temporarily (at OS runtime), then I think the
> approach proposed in this (guest kernel) patch is valid.
>
> Let me skim the code below...
>
>>
>>
>>
>>>> ---
>>>> arch/x86/boot/compressed/sev-es.c | 1 +
>>>> arch/x86/include/asm/sev-es.h | 2 ++
>>>> arch/x86/kernel/sev-es.c | 30 ++++++++++++++++++++++++++++++
>>>> arch/x86/platform/efi/efi_64.c | 10 ++++++++++
>>>> 4 files changed, 43 insertions(+)
>>>>
>>>> diff --git a/arch/x86/boot/compressed/sev-es.c b/arch/x86/boot/compressed/sev-es.c
>>>> index 45702b866c33..0a9a248ca33d 100644
>>>> --- a/arch/x86/boot/compressed/sev-es.c
>>>> +++ b/arch/x86/boot/compressed/sev-es.c
>>>> @@ -12,6 +12,7 @@
>>>> */
>>>> #include "misc.h"
>>>>
>>>> +#include <asm/pgtable_types.h>
>>>> #include <asm/sev-es.h>
>>>> #include <asm/trapnr.h>
>>>> #include <asm/trap_pf.h>
>>>> diff --git a/arch/x86/include/asm/sev-es.h b/arch/x86/include/asm/sev-es.h
>>>> index e919f09ae33c..cf1d957c7091 100644
>>>> --- a/arch/x86/include/asm/sev-es.h
>>>> +++ b/arch/x86/include/asm/sev-es.h
>>>> @@ -102,11 +102,13 @@ static __always_inline void sev_es_nmi_complete(void)
>>>> if (static_branch_unlikely(&sev_es_enable_key))
>>>> __sev_es_nmi_complete();
>>>> }
>>>> +extern int __init sev_es_efi_map_ghcbs(pgd_t *pgd);
>>>> #else
>>>> static inline void sev_es_ist_enter(struct pt_regs *regs) { }
>>>> static inline void sev_es_ist_exit(void) { }
>>>> static inline int sev_es_setup_ap_jump_table(struct real_mode_header *rmh) { return 0; }
>>>> static inline void sev_es_nmi_complete(void) { }
>>>> +static inline int sev_es_efi_map_ghcbs(pgd_t *pgd) { return 0; }
>>>> #endif
>>>>
>>>> #endif
>>>> diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
>>>> index 9ab3a4dfecd8..4e2b7e4d9b87 100644
>>>> --- a/arch/x86/kernel/sev-es.c
>>>> +++ b/arch/x86/kernel/sev-es.c
>>>> @@ -491,6 +491,36 @@ int sev_es_setup_ap_jump_table(struct real_mode_header *rmh)
>>>> return 0;
>>>> }
>>>>
>>>> +/*
>>>> + * This is needed by the OVMF UEFI firmware which will use whatever it finds in
>>>> + * the GHCB MSR as its GHCB to talk to the hypervisor. So make sure the per-cpu
>>>> + * runtime GHCBs used by the kernel are also mapped in the EFI page-table.
>
> Yup, this pretty much confirms my suspicion that QemuFlashPtrWrite() is
> at the center of this.
>
> (BTW, I don't think that the runtime services data allocation, in
> QemuFlashFvbServicesRuntimeDxe, for OS runtime GHCB purposes, would have
> to be "per CPU". Refer to "Table 35. Rules for Reentry Into Runtime
> Services" in the UEFI spec -- if one processor is executing
> SetVariable(), then no other processor must enter SetVariable(). And so
> we'll have *at most* one VCPU in QemuFlashPtrWrite(), at any time.)
>
>>>> + */
>>>> +int __init sev_es_efi_map_ghcbs(pgd_t *pgd)
>>>> +{
>>>> + struct sev_es_runtime_data *data;
>>>> + unsigned long address, pflags;
>>>> + int cpu;
>>>> + u64 pfn;
>>>> +
>>>> + if (!sev_es_active())
>>>> + return 0;
>>>> +
>>>> + pflags = _PAGE_NX | _PAGE_RW;
>>>> +
>>>> + for_each_possible_cpu(cpu) {
>>>> + data = per_cpu(runtime_data, cpu);
>>>> +
>>>> + address = __pa(&data->ghcb_page);
>>>> + pfn = address >> PAGE_SHIFT;
>>>> +
>>>> + if (kernel_map_pages_in_pgd(pgd, pfn, address, 1, pflags))
>>>> + return 1;
>>>> + }
>>>> +
>>>> + return 0;
>>>> +}
>>>> +
>>>> static enum es_result vc_handle_msr(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
>>>> {
>>>> struct pt_regs *regs = ctxt->regs;
>>>> diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
>>>> index 6af4da1149ba..8f5759df7776 100644
>>>> --- a/arch/x86/platform/efi/efi_64.c
>>>> +++ b/arch/x86/platform/efi/efi_64.c
>>>> @@ -47,6 +47,7 @@
>>>> #include <asm/realmode.h>
>>>> #include <asm/time.h>
>>>> #include <asm/pgalloc.h>
>>>> +#include <asm/sev-es.h>
>>>>
>>>> /*
>>>> * We allocate runtime services regions top-down, starting from -4G, i.e.
>>>> @@ -229,6 +230,15 @@ int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
>>>> return 1;
>>>> }
>>>>
>>>> + /*
>>>> + * When SEV-ES is active, the GHCB as set by the kernel will be used
>>>> + * by firmware. Create a 1:1 unencrypted mapping for each GHCB.
>>>> + */
>>>> + if (sev_es_efi_map_ghcbs(pgd)) {
>>>> + pr_err("Failed to create 1:1 mapping for the GHCBs!\n");
>>>> + return 1;
>>>> + }
>>>> +
>>>> /*
>>>> * When making calls to the firmware everything needs to be 1:1
>>>> * mapped and addressable with 32-bit pointers. Map the kernel
>
> Good point!
>
> And it even makes me wonder if the QemuFlashFvbServicesRuntimeDxe
> approach, with the runtime services data type memory allocation, is
> feasible at all. Namely, a page's encryption status, under SEV, is
> controlled through the PTE.
>
> And for this particular UEFI runtime area, it would *not* suffice for
> the OS to just virt-map it. The OS would also have to *decrypt* the area
> (mark the PTE as "plaintext").
>
> In other words, it would be an "unprecedented" PTE for the OS to set up:
> the PTE would not only map the GVA to GPA, but also mark the area as
> "plaintext".
>
> Otherwise -- if the OS covers *just* the virt-mapping --,
> QemuFlashFvbServicesRuntimeDxe would populate its own "runtime GHCB"
> area just fine, but the actual data hitting the host RAM would be
> encrypted. And so the hypervisor could not interpret the GHCB.
>
> *If* QemuFlashFvbServicesRuntimeDxe should not change the kernel-owned
> PTE at runtime, even temporarily, for marking the GHCB as "plaintext",
> then the problem is indeed only solvable in the guest kernel, in my
> opinion.
>
> There simply isn't an "architected annotation" for telling the kernel,
> "virt-map this runtime services data type memory range, *and* mark it as
> plaintext at the same time".
>
> This would be necessary, as both actions affect the exact same PTE, and
> the firmware is not really allowed to touch the PTE at runtime. But we
> don't have such a hint.
>
>
> To summarize: for QemuFlashFvbServicesRuntimeDxe to allocate UEFI
> Runtime Services Data type memory, for its own runtime GHCB, two
> permissions are necessary (together), at OS runtime:
>
> - QemuFlashFvbServicesRuntimeDxe must be allowed to swap MSR_SEV_ES_GHCB
> temporarily (before executing VMGEXIT),
>
> - QemuFlashFvbServicesRuntimeDxe must be allowed to change the OS-owned
> PTE temporarily (for remapping the GHCB as plaintext, before writing
> to it).
>
Amazing summarization Laszlo!
Thanks,
Tom
> Thanks
> Laszlo
>
Powered by blists - more mailing lists