linux-kernel - Re: [PATCH v7 71/72] x86/efi: Add GHCB mappings when SEV-ES is active

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <e3911fe6-84e8-cb50-d95d-e33f8ae005f8@redhat.com>
Date:   Wed, 9 Sep 2020 14:44:14 +0200
From:   Laszlo Ersek <lersek@...hat.com>
To:     Ard Biesheuvel <ardb@...nel.org>, Borislav Petkov <bp@...en8.de>,
        brijesh.singh@....com
Cc:     Joerg Roedel <joro@...tes.org>, X86 ML <x86@...nel.org>,
        Joerg Roedel <jroedel@...e.de>,
        Tom Lendacky <thomas.lendacky@....com>,
        "H. Peter Anvin" <hpa@...or.com>,
        Andy Lutomirski <luto@...nel.org>,
        Dave Hansen <dave.hansen@...ux.intel.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Jiri Slaby <jslaby@...e.cz>,
        Dan Williams <dan.j.williams@...el.com>,
        Juergen Gross <jgross@...e.com>,
        Kees Cook <keescook@...omium.org>,
        David Rientjes <rientjes@...gle.com>,
        Cfir Cohen <cfir@...gle.com>,
        Erdem Aktas <erdemaktas@...gle.com>,
        Masami Hiramatsu <mhiramat@...nel.org>,
        Mike Stunes <mstunes@...are.com>,
        Sean Christopherson <sean.j.christopherson@...el.com>,
        Martin Radev <martin.b.radev@...il.com>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        kvm@...r.kernel.org, virtualization@...ts.linux-foundation.org
Subject: Re: [PATCH v7 71/72] x86/efi: Add GHCB mappings when SEV-ES is active

On 09/09/20 10:27, Ard Biesheuvel wrote:
> (adding Laszlo and Brijesh)
>
> On Tue, 8 Sep 2020 at 20:46, Borislav Petkov <bp@...en8.de> wrote:
>>
>> + Ard so that he can ack the efi bits.
>>
>> On Mon, Sep 07, 2020 at 03:16:12PM +0200, Joerg Roedel wrote:
>>> From: Tom Lendacky <thomas.lendacky@....com>
>>>
>>> Calling down to EFI runtime services can result in the firmware
>>> performing VMGEXIT calls. The firmware is likely to use the GHCB of
>>> the OS (e.g., for setting EFI variables),

I've had to stare at this for a while.

Because, normally a VMGEXIT is supposed to occur like this:

- guest does something privileged
- resultant non-automatic exit (NAE) injects a #VC exception
- exception handler figures out what that privileged thing was
- exception handler submits request to hypervisor via GHCB contents plus
  VMGEXIT instruction

Point being, the agent that "owns" the exception handler is supposed to
pre-allocate or otherwise provide the GHCB too, for information passing.

So... what is the particular NAE that occurs during the execution of
UEFI runtime services (at OS runtime)?

And assuming it occurs, I'm unsure if the exception handler (IDT) at
that point is owned (temporarily?) by the firmware.

- If the #VC handler comes from the firmware, then I don't know why it
would use the OS's GHCB.

- If the #VC handler comes from the OS, then I don't understand why the
commit message says "firmware performing VMGEXIT", given that in this
case it would be the OS's #VC handler executing VMGEXIT.

So, I think the above commit message implies a VMGEXIT *without* a NAE /
#VC context. (Because, I fail to interpret the commit message in a NAE /
#VC context in any way; see above.)

OK, so let's see where the firmware performs a VMGEXIT *outside* of an
exception handler, *while* at OS runtime. There seems to be one, in file
"OvmfPkg/QemuFlashFvbServicesRuntimeDxe/QemuFlashDxe.c":

> VOID
> QemuFlashPtrWrite (
>   IN        volatile UINT8    *Ptr,
>   IN        UINT8             Value
>   )
> {
>   if (MemEncryptSevEsIsEnabled ()) {
>     MSR_SEV_ES_GHCB_REGISTER  Msr;
>     GHCB                      *Ghcb;
>
>     Msr.GhcbPhysicalAddress = AsmReadMsr64 (MSR_SEV_ES_GHCB);
>     Ghcb = Msr.Ghcb;
>
>     //
>     // Writing to flash is emulated by the hypervisor through the use of write
>     // protection. This won't work for an SEV-ES guest because the write won't
>     // be recognized as a true MMIO write, which would result in the required
>     // #VC exception. Instead, use the the VMGEXIT MMIO write support directly
>     // to perform the update.
>     //
>     VmgInit (Ghcb);
>     Ghcb->SharedBuffer[0] = Value;
>     Ghcb->SaveArea.SwScratch = (UINT64) (UINTN) Ghcb->SharedBuffer;
>     VmgExit (Ghcb, SVM_EXIT_MMIO_WRITE, (UINT64) (UINTN) Ptr, 1);
>     VmgDone (Ghcb);
>   } else {
>     *Ptr = Value;
>   }
> }

This function *does* run at OS runtime (as a part of non-volatile UEFI
variable writes).

And note that, wherever MSR_SEV_ES_GHCB points to at the moment, is used
as GHCB.

If the guest kernel allocates its own GHCB and writes the allocation
address to MSR_SEV_ES_GHCB, then indeed the firmware will use the GHCB
of the OS.

I reviewed edk2 commit 437eb3f7a8db
("OvmfPkg/QemuFlashFvbServicesRuntimeDxe: Bypass flash detection with
SEV-ES", 2020-08-17), but I admit I never thought of the guest OS
changing MSR_SEV_ES_GHCB. I'm sorry about that.

As long as this driver is running before OS runtime (i.e., during the
DXE and BDS phases), MSR_SEV_ES_GHCB is supposed to carry the value we
set in "OvmfPkg/PlatformPei/AmdSev.c":

> STATIC
> VOID
> AmdSevEsInitialize (
>   VOID
>   )
> {
>   VOID              *GhcbBase;
>   PHYSICAL_ADDRESS  GhcbBasePa;
>   UINTN             GhcbPageCount, PageCount;
>   RETURN_STATUS     PcdStatus, DecryptStatus;
>   IA32_DESCRIPTOR   Gdtr;
>   VOID              *Gdt;
>
>   if (!MemEncryptSevEsIsEnabled ()) {
>     return;
>   }
>
>   PcdStatus = PcdSetBoolS (PcdSevEsIsEnabled, TRUE);
>   ASSERT_RETURN_ERROR (PcdStatus);
>
>   //
>   // Allocate GHCB and per-CPU variable pages.
>   //   Since the pages must survive across the UEFI to OS transition
>   //   make them reserved.
>   //
>   GhcbPageCount = mMaxCpuCount * 2;
>   GhcbBase = AllocateReservedPages (GhcbPageCount);
>   ASSERT (GhcbBase != NULL);
>
>   GhcbBasePa = (PHYSICAL_ADDRESS)(UINTN) GhcbBase;
>
>   //
>   // Each vCPU gets two consecutive pages, the first is the GHCB and the
>   // second is the per-CPU variable page. Loop through the allocation and
>   // only clear the encryption mask for the GHCB pages.
>   //
>   for (PageCount = 0; PageCount < GhcbPageCount; PageCount += 2) {
>     DecryptStatus = MemEncryptSevClearPageEncMask (
>       0,
>       GhcbBasePa + EFI_PAGES_TO_SIZE (PageCount),
>       1,
>       TRUE
>       );
>     ASSERT_RETURN_ERROR (DecryptStatus);
>   }
>
>   ZeroMem (GhcbBase, EFI_PAGES_TO_SIZE (GhcbPageCount));
>
>   PcdStatus = PcdSet64S (PcdGhcbBase, GhcbBasePa);
>   ASSERT_RETURN_ERROR (PcdStatus);
>   PcdStatus = PcdSet64S (PcdGhcbSize, EFI_PAGES_TO_SIZE (GhcbPageCount));
>   ASSERT_RETURN_ERROR (PcdStatus);
>
>   DEBUG ((DEBUG_INFO,
>     "SEV-ES is enabled, %lu GHCB pages allocated starting at 0x%p\n",
>     (UINT64)GhcbPageCount, GhcbBase));
>
>   AsmWriteMsr64 (MSR_SEV_ES_GHCB, GhcbBasePa);

So what is the *actual* problem at OS runtime:

- Is it that MSR_SEV_ES_GHCB still points at this PEI-phase *reserved*
  memory allocation (and so when QemuFlashPtrWrite() tries to access it
  during OS runtime, it doesn't have a runtime mapping for it)?

- Or is it that the OS actively changes MSR_SEV_ES_GHCB, pointing to a
  memory area that the OS owns -- and *that* area is what
  QemuFlashPtrWrite() cannot access at OS runtime?

The first problem statement does *not* seem to apply, given -- again --
that the commit message says, "firmware is likely to use the GHCB of the
OS".

So I think the second problem statement must apply.

(I think the "reserved allocation" above is "reserved" only because we
want to keep the OS out of it around the ExitBootServices() transition.)

Back to the email:

On 09/09/20 10:27, Ard Biesheuvel wrote:
> On Tue, 8 Sep 2020 at 20:46, Borislav Petkov <bp@...en8.de> wrote:
>> On Mon, Sep 07, 2020 at 03:16:12PM +0200, Joerg Roedel wrote:
>>> so each GHCB in the system needs to be identity
>>> mapped in the EFI page tables, as unencrypted, to avoid page faults.

Not sure I agree about this, but at least it seems to confirm my
understanding -- apparently the idea is, for the OS, to satisfy
QemuFlashPtrWrite() in the firmware, by putting the "expected" mapping
-- for wherever MSR_SEV_ES_GHCB is going to point to -- in place.

>>>
>>> Signed-off-by: Tom Lendacky <thomas.lendacky@....com>
>>> [ jroedel@...e.de: Moved GHCB mapping loop to sev-es.c ]
>>> Signed-off-by: Joerg Roedel <jroedel@...e.de>
>
>
> This looks like it is papering over a more fundamental issue: any
> memory region that the firmware itself needs to access during the
> execution of runtime services needs to be described in the UEFI memory
> map, with the appropriate annotations so that the OS knows it should
> include these in the EFI runtime page tables. So why has this been
> omitted in this case?

So yeah, the issue seems to be that the QemuFlashFvbServicesRuntimeDxe
driver does not *own* the GHCB that it attempts to use at OS runtime. It
doesn't know where MSR_SEV_ES_GHCB is going to point.

Is QemuFlashFvbServicesRuntimeDxe permitted to change MSR_SEV_ES_GHCB
*temporarily* at OS runtime?

Because, in that case:

- QemuFlashFvbServicesRuntimeDxe should allocate a Runtime Services Data
  block for GHCB when it starts up (if SEV-ES is active),

- QemuFlashFvbServicesRuntimeDxe should register a SetVirtualAddressMap
  handler, and use EfiConvertPointer() from UefiRuntimeLib to convert
  the "runtime GHCB" address to virtual address, in that handler,

- QemuFlashPtrWrite() should call EfiAtRuntime() from UefiRuntimeLib,
  and if the latter returns TRUE, then (a) use the runtime-converted
  address for populating the GHCB, and (b) temporarily swap
  MSR_SEV_ES_GHCB with the address of the self-allocated GHCB. (The MSR
  needs a *physical* address, so QemuFlashFvbServicesRuntimeDxe would
  have to remember / retain the original (physical) allocation address
  too.)

If QemuFlashFvbServicesRuntimeDxe is not permitted to change
MSR_SEV_ES_GHCB even temporarily (at OS runtime), then I think the
approach proposed in this (guest kernel) patch is valid.

Let me skim the code below...

>
>
>
>>> ---
>>>  arch/x86/boot/compressed/sev-es.c |  1 +
>>>  arch/x86/include/asm/sev-es.h     |  2 ++
>>>  arch/x86/kernel/sev-es.c          | 30 ++++++++++++++++++++++++++++++
>>>  arch/x86/platform/efi/efi_64.c    | 10 ++++++++++
>>>  4 files changed, 43 insertions(+)
>>>
>>> diff --git a/arch/x86/boot/compressed/sev-es.c b/arch/x86/boot/compressed/sev-es.c
>>> index 45702b866c33..0a9a248ca33d 100644
>>> --- a/arch/x86/boot/compressed/sev-es.c
>>> +++ b/arch/x86/boot/compressed/sev-es.c
>>> @@ -12,6 +12,7 @@
>>>   */
>>>  #include "misc.h"
>>>
>>> +#include <asm/pgtable_types.h>
>>>  #include <asm/sev-es.h>
>>>  #include <asm/trapnr.h>
>>>  #include <asm/trap_pf.h>
>>> diff --git a/arch/x86/include/asm/sev-es.h b/arch/x86/include/asm/sev-es.h
>>> index e919f09ae33c..cf1d957c7091 100644
>>> --- a/arch/x86/include/asm/sev-es.h
>>> +++ b/arch/x86/include/asm/sev-es.h
>>> @@ -102,11 +102,13 @@ static __always_inline void sev_es_nmi_complete(void)
>>>       if (static_branch_unlikely(&sev_es_enable_key))
>>>               __sev_es_nmi_complete();
>>>  }
>>> +extern int __init sev_es_efi_map_ghcbs(pgd_t *pgd);
>>>  #else
>>>  static inline void sev_es_ist_enter(struct pt_regs *regs) { }
>>>  static inline void sev_es_ist_exit(void) { }
>>>  static inline int sev_es_setup_ap_jump_table(struct real_mode_header *rmh) { return 0; }
>>>  static inline void sev_es_nmi_complete(void) { }
>>> +static inline int sev_es_efi_map_ghcbs(pgd_t *pgd) { return 0; }
>>>  #endif
>>>
>>>  #endif
>>> diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c
>>> index 9ab3a4dfecd8..4e2b7e4d9b87 100644
>>> --- a/arch/x86/kernel/sev-es.c
>>> +++ b/arch/x86/kernel/sev-es.c
>>> @@ -491,6 +491,36 @@ int sev_es_setup_ap_jump_table(struct real_mode_header *rmh)
>>>       return 0;
>>>  }
>>>
>>> +/*
>>> + * This is needed by the OVMF UEFI firmware which will use whatever it finds in
>>> + * the GHCB MSR as its GHCB to talk to the hypervisor. So make sure the per-cpu
>>> + * runtime GHCBs used by the kernel are also mapped in the EFI page-table.

Yup, this pretty much confirms my suspicion that QemuFlashPtrWrite() is
at the center of this.

(BTW, I don't think that the runtime services data allocation, in
QemuFlashFvbServicesRuntimeDxe, for OS runtime GHCB purposes, would have
to be "per CPU". Refer to "Table 35. Rules for Reentry Into Runtime
Services" in the UEFI spec -- if one processor is executing
SetVariable(), then no other processor must enter SetVariable(). And so
we'll have *at most* one VCPU in QemuFlashPtrWrite(), at any time.)

>>> + */
>>> +int __init sev_es_efi_map_ghcbs(pgd_t *pgd)
>>> +{
>>> +     struct sev_es_runtime_data *data;
>>> +     unsigned long address, pflags;
>>> +     int cpu;
>>> +     u64 pfn;
>>> +
>>> +     if (!sev_es_active())
>>> +             return 0;
>>> +
>>> +     pflags = _PAGE_NX | _PAGE_RW;
>>> +
>>> +     for_each_possible_cpu(cpu) {
>>> +             data = per_cpu(runtime_data, cpu);
>>> +
>>> +             address = __pa(&data->ghcb_page);
>>> +             pfn = address >> PAGE_SHIFT;
>>> +
>>> +             if (kernel_map_pages_in_pgd(pgd, pfn, address, 1, pflags))
>>> +                     return 1;
>>> +     }
>>> +
>>> +     return 0;
>>> +}
>>> +
>>>  static enum es_result vc_handle_msr(struct ghcb *ghcb, struct es_em_ctxt *ctxt)
>>>  {
>>>       struct pt_regs *regs = ctxt->regs;
>>> diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
>>> index 6af4da1149ba..8f5759df7776 100644
>>> --- a/arch/x86/platform/efi/efi_64.c
>>> +++ b/arch/x86/platform/efi/efi_64.c
>>> @@ -47,6 +47,7 @@
>>>  #include <asm/realmode.h>
>>>  #include <asm/time.h>
>>>  #include <asm/pgalloc.h>
>>> +#include <asm/sev-es.h>
>>>
>>>  /*
>>>   * We allocate runtime services regions top-down, starting from -4G, i.e.
>>> @@ -229,6 +230,15 @@ int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
>>>               return 1;
>>>       }
>>>
>>> +     /*
>>> +      * When SEV-ES is active, the GHCB as set by the kernel will be used
>>> +      * by firmware. Create a 1:1 unencrypted mapping for each GHCB.
>>> +      */
>>> +     if (sev_es_efi_map_ghcbs(pgd)) {
>>> +             pr_err("Failed to create 1:1 mapping for the GHCBs!\n");
>>> +             return 1;
>>> +     }
>>> +
>>>       /*
>>>        * When making calls to the firmware everything needs to be 1:1
>>>        * mapped and addressable with 32-bit pointers. Map the kernel

Good point!

And it even makes me wonder if the QemuFlashFvbServicesRuntimeDxe
approach, with the runtime services data type memory allocation, is
feasible at all. Namely, a page's encryption status, under SEV, is
controlled through the PTE.

And for this particular UEFI runtime area, it would *not* suffice for
the OS to just virt-map it. The OS would also have to *decrypt* the area
(mark the PTE as "plaintext").

In other words, it would be an "unprecedented" PTE for the OS to set up:
the PTE would not only map the GVA to GPA, but also mark the area as
"plaintext".

Otherwise -- if the OS covers *just* the virt-mapping --,
QemuFlashFvbServicesRuntimeDxe would populate its own "runtime GHCB"
area just fine, but the actual data hitting the host RAM would be
encrypted. And so the hypervisor could not interpret the GHCB.

*If* QemuFlashFvbServicesRuntimeDxe should not change the kernel-owned
PTE at runtime, even temporarily, for marking the GHCB as "plaintext",
then the problem is indeed only solvable in the guest kernel, in my
opinion.

There simply isn't an "architected annotation" for telling the kernel,
"virt-map this runtime services data type memory range, *and* mark it as
plaintext at the same time".

This would be necessary, as both actions affect the exact same PTE, and
the firmware is not really allowed to touch the PTE at runtime. But we
don't have such a hint.


To summarize: for QemuFlashFvbServicesRuntimeDxe to allocate UEFI
Runtime Services Data type memory, for its own runtime GHCB, two
permissions are necessary (together), at OS runtime:

- QemuFlashFvbServicesRuntimeDxe must be allowed to swap MSR_SEV_ES_GHCB
  temporarily (before executing VMGEXIT),

- QemuFlashFvbServicesRuntimeDxe must be allowed to change the OS-owned
  PTE temporarily (for remapping the GHCB as plaintext, before writing
  to it).

Thanks
Laszlo