[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <aUq9_cUDWeEW_qli@google.com>
Date: Tue, 23 Dec 2025 08:06:35 -0800
From: Sean Christopherson <seanjc@...gle.com>
To: Naveen N Rao <naveen@...nel.org>
Cc: Paolo Bonzini <pbonzini@...hat.com>, kvm@...r.kernel.org, linux-kernel@...r.kernel.org,
Peter Gonda <pgonda@...gle.com>, Michael Roth <michael.roth@....com>,
Vishal Annapurve <vannapurve@...gle.com>, Ackerly Tng <ackerleytng@...gle.com>,
Nikunj A Dadhania <nikunj@....com>, Tom Lendacky <thomas.lendacky@....com>
Subject: Re: [PATCH 01/22] KVM: x86: Disallow read-only memslots for SEV-ES
and SEV-SNP (and TDX)
On Wed, Dec 03, 2025, Naveen N Rao wrote:
> Hi Sean,
>
> On Fri, Aug 09, 2024 at 12:02:58PM -0700, Sean Christopherson wrote:
> > Disallow read-only memslots for SEV-{ES,SNP} VM types, as KVM can't
> > directly emulate instructions for ES/SNP, and instead the guest must
> > explicitly request emulation. Unless the guest explicitly requests
> > emulation without accessing memory, ES/SNP relies on KVM creating an MMIO
> > SPTE, with the subsequent #NPF being reflected into the guest as a #VC.
> >
> > But for read-only memslots, KVM deliberately doesn't create MMIO SPTEs,
> > because except for ES/SNP, doing so requires setting reserved bits in the
> > SPTE, i.e. the SPTE can't be readable while also generating a #VC on
> > writes. Because KVM never creates MMIO SPTEs and jumps directly to
> > emulation, the guest never gets a #VC. And since KVM simply resumes the
> > guest if ES/SNP guests trigger emulation, KVM effectively puts the vCPU
> > into an infinite #NPF loop if the vCPU attempts to write read-only memory.
> >
> > Disallow read-only memory for all VMs with protected state, i.e. for
> > upcoming TDX VMs as well as ES/SNP VMs. For TDX, it's actually possible
> > to support read-only memory, as TDX uses EPT Violation #VE to reflect the
> > fault into the guest, e.g. KVM could configure read-only SPTEs with RX
> > protections and SUPPRESS_VE=0. But there is no strong use case for
> > supporting read-only memslots on TDX, e.g. the main historical usage is
> > to emulate option ROMs, but TDX disallows executing from shared memory.
> > And if someone comes along with a legitimate, strong use case, the
> > restriction can always be lifted for TDX.
> >
> > Don't bother trying to retroactively apply the restriction to SEV-ES
> > VMs that are created as type KVM_X86_DEFAULT_VM. Read-only memslots can't
> > possibly work for SEV-ES, i.e. disallowing such memslots is really just
> > means reporting an error to userspace instead of silently hanging vCPUs.
> > Trying to deal with the ordering between KVM_SEV_INIT and memslot creation
> > isn't worth the marginal benefit it would provide userspace.
> >
> > Fixes: 26c44aa9e076 ("KVM: SEV: define VM types for SEV and SEV-ES")
> > Fixes: 1dfe571c12cf ("KVM: SEV: Add initial SEV-SNP support")
> > Cc: Peter Gonda <pgonda@...gle.com>
> > Cc: Michael Roth <michael.roth@....com>
> > Cc: Vishal Annapurve <vannapurve@...gle.com>
> > Cc: Ackerly Tng <ackerleytng@...gle.com>
> > Signed-off-by: Sean Christopherson <seanjc@...gle.com>
> > ---
> > arch/x86/include/asm/kvm_host.h | 2 ++
> > include/linux/kvm_host.h | 7 +++++++
> > virt/kvm/kvm_main.c | 5 ++---
> > 3 files changed, 11 insertions(+), 3 deletions(-)
>
> As discussed in one of the previous PUCK calls, this is causing Qemu to
> throw an error when trying to enable debug-swap for a SEV-ES guest when
> using a pflash drive for OVMF. Sample qemu invocation (*):
> qemu-system-x86_64 ... \
> -drive if=pflash,format=raw,unit=0,file=/path/to/OVMF_CODE.fd,readonly=on \
> -drive if=pflash,format=raw,unit=1,file=/path/to/OVMF_VARS.fd \
> -machine q35,confidential-guest-support=sev0 \
> -object sev-guest,id=sev0,policy=0x5,cbitpos=51,reduced-phys-bits=1,debug-swap=on
>
> This is expected since enabling debug-swap requires use of
> KVM_SEV_INIT2, which implies a VM type of KVM_X86_SEV_ES_VM. However,
> SEV-ES VMs that do not enable any VMSA SEV features (and are hence
> KVM_X86_DEFAULT_VM type) are allowed to continue to launch though they
> are also susceptible to this issue.
>
> One of the suggestions in the call was to consider returning an error to
> userspace instead. Is this close to what you had in mind:
>
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 73cdcbccc89e..19e27ed27e17 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -387,8 +387,10 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> * they can fix it by changing memory to shared, or they can
> * provide a better error.
> */
> - if (r == RET_PF_EMULATE && fault.is_private) {
> - pr_warn_ratelimited("kvm: unexpected emulation request on private memory\n");
> + if (r == RET_PF_EMULATE && (fault.is_private ||
> + (!fault.map_writable && fault.write && vcpu->arch.guest_state_protected))) {
> + if (fault.is_private)
> + pr_warn_ratelimited("kvm: unexpected emulation request on private memory\n");
> kvm_mmu_prepare_memory_fault_exit(vcpu, &fault);
> return -EFAULT;
> }
>
> This seems to work though Qemu seems to think we are asking it to
> convert the memory to shared (so we probably need to signal this error
> some other way?):
> qemu-system-x86_64: Convert non guest_memfd backed memory region (0xf0000 ,+ 0x1000) to shared
>
> Thoughts?
The choke point would be kvm_handle_error_pfn() (see below), where the RET_PF_EMULATE
originates. But looking at all of this again, I am opposed to changing KVM's
ABI to allow KVM_MEM_READONLY for SEV-ES guests, it simply can't work. And KVM
enumerates as much.
case KVM_CAP_READONLY_MEM:
r = kvm ? kvm_arch_has_readonly_mem(kvm) : 1;
break;
More importantly, if QEMU wants to provide a not-fully-functional configuration
to allow KVM_SEV_INIT2 with pflash, QEMU can fudge around the lack of read-only
memory without KVM's assistance. It likely won't be pretty, but it's doable,
by clearing PROT_WRITE in the backing VMA that's handed to the KVM memslot.
KVM will see a normal memslot that the guest can read/execute, and if the guest
attempts to write to the memory, hva_to_pfn() will return KVM_PFN_RR_FAULT and
kvm_handle_error_pfn() will send that out to userspace as -EFAULT.
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f17324546900..27dc909b8225 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3493,8 +3493,12 @@ static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fa
* into the spte otherwise read access on readonly gfn also can
* caused mmio page fault and treat it as mmio access.
*/
- if (fault->pfn == KVM_PFN_ERR_RO_FAULT)
+ if (fault->pfn == KVM_PFN_ERR_RO_FAULT) {
+ if (kvm_arch_has_readonly_mem(vcpu->kvm))
+ return -EFAULT;
+
return RET_PF_EMULATE;
+ }
if (fault->pfn == KVM_PFN_ERR_HWPOISON) {
kvm_send_hwpoison_signal(fault->slot, fault->gfn);
Powered by blists - more mailing lists