linux-kernel - Re: [PATCH 01/22] KVM: x86: Disallow read-only memslots for SEV-ES and SEV-SNP (and TDX)

Message-ID: <aUq9_cUDWeEW_qli@google.com>
Date: Tue, 23 Dec 2025 08:06:35 -0800
From: Sean Christopherson <seanjc@...gle.com>
To: Naveen N Rao <naveen@...nel.org>
Cc: Paolo Bonzini <pbonzini@...hat.com>, kvm@...r.kernel.org, linux-kernel@...r.kernel.org, 
	Peter Gonda <pgonda@...gle.com>, Michael Roth <michael.roth@....com>, 
	Vishal Annapurve <vannapurve@...gle.com>, Ackerly Tng <ackerleytng@...gle.com>, 
	Nikunj A Dadhania <nikunj@....com>, Tom Lendacky <thomas.lendacky@....com>
Subject: Re: [PATCH 01/22] KVM: x86: Disallow read-only memslots for SEV-ES
 and SEV-SNP (and TDX)

On Wed, Dec 03, 2025, Naveen N Rao wrote:
> Hi Sean,
> 
> On Fri, Aug 09, 2024 at 12:02:58PM -0700, Sean Christopherson wrote:
> > Disallow read-only memslots for SEV-{ES,SNP} VM types, as KVM can't
> > directly emulate instructions for ES/SNP, and instead the guest must
> > explicitly request emulation.  Unless the guest explicitly requests
> > emulation without accessing memory, ES/SNP relies on KVM creating an MMIO
> > SPTE, with the subsequent #NPF being reflected into the guest as a #VC.
> > 
> > But for read-only memslots, KVM deliberately doesn't create MMIO SPTEs,
> > because except for ES/SNP, doing so requires setting reserved bits in the
> > SPTE, i.e. the SPTE can't be readable while also generating a #VC on
> > writes.  Because KVM never creates MMIO SPTEs and jumps directly to
> > emulation, the guest never gets a #VC.  And since KVM simply resumes the
> > guest if ES/SNP guests trigger emulation, KVM effectively puts the vCPU
> > into an infinite #NPF loop if the vCPU attempts to write read-only memory.
> > 
> > Disallow read-only memory for all VMs with protected state, i.e. for
> > upcoming TDX VMs as well as ES/SNP VMs.  For TDX, it's actually possible
> > to support read-only memory, as TDX uses EPT Violation #VE to reflect the
> > fault into the guest, e.g. KVM could configure read-only SPTEs with RX
> > protections and SUPPRESS_VE=0.  But there is no strong use case for
> > supporting read-only memslots on TDX, e.g. the main historical usage is
> > to emulate option ROMs, but TDX disallows executing from shared memory.
> > And if someone comes along with a legitimate, strong use case, the
> > restriction can always be lifted for TDX.
> > 
> > Don't bother trying to retroactively apply the restriction to SEV-ES
> > VMs that are created as type KVM_X86_DEFAULT_VM.  Read-only memslots can't
> > possibly work for SEV-ES, i.e. disallowing such memslots really just
> > means reporting an error to userspace instead of silently hanging vCPUs.
> > Trying to deal with the ordering between KVM_SEV_INIT and memslot creation
> > isn't worth the marginal benefit it would provide userspace.
> > 
> > Fixes: 26c44aa9e076 ("KVM: SEV: define VM types for SEV and SEV-ES")
> > Fixes: 1dfe571c12cf ("KVM: SEV: Add initial SEV-SNP support")
> > Cc: Peter Gonda <pgonda@...gle.com>
> > Cc: Michael Roth <michael.roth@....com>
> > Cc: Vishal Annapurve <vannapurve@...gle.com>
> > Cc: Ackerly Tng <ackerleytng@...gle.com>
> > Signed-off-by: Sean Christopherson <seanjc@...gle.com>
> > ---
> >  arch/x86/include/asm/kvm_host.h | 2 ++
> >  include/linux/kvm_host.h        | 7 +++++++
> >  virt/kvm/kvm_main.c             | 5 ++---
> >  3 files changed, 11 insertions(+), 3 deletions(-)
> 
> As discussed in one of the previous PUCK calls, this is causing QEMU to 
> throw an error when trying to enable debug-swap for a SEV-ES guest when 
> using a pflash drive for OVMF. Sample QEMU invocation (*):
>   qemu-system-x86_64 ... \
>     -drive if=pflash,format=raw,unit=0,file=/path/to/OVMF_CODE.fd,readonly=on \
>     -drive if=pflash,format=raw,unit=1,file=/path/to/OVMF_VARS.fd \
>     -machine q35,confidential-guest-support=sev0 \
>     -object sev-guest,id=sev0,policy=0x5,cbitpos=51,reduced-phys-bits=1,debug-swap=on
> 
> This is expected since enabling debug-swap requires use of 
> KVM_SEV_INIT2, which implies a VM type of KVM_X86_SEV_ES_VM. However, 
> SEV-ES VMs that do not enable any VMSA SEV features (and are hence 
> KVM_X86_DEFAULT_VM type) are allowed to continue to launch though they 
> are also susceptible to this issue.
> 
> One of the suggestions in the call was to consider returning an error to 
> userspace instead. Is this close to what you had in mind:
> 
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 73cdcbccc89e..19e27ed27e17 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -387,8 +387,10 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>          * they can fix it by changing memory to shared, or they can
>          * provide a better error.
>          */
> -       if (r == RET_PF_EMULATE && fault.is_private) {
> -               pr_warn_ratelimited("kvm: unexpected emulation request on private memory\n");
> +       if (r == RET_PF_EMULATE && (fault.is_private ||
> +           (!fault.map_writable && fault.write && vcpu->arch.guest_state_protected))) {
> +               if (fault.is_private)
> +                       pr_warn_ratelimited("kvm: unexpected emulation request on private memory\n");
>                 kvm_mmu_prepare_memory_fault_exit(vcpu, &fault);
>                 return -EFAULT;
>         }
> 
> This seems to work, though QEMU seems to think we are asking it to 
> convert the memory to shared (so we probably need to signal this error 
> some other way?):
>   qemu-system-x86_64: Convert non guest_memfd backed memory region (0xf0000 ,+ 0x1000) to shared
> 
> Thoughts?
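
For readers following along, the condition in the quoted hunk can be read as a
plain predicate.  The sketch below is a userspace model only, not kernel code:
the enum, the struct, and model_do_page_fault() are hypothetical stand-ins, and
only the field names are borrowed from the diff above.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's RET_PF_* values. */
enum model_ret_pf { MODEL_RET_PF_FIXED, MODEL_RET_PF_EMULATE };

/*
 * Hypothetical subset of the fault state the quoted check reads.  In the
 * kernel, guest_state_protected lives in vcpu->arch, not in the fault.
 */
struct model_fault {
	bool is_private;
	bool map_writable;
	bool write;
	bool guest_state_protected;
};

/*
 * Model of the proposed check: an emulation request becomes -EFAULT if the
 * fault is on private memory, or if a protected-state guest writes to memory
 * that is not mapped writable.  Any other result is passed through unchanged.
 */
static int model_do_page_fault(int r, const struct model_fault *f)
{
	bool ro_write = !f->map_writable && f->write &&
			f->guest_state_protected;

	if (r == MODEL_RET_PF_EMULATE && (f->is_private || ro_write))
		return -EFAULT;

	return r;
}
```

In this model, a write by a guest without protected state to non-writable
memory still returns the emulation request, which matches the intent of leaving
ordinary VMs untouched.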

The choke point would be kvm_handle_error_pfn() (see below), where the RET_PF_EMULATE
originates.  But looking at all of this again, I am opposed to changing KVM's
ABI to allow KVM_MEM_READONLY for SEV-ES guests; it simply can't work.  And KVM
enumerates as much.

	case KVM_CAP_READONLY_MEM:
		r = kvm ? kvm_arch_has_readonly_mem(kvm) : 1;
		break;

More importantly, if QEMU wants to provide a not-fully-functional configuration
to allow KVM_SEV_INIT2 with pflash, QEMU can fudge around the lack of read-only
memory without KVM's assistance.  It likely won't be pretty, but it's doable,
by clearing PROT_WRITE in the backing VMA that's handed to the KVM memslot.

KVM will see a normal memslot that the guest can read/execute, and if the guest
attempts to write to the memory, hva_to_pfn() will return KVM_PFN_ERR_FAULT and
kvm_handle_error_pfn() will send that out to userspace as -EFAULT.

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f17324546900..27dc909b8225 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3493,8 +3493,12 @@ static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fa
         * into the spte otherwise read access on readonly gfn also can
         * caused mmio page fault and treat it as mmio access.
         */
-       if (fault->pfn == KVM_PFN_ERR_RO_FAULT)
+       if (fault->pfn == KVM_PFN_ERR_RO_FAULT) {
+               if (kvm_arch_has_readonly_mem(vcpu->kvm))
+                       return -EFAULT;
+
                return RET_PF_EMULATE;
+       }
 
        if (fault->pfn == KVM_PFN_ERR_HWPOISON) {
                kvm_send_hwpoison_signal(fault->slot, fault->gfn);