linux-kernel - Re: [PATCH] KVM: VMX: Add a quirk to (not) honor guest PAT on CPUs that support self-snoop

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Z+yBGgoqv3dcgfg6@yzhao56-desk.sh.intel.com>
Date: Wed, 2 Apr 2025 08:13:14 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: Sean Christopherson <seanjc@...gle.com>
CC: Paolo Bonzini <pbonzini@...hat.com>, <kvm@...r.kernel.org>,
	<linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] KVM: VMX: Add a quirk to (not) honor guest PAT on CPUs
 that support self-snoop

On Tue, Apr 01, 2025 at 03:11:07PM -0700, Sean Christopherson wrote:
> Add back support for honoring guest PAT on Intel CPUs that support self-
> snoop (and don't have errata), but guarded by a quirk so as not to break
> existing setups that subtly relied on KVM forcing WB for synthetic
> devices.
> 
> This effectively reverts commit 9d70f3fec14421e793ffbc0ec2f739b24e534900
> and reapplies 377b2f359d1f71c75f8cc352b5c81f2210312d83, but with a quirk.
> 
> Cc: Yan Zhao <yan.y.zhao@...el.com>
> Signed-off-by: Sean Christopherson <seanjc@...gle.com>
> ---
>
Hi Sean,

> AFAIK, we don't have an answer as to whether the slow UC behavior on CLX+
> is working as intended or a CPU flaw, which Paolo was hoping we would get
We did answer the slow UC behavior is working as intended at [1].

"After consulting with CPU architects,
 it's told that this behavior is expected on ICX/SPR Xeon platforms due to
 the snooping implementation."

Paolo then help update the series to v2 [2] /v3 [3].

Did you overlook those series, or is there something I missed?

Thanks
Yan

[1] https://lore.kernel.org/all/20250224070716.31360-1-yan.y.zhao@intel.com/
[2] https://lore.kernel.org/all/20250301073428.2435768-1-pbonzini@redhat.com/
[3] https://lore.kernel.org/all/20250304060647.2903469-1-pbonzini@redhat.com/
> before adding a quirk.  But I don't want to lose sight of honoring guest
> PAT, nor am I particularly inclined to force end users to wait for a
> definitive answer on hardware they may not even care about.


> 
>  Documentation/virt/kvm/api.rst  | 25 +++++++++++++++++++++++++
>  arch/x86/include/asm/kvm_host.h |  3 ++-
>  arch/x86/include/uapi/asm/kvm.h |  1 +
>  arch/x86/kvm/mmu.h              |  2 +-
>  arch/x86/kvm/mmu/mmu.c          | 17 +++++++++++++----
>  arch/x86/kvm/vmx/vmx.c          | 11 +++++++----
>  arch/x86/kvm/x86.c              |  2 +-
>  7 files changed, 50 insertions(+), 11 deletions(-)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 1f8625b7646a..2a1444d99c37 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -8158,6 +8158,31 @@ KVM_X86_QUIRK_STUFF_FEATURE_MSRS    By default, at vCPU creation, KVM sets the
>                                      and 0x489), as KVM does now allow them to
>                                      be set by userspace (KVM sets them based on
>                                      guest CPUID, for safety purposes).
> +
> +KVM_X86_QUIRK_EPT_IGNORE_GUEST_PAT  By default, on Intel CPUs with TDP (EPT)
> +                                    enabled, KVM ignores guest PAT unless the
> +                                    VM has an assigned non-coherent device,
> +                                    even if it is entirely safe/correct for KVM
> +                                    to honor guest PAT.  When this quirk is
> +                                    disabled, and the host CPU fully supports
> +                                    selfsnoop (isn't affected by errata), KVM
> +                                    honors guest PAT for all VMs.
> +
> +                                    The only _known_ issue with honoring guest
> +                                    PAT is when QEMU's Bochs VGA is exposed to
> +                                    a VM on Cascade Lake and later Intel server
> +                                    CPUs, and the guest kernel is running an
> +                                    outdated driver that maps video RAM as UC.
> +                                    Accessing UC memory on the affected Intel
> +                                    CPUs is an order of magnitude slower than
> +                                    previous generations, to the point where
> +                                    the access latency prevents the guest from
> +                                    booting.  This quirk can likely be disabled
> +                                    if the above do not hold true.
> +
> +                                    Note, KVM always honors guest PAT on AMD
> +                                    CPUs when TDP (NPT) is enabled.  KVM never
> +                                    honors guest PAT when TDP is disabled.
>  =================================== ============================================
>  
>  7.32 KVM_CAP_MAX_VCPU_ID
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index a884ab544335..427b906da5cc 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -2409,7 +2409,8 @@ int memslot_rmap_alloc(struct kvm_memory_slot *slot, unsigned long npages);
>  	 KVM_X86_QUIRK_FIX_HYPERCALL_INSN |	\
>  	 KVM_X86_QUIRK_MWAIT_NEVER_UD_FAULTS |	\
>  	 KVM_X86_QUIRK_SLOT_ZAP_ALL |		\
> -	 KVM_X86_QUIRK_STUFF_FEATURE_MSRS)
> +	 KVM_X86_QUIRK_STUFF_FEATURE_MSRS |	\
> +	 KVM_X86_QUIRK_EPT_IGNORE_GUEST_PAT)
>  
>  /*
>   * KVM previously used a u32 field in kvm_run to indicate the hypercall was
> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> index 460306b35a4b..074e2b74e68c 100644
> --- a/arch/x86/include/uapi/asm/kvm.h
> +++ b/arch/x86/include/uapi/asm/kvm.h
> @@ -441,6 +441,7 @@ struct kvm_sync_regs {
>  #define KVM_X86_QUIRK_MWAIT_NEVER_UD_FAULTS	(1 << 6)
>  #define KVM_X86_QUIRK_SLOT_ZAP_ALL		(1 << 7)
>  #define KVM_X86_QUIRK_STUFF_FEATURE_MSRS	(1 << 8)
> +#define KVM_X86_QUIRK_EPT_IGNORE_GUEST_PAT	(1 << 9)
>  
>  #define KVM_STATE_NESTED_FORMAT_VMX	0
>  #define KVM_STATE_NESTED_FORMAT_SVM	1
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 050a0e229a4d..639264635a1a 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -231,7 +231,7 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
>  	return -(u32)fault & errcode;
>  }
>  
> -bool kvm_mmu_may_ignore_guest_pat(void);
> +bool kvm_mmu_may_ignore_guest_pat(struct kvm *kvm);
>  
>  int kvm_mmu_post_init_vm(struct kvm *kvm);
>  void kvm_mmu_pre_destroy_vm(struct kvm *kvm);
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 63bb77ee1bb1..16c64e80d946 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4835,18 +4835,27 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
>  }
>  #endif
>  
> -bool kvm_mmu_may_ignore_guest_pat(void)
> +bool kvm_mmu_may_ignore_guest_pat(struct kvm *kvm)
>  {
>  	/*
> -	 * When EPT is enabled (shadow_memtype_mask is non-zero), and the VM
> +	 * When EPT is enabled (shadow_memtype_mask is non-zero), the CPU does
> +	 * not support self-snoop (or is affected by an erratum), and the VM
>  	 * has non-coherent DMA (DMA doesn't snoop CPU caches), KVM's ABI is to
>  	 * honor the memtype from the guest's PAT so that guest accesses to
>  	 * memory that is DMA'd aren't cached against the guest's wishes.  As a
>  	 * result, KVM _may_ ignore guest PAT, whereas without non-coherent DMA,
> -	 * KVM _always_ ignores guest PAT (when EPT is enabled).
> +	 * KVM _always_ ignores or honors guest PAT, i.e. doesn't toggle SPTE
> +	 * bits in response to non-coherent device (un)registration.
> +	 *
> +	 * Due to an unfortunate confluence of slow hardware, suboptimal guest
> +	 * drivers, and historical use cases, honoring self-snoop and guest PAT
> +	 * is also buried behind a quirk.
>  	 */
> -	return shadow_memtype_mask;
> +	return (!static_cpu_has(X86_FEATURE_SELFSNOOP) ||
> +		kvm_check_has_quirk(kvm, KVM_X86_QUIRK_EPT_IGNORE_GUEST_PAT)) &&
> +	       shadow_memtype_mask;
>  }
> +EXPORT_SYMBOL_GPL(kvm_mmu_may_ignore_guest_pat);
>  
>  int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  {
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index b70ed72c1783..734db162cab3 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7730,11 +7730,14 @@ u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
>  
>  	/*
>  	 * Force WB and ignore guest PAT if the VM does NOT have a non-coherent
> -	 * device attached.  Letting the guest control memory types on Intel
> -	 * CPUs may result in unexpected behavior, and so KVM's ABI is to trust
> -	 * the guest to behave only as a last resort.
> +	 * device attached, and either the CPU doesn't support self-snoop or
> +	 * KVM's quirk to ignore guest PAT is enabled.  Letting the guest
> +	 * control memory types on Intel CPUs without self-snoop may result in
> +	 * unexpected behavior, and so KVM's (historical) ABI is to trust the
> +	 * guest to behave only as a last resort.
>  	 */
> -	if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
> +	if (kvm_mmu_may_ignore_guest_pat(vcpu->kvm) &&
> +	    !kvm_arch_has_noncoherent_dma(vcpu->kvm))
>  		return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
>  
>  	return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT);
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index c841817a914a..4a94eb974f0d 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -13528,7 +13528,7 @@ static void kvm_noncoherent_dma_assignment_start_or_stop(struct kvm *kvm)
>  	 * (or last) non-coherent device is (un)registered to so that new SPTEs
>  	 * with the correct "ignore guest PAT" setting are created.
>  	 */
> -	if (kvm_mmu_may_ignore_guest_pat())
> +	if (kvm_mmu_may_ignore_guest_pat(kvm))
>  		kvm_zap_gfn_range(kvm, gpa_to_gfn(0), gpa_to_gfn(~0ULL));
>  }
>  
> 
> base-commit: 782f9feaa9517caf33186dcdd6b50a8f770ed29b
> -- 
> 2.49.0.504.g3bcea36a83-goog
>