[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Z+yBGgoqv3dcgfg6@yzhao56-desk.sh.intel.com>
Date: Wed, 2 Apr 2025 08:13:14 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: Sean Christopherson <seanjc@...gle.com>
CC: Paolo Bonzini <pbonzini@...hat.com>, <kvm@...r.kernel.org>,
<linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] KVM: VMX: Add a quirk to (not) honor guest PAT on CPUs
that support self-snoop
On Tue, Apr 01, 2025 at 03:11:07PM -0700, Sean Christopherson wrote:
> Add back support for honoring guest PAT on Intel CPUs that support self-
> snoop (and don't have errata), but guarded by a quirk so as not to break
> existing setups that subtly relied on KVM forcing WB for synthetic
> devices.
>
> This effectively reverts commit 9d70f3fec14421e793ffbc0ec2f739b24e534900
> and reapplies 377b2f359d1f71c75f8cc352b5c81f2210312d83, but with a quirk.
>
> Cc: Yan Zhao <yan.y.zhao@...el.com>
> Signed-off-by: Sean Christopherson <seanjc@...gle.com>
> ---
>
Hi Sean,
> AFAIK, we don't have an answer as to whether the slow UC behavior on CLX+
> is working as intended or a CPU flaw, which Paolo was hoping we would get
We did answer the slow UC behavior is working as intended at [1].
"After consulting with CPU architects,
it's told that this behavior is expected on ICX/SPR Xeon platforms due to
the snooping implementation."
Paolo then help update the series to v2 [2] /v3 [3].
Did you overlook those series, or is there something I missed?
Thanks
Yan
[1] https://lore.kernel.org/all/20250224070716.31360-1-yan.y.zhao@intel.com/
[2] https://lore.kernel.org/all/20250301073428.2435768-1-pbonzini@redhat.com/
[3] https://lore.kernel.org/all/20250304060647.2903469-1-pbonzini@redhat.com/
> before adding a quirk. But I don't want to lose sight of honoring guest
> PAT, nor am I particularly inclined to force end users to wait for a
> definitive answer on hardware they may not even care about.
>
> Documentation/virt/kvm/api.rst | 25 +++++++++++++++++++++++++
> arch/x86/include/asm/kvm_host.h | 3 ++-
> arch/x86/include/uapi/asm/kvm.h | 1 +
> arch/x86/kvm/mmu.h | 2 +-
> arch/x86/kvm/mmu/mmu.c | 17 +++++++++++++----
> arch/x86/kvm/vmx/vmx.c | 11 +++++++----
> arch/x86/kvm/x86.c | 2 +-
> 7 files changed, 50 insertions(+), 11 deletions(-)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 1f8625b7646a..2a1444d99c37 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -8158,6 +8158,31 @@ KVM_X86_QUIRK_STUFF_FEATURE_MSRS By default, at vCPU creation, KVM sets the
> and 0x489), as KVM does now allow them to
> be set by userspace (KVM sets them based on
> guest CPUID, for safety purposes).
> +
> +KVM_X86_QUIRK_EPT_IGNORE_GUEST_PAT By default, on Intel CPUs with TDP (EPT)
> + enabled, KVM ignores guest PAT unless the
> + VM has an assigned non-coherent device,
> + even if it is entirely safe/correct for KVM
> + to honor guest PAT. When this quirk is
> + disabled, and the host CPU fully supports
> + selfsnoop (isn't affected by errata), KVM
> + honors guest PAT for all VMs.
> +
> + The only _known_ issue with honoring guest
> + PAT is when QEMU's Bochs VGA is exposed to
> + a VM on Cascade Lake and later Intel server
> + CPUs, and the guest kernel is running an
> + outdated driver that maps video RAM as UC.
> + Accessing UC memory on the affected Intel
> + CPUs is an order of magnitude slower than
> + previous generations, to the point where
> + the access latency prevents the guest from
> + booting. This quirk can likely be disabled
> + if the above do not hold true.
> +
> + Note, KVM always honors guest PAT on AMD
> + CPUs when TDP (NPT) is enabled. KVM never
> + honors guest PAT when TDP is disabled.
> =================================== ============================================
>
> 7.32 KVM_CAP_MAX_VCPU_ID
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index a884ab544335..427b906da5cc 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -2409,7 +2409,8 @@ int memslot_rmap_alloc(struct kvm_memory_slot *slot, unsigned long npages);
> KVM_X86_QUIRK_FIX_HYPERCALL_INSN | \
> KVM_X86_QUIRK_MWAIT_NEVER_UD_FAULTS | \
> KVM_X86_QUIRK_SLOT_ZAP_ALL | \
> - KVM_X86_QUIRK_STUFF_FEATURE_MSRS)
> + KVM_X86_QUIRK_STUFF_FEATURE_MSRS | \
> + KVM_X86_QUIRK_EPT_IGNORE_GUEST_PAT)
>
> /*
> * KVM previously used a u32 field in kvm_run to indicate the hypercall was
> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> index 460306b35a4b..074e2b74e68c 100644
> --- a/arch/x86/include/uapi/asm/kvm.h
> +++ b/arch/x86/include/uapi/asm/kvm.h
> @@ -441,6 +441,7 @@ struct kvm_sync_regs {
> #define KVM_X86_QUIRK_MWAIT_NEVER_UD_FAULTS (1 << 6)
> #define KVM_X86_QUIRK_SLOT_ZAP_ALL (1 << 7)
> #define KVM_X86_QUIRK_STUFF_FEATURE_MSRS (1 << 8)
> +#define KVM_X86_QUIRK_EPT_IGNORE_GUEST_PAT (1 << 9)
>
> #define KVM_STATE_NESTED_FORMAT_VMX 0
> #define KVM_STATE_NESTED_FORMAT_SVM 1
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 050a0e229a4d..639264635a1a 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -231,7 +231,7 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> return -(u32)fault & errcode;
> }
>
> -bool kvm_mmu_may_ignore_guest_pat(void);
> +bool kvm_mmu_may_ignore_guest_pat(struct kvm *kvm);
>
> int kvm_mmu_post_init_vm(struct kvm *kvm);
> void kvm_mmu_pre_destroy_vm(struct kvm *kvm);
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 63bb77ee1bb1..16c64e80d946 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4835,18 +4835,27 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
> }
> #endif
>
> -bool kvm_mmu_may_ignore_guest_pat(void)
> +bool kvm_mmu_may_ignore_guest_pat(struct kvm *kvm)
> {
> /*
> - * When EPT is enabled (shadow_memtype_mask is non-zero), and the VM
> + * When EPT is enabled (shadow_memtype_mask is non-zero), the CPU does
> + * not support self-snoop (or is affected by an erratum), and the VM
> * has non-coherent DMA (DMA doesn't snoop CPU caches), KVM's ABI is to
> * honor the memtype from the guest's PAT so that guest accesses to
> * memory that is DMA'd aren't cached against the guest's wishes. As a
> * result, KVM _may_ ignore guest PAT, whereas without non-coherent DMA,
> - * KVM _always_ ignores guest PAT (when EPT is enabled).
> + * KVM _always_ ignores or honors guest PAT, i.e. doesn't toggle SPTE
> + * bits in response to non-coherent device (un)registration.
> + *
> + * Due to an unfortunate confluence of slow hardware, suboptimal guest
> + * drivers, and historical use cases, honoring self-snoop and guest PAT
> + * is also buried behind a quirk.
> */
> - return shadow_memtype_mask;
> + return (!static_cpu_has(X86_FEATURE_SELFSNOOP) ||
> + kvm_check_has_quirk(kvm, KVM_X86_QUIRK_EPT_IGNORE_GUEST_PAT)) &&
> + shadow_memtype_mask;
> }
> +EXPORT_SYMBOL_GPL(kvm_mmu_may_ignore_guest_pat);
>
> int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> {
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index b70ed72c1783..734db162cab3 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7730,11 +7730,14 @@ u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
>
> /*
> * Force WB and ignore guest PAT if the VM does NOT have a non-coherent
> - * device attached. Letting the guest control memory types on Intel
> - * CPUs may result in unexpected behavior, and so KVM's ABI is to trust
> - * the guest to behave only as a last resort.
> + * device attached, and either the CPU doesn't support self-snoop or
> + * KVM's quirk to ignore guest PAT is enabled. Letting the guest
> + * control memory types on Intel CPUs without self-snoop may result in
> + * unexpected behavior, and so KVM's (historical) ABI is to trust the
> + * guest to behave only as a last resort.
> */
> - if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
> + if (kvm_mmu_may_ignore_guest_pat(vcpu->kvm) &&
> + !kvm_arch_has_noncoherent_dma(vcpu->kvm))
> return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
>
> return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT);
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index c841817a914a..4a94eb974f0d 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -13528,7 +13528,7 @@ static void kvm_noncoherent_dma_assignment_start_or_stop(struct kvm *kvm)
> * (or last) non-coherent device is (un)registered to so that new SPTEs
> * with the correct "ignore guest PAT" setting are created.
> */
> - if (kvm_mmu_may_ignore_guest_pat())
> + if (kvm_mmu_may_ignore_guest_pat(kvm))
> kvm_zap_gfn_range(kvm, gpa_to_gfn(0), gpa_to_gfn(~0ULL));
> }
>
>
> base-commit: 782f9feaa9517caf33186dcdd6b50a8f770ed29b
> --
> 2.49.0.504.g3bcea36a83-goog
>
Powered by blists - more mailing lists