Message-ID: <20251128123202.68424a95@imammedo>
Date: Fri, 28 Nov 2025 12:32:02 +0100
From: Igor Mammedov <imammedo@...hat.com>
To: Sean Christopherson <seanjc@...gle.com>
Cc: Paolo Bonzini <pbonzini@...hat.com>, kvm@...r.kernel.org,
 linux-kernel@...r.kernel.org, Jim Mattson <jmattson@...gle.com>,
 mlevitsk@...hat.com
Subject: Re: [PATCH 5/5] KVM: x86: Defer runtime updates of dynamic CPUID
 bits until CPUID emulation

On Tue, 10 Dec 2024 17:33:02 -0800
Sean Christopherson <seanjc@...gle.com> wrote:

Sean,

this patch broke vCPU hotplug (still broken with current master):
after repeated plug/unplug of the same vCPU in a loop, QEMU exits
due to an error during vCPU initialization:

    r = kvm_vcpu_ioctl(cs, KVM_SET_CPUID2, &cpuid_data);
    if (r) {
        goto fail;
    }

Reproducer (the host in question is Haswell, but the issue has been seen
on other hosts as well): to trigger it, the guest must be a Q35 machine
with UEFI firmware (the rest doesn't seem to matter)
===

#!/bin/sh

/tmp/qemu_build/qemu-system-x86_64 -M q35,pflash0=drive_ovmf_code,pflash1=drive_ovmf_vars \
    -m 16G -cpu host -smp 1,maxcpus=2 -enable-kvm \
    -blockdev node-name=file_ovmf_code,driver=file,filename=edk2-x86_64-secure-code.fd,auto-read-only=on,discard=unmap \
    -blockdev node-name=drive_ovmf_code,driver=raw,read-only=on,file=file_ovmf_code \
    -blockdev node-name=file_ovmf_vars,driver=file,filename=myOVMF_VARS.fd,auto-read-only=on,discard=unmap \
    -blockdev node-name=drive_ovmf_vars,driver=raw,read-only=off,file=file_ovmf_vars \
    -monitor unix:/tmp/monitor2,server,nowait -snapshot \
    ./rhel10-efi.qcow2 &

sleep 120

i=0
while [ $i -lt 3 ]
do
	echo "====== Iteration $((i+1)) ======"
	echo "info cpus" | nc -U /tmp/monitor2
	sleep 2
	echo "device_add host-x86_64-cpu,core-id=1,thread-id=0,socket-id=0,id=core1" | nc -U /tmp/monitor2
	sleep 6
	echo "info cpus" | nc -U /tmp/monitor2
	sleep 2
	echo "device_del core1" | nc -U /tmp/monitor2
	sleep 6
	echo "info cpus" | nc -U /tmp/monitor2
	sleep 2
	i=$((i+1))
done

===


> Defer runtime CPUID updates until the next non-faulting CPUID emulation
> or KVM_GET_CPUID2, which are the only paths in KVM that consume the
> dynamic entries.  Deferring the updates is especially beneficial to
> nested VM-Enter/VM-Exit, as KVM will almost always detect multiple state
> changes, not to mention the updates don't need to be realized while L2 is
> active, as CPUID is a mandatory intercept on both Intel and AMD.
> 
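
The deferral scheme described above (writers merely flag the cached CPUID
as stale, and the next consumer realizes all pending updates at once) can
be modeled in a few lines of standalone C; names below are illustrative,
not the actual KVM code:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a vCPU with one dynamic CPUID bit (OSXSAVE-like). */
struct toy_vcpu {
	unsigned long cr4;		/* guest state feeding the dynamic bit */
	unsigned int  cpuid_ecx;	/* cached CPUID output */
	bool dynamic_bits_dirty;	/* set by writers, consumed by readers */
};

#define TOY_CR4_OSXSAVE		(1ul << 18)
#define TOY_CPUID_OSXSAVE	(1u << 27)

/* Writer path: only mark the cached CPUID stale, do no real work. */
static void toy_set_cr4(struct toy_vcpu *v, unsigned long cr4)
{
	v->cr4 = cr4;
	v->dynamic_bits_dirty = true;
}

/* Reader path: refresh the cache only if a write happened since last read. */
static unsigned int toy_cpuid_ecx(struct toy_vcpu *v)
{
	if (v->dynamic_bits_dirty) {
		if (v->cr4 & TOY_CR4_OSXSAVE)
			v->cpuid_ecx |= TOY_CPUID_OSXSAVE;
		else
			v->cpuid_ecx &= ~TOY_CPUID_OSXSAVE;
		v->dynamic_bits_dirty = false;
	}
	return v->cpuid_ecx;
}
```

Any number of state changes between two reads collapses into a single
refresh, which is where the nested VM-Enter/VM-Exit savings come from.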
> Deferring CPUID updates shaves several hundred cycles from nested VMX
> roundtrips, as measured from L2 executing CPUID in a tight loop:
> 
>   SKX 6850 => 6450
>   ICX 9000 => 8800
>   EMR 7900 => 7700
> 
> Alternatively, KVM could update only the CPUID leaves that are affected
> by the state change, e.g. update XSAVE info only if XCR0 or XSS changes,
> but that adds non-trivial complexity and doesn't solve the underlying
> problem of nested transitions potentially changing both XCR0 and XSS, on
> both nested VM-Enter and VM-Exit.
> 
> KVM could also skip updates entirely while L2 is active, because again
> CPUID is a mandatory intercept.  However, simply skipping updates if L2
> is active is *very* subtly dangerous and complex.  Most KVM updates are
> triggered by changes to the current vCPU state, which may be L2 state
> whereas performing updates only for L1 would require detecting changes
> to L1 state.  KVM would need to either track relevant L1 state, or defer
> runtime CPUID updates until the next nested VM-Exit.  The former is ugly
> and complex, while the latter comes with similar dangers to deferring all
> CPUID updates, and would only address the nested VM-Enter path.
> 
> To guard against using stale data, disallow querying dynamic CPUID feature
> bits, i.e. features that KVM updates at runtime, via a compile-time
> assertion in guest_cpu_cap_has().  Exempt MWAIT from the rule, as the
> MISC_ENABLE_NO_MWAIT quirk means that MWAIT is _conditionally_ a dynamic
> CPUID feature.
> 
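
The compile-time guard works because guest_cpu_cap_has() is __always_inline
with a constant feature argument at every call site.  A minimal standalone
approximation using _Static_assert in a GNU C statement expression (feature
names are made up for the sketch; the kernel's BUILD_BUG_ON is implemented
differently, via a compile-time error attribute):

```c
#include <assert.h>

/* Toy feature numbers; the real ones come from cpufeatures.h. */
enum {
	TOY_FEATURE_APIC	= 0,
	TOY_FEATURE_OSXSAVE	= 1,
	TOY_FEATURE_OSPKE	= 2,
	TOY_FEATURE_MWAIT	= 3,
};

/*
 * Reject dynamic bits at compile time: the feature argument is a
 * compile-time constant at every call site, so the assertion either
 * folds away or fails the build.  MWAIT is deliberately not listed.
 */
#define toy_cap_has(caps, f) ({						\
	_Static_assert((f) != TOY_FEATURE_APIC &&			\
		       (f) != TOY_FEATURE_OSXSAVE &&			\
		       (f) != TOY_FEATURE_OSPKE,			\
		       "querying a dynamic CPUID bit is disallowed");	\
	!!((caps) & (1u << (f)));					\
})
```

Calling toy_cap_has() with TOY_FEATURE_APIC fails to compile, while the
exempted MWAIT bit compiles and is evaluated normally.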
> Note, the rule could be enforced for MWAIT as well, e.g. by querying guest
> CPUID in kvm_emulate_monitor_mwait(), but there's no obvious advantage to
> doing so, and allowing MWAIT for guest_cpuid_has() opens up a different can
> of worms.  MONITOR/MWAIT can't be virtualized (for a reasonable definition),
> and the nature of the MWAIT_NEVER_UD_FAULTS and MISC_ENABLE_NO_MWAIT quirks
> means checking X86_FEATURE_MWAIT outside of kvm_emulate_monitor_mwait() is
> wrong for other reasons.
> 
> Beyond the aforementioned feature bits, the only other dynamic CPUID
> (sub)leaves are the XSAVE sizes, and similar to MWAIT, consuming those
> CPUID entries in KVM is all but guaranteed to be a bug.  The layout for an
> actual XSAVE buffer depends on the format (compacted or not) and
> potentially the features that are actually enabled.  E.g. see the logic in
> fpstate_clear_xstate_component() needed to poke into the guest's effective
> XSAVE state to clear MPX state on INIT.  KVM does consume
> CPUID.0xD.0.{EAX,EDX} in kvm_check_cpuid() and cpuid_get_supported_xcr0(),
> but not EBX, which is the only dynamic output register in the leaf.
> 
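
The dynamic EBX output mentioned above tracks the currently enabled feature
set.  A toy version of an xstate_required_size()-style calculation (the
component sizes and offsets below are invented; real values are enumerated
from CPUID.0xD itself):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model of XSAVE buffer sizing.  Components 0/1 (x87/SSE) live in
 * the fixed 512-byte legacy region; sizes/offsets here are made up.
 */
#define TOY_NR_COMP 5
static const unsigned int toy_size[TOY_NR_COMP]   = { [2] = 256, [3] = 64, [4] = 64 };
static const unsigned int toy_offset[TOY_NR_COMP] = { [2] = 576, [3] = 832, [4] = 896 };

static unsigned int toy_xstate_size(unsigned long long xcr0, bool compacted)
{
	unsigned int size = 512 + 64;	/* legacy region + XSAVE header */
	int i;

	for (i = 2; i < TOY_NR_COMP; i++) {
		if (!(xcr0 & (1ull << i)))
			continue;
		if (compacted)
			size += toy_size[i];			/* packed back to back */
		else
			size = toy_offset[i] + toy_size[i];	/* fixed offsets */
	}
	return size;
}
```

The compacted size depends on exactly which features are enabled (disabled
components leave no hole), while the standard format uses fixed offsets, so
the reported size must be recomputed whenever XCR0 or XSS changes.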
> Signed-off-by: Sean Christopherson <seanjc@...gle.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  1 +
>  arch/x86/kvm/cpuid.c            | 12 ++++++++++--
>  arch/x86/kvm/cpuid.h            |  9 ++++++++-
>  arch/x86/kvm/lapic.c            |  2 +-
>  arch/x86/kvm/smm.c              |  2 +-
>  arch/x86/kvm/svm/sev.c          |  2 +-
>  arch/x86/kvm/svm/svm.c          |  2 +-
>  arch/x86/kvm/vmx/vmx.c          |  2 +-
>  arch/x86/kvm/x86.c              |  6 +++---
>  9 files changed, 27 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 81ce8cd5814a..23cc5c10060e 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -871,6 +871,7 @@ struct kvm_vcpu_arch {
>  
>  	int cpuid_nent;
>  	struct kvm_cpuid_entry2 *cpuid_entries;
> +	bool cpuid_dynamic_bits_dirty;
>  	bool is_amd_compatible;
>  
>  	/*
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 7f5fa6665969..54ba1a75b779 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -195,6 +195,7 @@ static int kvm_check_cpuid(struct kvm_vcpu *vcpu)
>  }
>  
>  static u32 kvm_apply_cpuid_pv_features_quirk(struct kvm_vcpu *vcpu);
> +static void kvm_update_cpuid_runtime(struct kvm_vcpu *vcpu);
>  
>  /* Check whether the supplied CPUID data is equal to what is already set for the vCPU. */
>  static int kvm_cpuid_check_equal(struct kvm_vcpu *vcpu, struct kvm_cpuid_entry2 *e2,
> @@ -299,10 +300,12 @@ static __always_inline void kvm_update_feature_runtime(struct kvm_vcpu *vcpu,
>  	guest_cpu_cap_change(vcpu, x86_feature, has_feature);
>  }
>  
> -void kvm_update_cpuid_runtime(struct kvm_vcpu *vcpu)
> +static void kvm_update_cpuid_runtime(struct kvm_vcpu *vcpu)
>  {
>  	struct kvm_cpuid_entry2 *best;
>  
> +	vcpu->arch.cpuid_dynamic_bits_dirty = false;
> +
>  	best = kvm_find_cpuid_entry(vcpu, 1);
>  	if (best) {
>  		kvm_update_feature_runtime(vcpu, best, X86_FEATURE_OSXSAVE,
> @@ -332,7 +335,6 @@ void kvm_update_cpuid_runtime(struct kvm_vcpu *vcpu)
>  		     cpuid_entry_has(best, X86_FEATURE_XSAVEC)))
>  		best->ebx = xstate_required_size(vcpu->arch.xcr0, true);
>  }
> -EXPORT_SYMBOL_GPL(kvm_update_cpuid_runtime);
>  
>  static bool kvm_cpuid_has_hyperv(struct kvm_vcpu *vcpu)
>  {
> @@ -645,6 +647,9 @@ int kvm_vcpu_ioctl_get_cpuid2(struct kvm_vcpu *vcpu,
>  	if (cpuid->nent < vcpu->arch.cpuid_nent)
>  		return -E2BIG;
>  
> +	if (vcpu->arch.cpuid_dynamic_bits_dirty)
> +		kvm_update_cpuid_runtime(vcpu);
> +
>  	if (copy_to_user(entries, vcpu->arch.cpuid_entries,
>  			 vcpu->arch.cpuid_nent * sizeof(struct kvm_cpuid_entry2)))
>  		return -EFAULT;
> @@ -1983,6 +1988,9 @@ bool kvm_cpuid(struct kvm_vcpu *vcpu, u32 *eax, u32 *ebx,
>  	struct kvm_cpuid_entry2 *entry;
>  	bool exact, used_max_basic = false;
>  
> +	if (vcpu->arch.cpuid_dynamic_bits_dirty)
> +		kvm_update_cpuid_runtime(vcpu);
> +
>  	entry = kvm_find_cpuid_entry_index(vcpu, function, index);
>  	exact = !!entry;
>  
> diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
> index 67d80aa72d50..d2884162a46a 100644
> --- a/arch/x86/kvm/cpuid.h
> +++ b/arch/x86/kvm/cpuid.h
> @@ -11,7 +11,6 @@ extern u32 kvm_cpu_caps[NR_KVM_CPU_CAPS] __read_mostly;
>  void kvm_set_cpu_caps(void);
>  
>  void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu);
> -void kvm_update_cpuid_runtime(struct kvm_vcpu *vcpu);
>  struct kvm_cpuid_entry2 *kvm_find_cpuid_entry_index(struct kvm_vcpu *vcpu,
>  						    u32 function, u32 index);
>  struct kvm_cpuid_entry2 *kvm_find_cpuid_entry(struct kvm_vcpu *vcpu,
> @@ -232,6 +231,14 @@ static __always_inline bool guest_cpu_cap_has(struct kvm_vcpu *vcpu,
>  {
>  	unsigned int x86_leaf = __feature_leaf(x86_feature);
>  
> +	/*
> +	 * Except for MWAIT, querying dynamic feature bits is disallowed, so
> +	 * that KVM can defer runtime updates until the next CPUID emulation.
> +	 */
> +	BUILD_BUG_ON(x86_feature == X86_FEATURE_APIC ||
> +		     x86_feature == X86_FEATURE_OSXSAVE ||
> +		     x86_feature == X86_FEATURE_OSPKE);
> +
>  	return vcpu->arch.cpu_caps[x86_leaf] & __feature_bit(x86_feature);
>  }
>  
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index ae81ae27d534..cf74c87b8b3f 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -2585,7 +2585,7 @@ static void __kvm_apic_set_base(struct kvm_vcpu *vcpu, u64 value)
>  	vcpu->arch.apic_base = value;
>  
>  	if ((old_value ^ value) & MSR_IA32_APICBASE_ENABLE)
> -		kvm_update_cpuid_runtime(vcpu);
> +		vcpu->arch.cpuid_dynamic_bits_dirty = true;
>  
>  	if (!apic)
>  		return;
> diff --git a/arch/x86/kvm/smm.c b/arch/x86/kvm/smm.c
> index e0ab7df27b66..699e551ec93b 100644
> --- a/arch/x86/kvm/smm.c
> +++ b/arch/x86/kvm/smm.c
> @@ -358,7 +358,7 @@ void enter_smm(struct kvm_vcpu *vcpu)
>  			goto error;
>  #endif
>  
> -	kvm_update_cpuid_runtime(vcpu);
> +	vcpu->arch.cpuid_dynamic_bits_dirty = true;
>  	kvm_mmu_reset_context(vcpu);
>  	return;
>  error:
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index 09be12a44288..5e4581ed0ef1 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -3274,7 +3274,7 @@ static void sev_es_sync_from_ghcb(struct vcpu_svm *svm)
>  
>  	if (kvm_ghcb_xcr0_is_valid(svm)) {
>  		vcpu->arch.xcr0 = ghcb_get_xcr0(ghcb);
> -		kvm_update_cpuid_runtime(vcpu);
> +		vcpu->arch.cpuid_dynamic_bits_dirty = true;
>  	}
>  
>  	/* Copy the GHCB exit information into the VMCB fields */
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 07911ddf1efe..6a350cee2f6c 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -1936,7 +1936,7 @@ void svm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
>  	vmcb_mark_dirty(to_svm(vcpu)->vmcb, VMCB_CR);
>  
>  	if ((cr4 ^ old_cr4) & (X86_CR4_OSXSAVE | X86_CR4_PKE))
> -		kvm_update_cpuid_runtime(vcpu);
> +		vcpu->arch.cpuid_dynamic_bits_dirty = true;
>  }
>  
>  static void svm_set_segment(struct kvm_vcpu *vcpu,
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index cf872d8691b5..b5f3c5628bfd 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -3516,7 +3516,7 @@ void vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
>  	vmcs_writel(GUEST_CR4, hw_cr4);
>  
>  	if ((cr4 ^ old_cr4) & (X86_CR4_OSXSAVE | X86_CR4_PKE))
> -		kvm_update_cpuid_runtime(vcpu);
> +		vcpu->arch.cpuid_dynamic_bits_dirty = true;
>  }
>  
>  void vmx_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index dc8829712edd..10b7d8c01e4d 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1264,7 +1264,7 @@ static int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr)
>  	vcpu->arch.xcr0 = xcr0;
>  
>  	if ((xcr0 ^ old_xcr0) & XFEATURE_MASK_EXTEND)
> -		kvm_update_cpuid_runtime(vcpu);
> +		vcpu->arch.cpuid_dynamic_bits_dirty = true;
>  	return 0;
>  }
>  
> @@ -3899,7 +3899,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  			if (!guest_cpu_cap_has(vcpu, X86_FEATURE_XMM3))
>  				return 1;
>  			vcpu->arch.ia32_misc_enable_msr = data;
> -			kvm_update_cpuid_runtime(vcpu);
> +			vcpu->arch.cpuid_dynamic_bits_dirty = true;
>  		} else {
>  			vcpu->arch.ia32_misc_enable_msr = data;
>  		}
> @@ -3934,7 +3934,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  		if (data & ~kvm_caps.supported_xss)
>  			return 1;
>  		vcpu->arch.ia32_xss = data;
> -		kvm_update_cpuid_runtime(vcpu);
> +		vcpu->arch.cpuid_dynamic_bits_dirty = true;
>  		break;
>  	case MSR_SMI_COUNT:
>  		if (!msr_info->host_initiated)

