Message-ID: <9a54bd8d-ea42-4c9b-afdc-a9ae3c31b034@oracle.com>
Date: Thu, 6 Nov 2025 15:41:18 -0800
From: Dongli Zhang <dongli.zhang@...cle.com>
To: Chao Gao <chao.gao@...el.com>
Cc: kvm@...r.kernel.org, x86@...nel.org, linux-kernel@...r.kernel.org,
seanjc@...gle.com, pbonzini@...hat.com, tglx@...utronix.de,
mingo@...hat.com, bp@...en8.de, dave.hansen@...ux.intel.com,
hpa@...or.com, joe.jin@...cle.com
Subject: Re: [PATCH 1/1] KVM: VMX: configure SVI during runtime APICv
activation
Hi Chao,
On 11/3/25 11:37 PM, Chao Gao wrote:
> On Mon, Nov 03, 2025 at 01:41:15PM -0800, Dongli Zhang wrote:
>> The APICv (apic->apicv_active) can be activated or deactivated at runtime,
>> for instance, because of APICv inhibit reasons. Intel VMX employs different
>> mechanisms to virtualize LAPIC based on whether APICv is active.
>>
>> When APICv is activated at runtime, GUEST_INTR_STATUS is used to configure
>> and report the current pending IRR and ISR states. Unless a specific vector
>> is explicitly included in EOI_EXIT_BITMAP, its EOI will not be trapped to
>> KVM. Intel VMX automatically clears the corresponding ISR bit based on the
>> GUEST_INTR_STATUS.SVI field.
>>
>> When APICv is deactivated at runtime, the VM_ENTRY_INTR_INFO_FIELD is used
>> to specify the next interrupt vector to invoke upon VM-entry. The
>> VMX IDT_VECTORING_INFO_FIELD is used to report un-invoked vectors on
>> VM-exit. EOIs are always trapped to KVM, so the software can manually clear
>> pending ISR bits.
>>
>> There are scenarios where, with APICv activated at runtime, a guest-issued
>> EOI may not be able to clear the pending ISR bit.
>>
>> Taking vector 236 as an example, here is one scenario.
>>
>> 1. Suppose APICv is inactive. Vector 236 is pending in the IRR.
>> 2. To handle KVM_REQ_EVENT, KVM moves vector 236 from the IRR to the ISR,
>> and configures the VM_ENTRY_INTR_INFO_FIELD via vmx_inject_irq().
>> 3. After VM-entry, vector 236 is invoked through the guest IDT. At this
>> point, the data in VM_ENTRY_INTR_INFO_FIELD is no longer valid. The guest
>> interrupt handler for vector 236 is invoked.
>> 4. Suppose a VM exit occurs very early in the guest interrupt handler,
>> before the EOI is issued.
>> 5. Nothing is reported through the IDT_VECTORING_INFO_FIELD because
>> vector 236 has already been invoked in the guest.
>> 6. Now, suppose APICv is activated. Before the next VM-entry, KVM calls
>> kvm_vcpu_update_apicv() to activate APICv.
>
> which APICv inhibitor is cleared in this step?
APICV_INHIBIT_REASON_APIC_ID_MODIFIED.
vCPU X                                  another thread

__kvm_apic_set_base()
-> vcpu->arch.apic_base = value;
   X2APIC_ENABLE was left set by a
   prior vCPU hotplug.
   Now X2APIC_ENABLE is removed.
   APIC_ID is still in x2apic format.
                                        kvm_recalculate_apic_map()
                                        -> kvm_for_each_vcpu()
                                           -> xapic_id_mismatch
                                              set APICV_INHIBIT_REASON_APIC_ID_MODIFIED
-> kvm_apic_set_xapic_id(apic, vcpu->vcpu_id);
   set APIC_ID to legacy format

Any further kvm_recalculate_apic_map() can clear
APICV_INHIBIT_REASON_APIC_ID_MODIFIED.
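
For illustration only, the window above can be modelled in user space. This is
a minimal sketch, not kernel code: struct model_apic and the model_* helpers
are hypothetical names, and it assumes the legacy xAPIC ID lives in bits 31:24
of APIC_ID while the x2APIC ID uses the whole 32-bit register.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the per-vCPU APIC state. */
struct model_apic {
	uint32_t vcpu_id;
	uint32_t apic_id_reg;	/* raw APIC_ID register contents */
	bool x2apic_enabled;	/* X2APIC_ENABLE bit of apic_base */
};

/* x2APIC format: the whole 32-bit register holds the ID. */
static void model_set_x2apic_id(struct model_apic *a)
{
	a->apic_id_reg = a->vcpu_id;
}

/* Legacy xAPIC format: 8-bit ID stored in bits 31:24. */
static void model_set_xapic_id(struct model_apic *a)
{
	assert(a->vcpu_id <= 0xff);
	a->apic_id_reg = a->vcpu_id << 24;
}

/* What a map recalculation would observe for a vCPU in xAPIC mode. */
static bool model_xapic_id_mismatch(const struct model_apic *a)
{
	if (a->x2apic_enabled)
		return false;
	return (a->apic_id_reg >> 24) != a->vcpu_id;
}
```

In this model, a vCPU with a non-zero vcpu_id reports a mismatch after
x2apic_enabled is cleared and before model_set_xapic_id() runs, and the
mismatch disappears once the ID is rewritten in legacy format. vcpu_id 0 never
mismatches, since both formats encode it as 0.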
The race window is more likely to be hit without the commit below:
KVM: x86: Reinitialize xAPIC ID when userspace forces x2APIC => xAPIC
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=052c3b99cbc8d227f8cb8edf1519197808d1d653
I can also reproduce it by customizing QEMU to modify APIC_ID and apic_base on
purpose. To facilitate diagnosis, I exposed the inhibit reasons via a writable
debugfs file, so that a bash loop script can enable/disable APICv on purpose.
>
> <snip>
>
>>
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index b4b5d2d09634..a20cca69f2ed 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -10873,6 +10873,9 @@ void __kvm_vcpu_update_apicv(struct kvm_vcpu *vcpu)
>> kvm_apic_update_apicv(vcpu);
>> kvm_x86_call(refresh_apicv_exec_ctrl)(vcpu);
>>
>> + if (apic->apicv_active && !is_guest_mode(vcpu))
>> + kvm_apic_update_hwapic_isr(vcpu);
>> +
>
> Why is the nested case exempted here? IIUC, kvm_apic_update_hwapic_isr()
> guarantees an update to VMCS01's SVI even if the vCPU is in guest mode.
>
> And there is already a check against apicv_active right below. So, to be
> concise, how about:
>
> if (!apic->apicv_active)
> kvm_make_request(KVM_REQ_EVENT, vcpu);
> else
> kvm_apic_update_hwapic_isr(vcpu);
Thank you very much for the reminder.
I missed the scenario where the vCPU is in L2. __nested_vmx_vmexit() will not
call kvm_apic_update_hwapic_isr() unless 'update_vmcs01_hwapic_isr' is set to true.
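
As a rough user-space sketch of that deferral (not the kernel implementation):
struct model_vcpu and the model_* functions are hypothetical, and
update_vmcs01_svi stands in for 'update_vmcs01_hwapic_isr'. While L2 is active
the SVI write is only recorded, and it is applied to vmcs01 on the nested
VM-exit.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the relevant vCPU state. */
struct model_vcpu {
	bool guest_mode;	/* true while L2 is running */
	bool update_vmcs01_svi;	/* models update_vmcs01_hwapic_isr */
	int vmcs01_svi;		/* SVI as seen by L1 (vmcs01) */
	int max_isr;		/* highest in-service vector, -1 if none */
};

/* Update SVI now, or defer it when L2 is active. */
static void model_hwapic_isr_update(struct model_vcpu *v)
{
	if (v->guest_mode) {
		v->update_vmcs01_svi = true;
		return;
	}
	v->vmcs01_svi = v->max_isr < 0 ? 0 : v->max_isr;
}

/* On nested VM-exit, apply the deferred update, if any. */
static void model_nested_vmexit(struct model_vcpu *v)
{
	assert(v->guest_mode);
	v->guest_mode = false;
	if (v->update_vmcs01_svi) {
		v->update_vmcs01_svi = false;
		model_hwapic_isr_update(v);
	}
}
```

With guest_mode set and max_isr == 236, model_hwapic_isr_update() leaves
vmcs01_svi untouched and only sets the flag; model_nested_vmexit() then writes
236. If the update is skipped entirely while in L2, nothing sets the flag and
vmcs01 keeps its stale SVI after the exit.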
However, can I remove the WARN_ON_ONCE below, introduced by commit
04bc93cf49d1 ("KVM: nVMX: Defer SVI update to vmcs01 on EOI when L2 is active
w/o VID")?
With this change, vmx_hwapic_isr_update() needs to be called even when the vCPU
is running with VID configured in vmcs12.
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index f87c216d976d..d263dbf0b917 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6878,15 +6878,6 @@ void vmx_hwapic_isr_update(struct kvm_vcpu *vcpu, int max_isr)
 	 * VM-Exit, otherwise L1 with run with a stale SVI.
 	 */
 	if (is_guest_mode(vcpu)) {
-		/*
-		 * KVM is supposed to forward intercepted L2 EOIs to L1 if VID
-		 * is enabled in vmcs12; as above, the EOIs affect L2's vAPIC.
-		 * Note, userspace can stuff state while L2 is active; assert
-		 * that VID is disabled if and only if the vCPU is in KVM_RUN
-		 * to avoid false positives if userspace is setting APIC state.
-		 */
-		WARN_ON_ONCE(vcpu->wants_to_run &&
-			     nested_cpu_has_vid(get_vmcs12(vcpu)));
 		to_vmx(vcpu)->nested.update_vmcs01_hwapic_isr = true;
 		return;
 	}
Otherwise, we may hit the WARNING below.
[ 2510.134035] WARNING: CPU: 16 PID: 43475 at arch/x86/kvm/vmx/vmx.c:6889 vmx_hwapic_isr_update+0x1bf/0x270 [kvm_intel]
... ...
[ 2510.293290] Call Trace:
[ 2510.296090] <TASK>
[ 2510.298509] __kvm_vcpu_update_apicv+0x1c4/0x230 [kvm]
[ 2510.304432] vcpu_enter_guest+0x3a1f/0x48a0 [kvm]
[ 2510.309827] ? __pfx_vcpu_enter_guest+0x10/0x10 [kvm]
[ 2510.315612] ? vmx_get_rflags+0x21/0x180 [kvm_intel]
[ 2510.321313] ? kvm_cpu_has_interrupt+0x7d/0xe0 [kvm]
[ 2510.327047] kvm_arch_vcpu_ioctl_run+0x8ce/0x1d70 [kvm]
[ 2510.333060] kvm_vcpu_ioctl+0xabb/0x1060 [kvm]
[ 2510.338156] ? __pfx_kvm_vcpu_ioctl+0x10/0x10 [kvm]
[ 2510.343746] ? __pfx_file_has_perm+0x10/0x10
[ 2510.348625] ? futex_wake+0x14b/0x580
[ 2510.352800] ? futex_wait+0xc4/0x150
[ 2510.356877] ? __pfx_do_vfs_ioctl+0x10/0x10
[ 2510.361640] ? lock_vma_under_rcu+0x282/0x5f0
[ 2510.366648] ? __pfx_vfs_write+0x10/0x10
[ 2510.371126] ? do_futex+0x16c/0x240
[ 2510.375099] ? __pfx_ioctl_has_perm.constprop.76+0x10/0x10
[ 2510.381366] ? fdget_pos+0x391/0x4c0
[ 2510.391262] ? fput+0x24/0x70
[ 2510.400552] __x64_sys_ioctl+0x130/0x1a0
[ 2510.410858] do_syscall_64+0x50/0xfa0
[ 2510.420872] entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> And the comment below can be extended to state that when APICv gets enabled,
> updating SVI is necessary; otherwise, SVI won't reflect the highest bit in vISR
> and the next EOI from the guest won't be virtualized correctly, as the CPU will
> clear the SVI bit from vISR.
I will add the comment.
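
To make the stale-SVI effect concrete, here is a small user-space model. It is
a sketch under assumptions, not kernel or hardware code: struct model_vapic and
the model_* helpers are hypothetical, and the EOI step follows my reading of
the SDM's EOI virtualization (clear vISR[SVI], then reload SVI from the highest
bit still set in vISR).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model: 256-bit vISR plus the SVI field. */
struct model_vapic {
	uint32_t visr[8];
	uint8_t svi;
};

static void model_set_isr(struct model_vapic *a, int vec)
{
	assert(vec >= 0 && vec < 256);
	a->visr[vec / 32] |= 1u << (vec % 32);
}

static bool model_isr_test(const struct model_vapic *a, int vec)
{
	assert(vec >= 0 && vec < 256);
	return a->visr[vec / 32] & (1u << (vec % 32));
}

/* Highest vector set in vISR, or -1 if none. */
static int model_find_highest_isr(const struct model_vapic *a)
{
	for (int vec = 255; vec >= 0; vec--)
		if (model_isr_test(a, vec))
			return vec;
	return -1;
}

/* Software side: make SVI reflect the highest in-service vector. */
static void model_sync_svi(struct model_vapic *a)
{
	int highest = model_find_highest_isr(a);

	a->svi = highest < 0 ? 0 : (uint8_t)highest;
}

/* Assumed CPU behaviour on a virtualized EOI. */
static void model_virtual_eoi(struct model_vapic *a)
{
	a->visr[a->svi / 32] &= ~(1u << (a->svi % 32));
	model_sync_svi(a);
}
```

If vector 236 is set in vISR while SVI is still 0, model_virtual_eoi() clears
bit 0 instead and vector 236 stays in service. Calling model_sync_svi() first,
which is the role kvm_apic_update_hwapic_isr() plays in the patch, lets the
same EOI clear vector 236.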
Thank you very much!
Dongli Zhang
>
>> /*
>> * When APICv gets disabled, we may still have injected interrupts
>> * pending. At the same time, KVM_REQ_EVENT may not be set as APICv was
>> --
>> 2.39.3
>>
>>