linux-kernel - Re: [PATCH 1/1] KVM: VMX: configure SVI during runtime APICv activation

Message-ID: <9a54bd8d-ea42-4c9b-afdc-a9ae3c31b034@oracle.com>
Date: Thu, 6 Nov 2025 15:41:18 -0800
From: Dongli Zhang <dongli.zhang@...cle.com>
To: Chao Gao <chao.gao@...el.com>
Cc: kvm@...r.kernel.org, x86@...nel.org, linux-kernel@...r.kernel.org,
        seanjc@...gle.com, pbonzini@...hat.com, tglx@...utronix.de,
        mingo@...hat.com, bp@...en8.de, dave.hansen@...ux.intel.com,
        hpa@...or.com, joe.jin@...cle.com
Subject: Re: [PATCH 1/1] KVM: VMX: configure SVI during runtime APICv
 activation

Hi Chao,

On 11/3/25 11:37 PM, Chao Gao wrote:
> On Mon, Nov 03, 2025 at 01:41:15PM -0800, Dongli Zhang wrote:
>> The APICv (apic->apicv_active) can be activated or deactivated at runtime,
>> for instance, because of APICv inhibit reasons. Intel VMX employs different
>> mechanisms to virtualize LAPIC based on whether APICv is active.
>>
>> When APICv is activated at runtime, GUEST_INTR_STATUS is used to configure
>> and report the current pending IRR and ISR states. Unless a specific vector
>> is explicitly included in EOI_EXIT_BITMAP, its EOI will not be trapped to
>> KVM. Intel VMX automatically clears the corresponding ISR bit based on the
>> GUEST_INTR_STATUS.SVI field.
>>
>> When APICv is deactivated at runtime, the VM_ENTRY_INTR_INFO_FIELD is used
>> to specify the next interrupt vector to invoke upon VM-entry. The
>> VMX IDT_VECTORING_INFO_FIELD is used to report un-invoked vectors on
>> VM-exit. EOIs are always trapped to KVM, so the software can manually clear
>> pending ISR bits.
>>
>> There are scenarios where, with APICv activated at runtime, a guest-issued
>> EOI may not be able to clear the pending ISR bit.
>>
>> Taking vector 236 as an example, here is one scenario.
>>
>> 1. Suppose APICv is inactive. Vector 236 is pending in the IRR.
>> 2. To handle KVM_REQ_EVENT, KVM moves vector 236 from the IRR to the ISR,
>> and configures the VM_ENTRY_INTR_INFO_FIELD via vmx_inject_irq().
>> 3. After VM-entry, vector 236 is invoked through the guest IDT. At this
>> point, the data in VM_ENTRY_INTR_INFO_FIELD is no longer valid. The guest
>> interrupt handler for vector 236 is invoked.
>> 4. Suppose a VM exit occurs very early in the guest interrupt handler,
>> before the EOI is issued.
>> 5. Nothing is reported through the IDT_VECTORING_INFO_FIELD because
>> vector 236 has already been invoked in the guest.
>> 6. Now, suppose APICv is activated. Before the next VM-entry, KVM calls
>> kvm_vcpu_update_apicv() to activate APICv.
> 
> which APICv inhibitor is cleared in this step?

APICV_INHIBIT_REASON_APIC_ID_MODIFIED.


      vCPU X                               another thread

__kvm_apic_set_base()

-> vcpu->arch.apic_base = value;
   X2APIC_ENABLE is still set from a
          prior vCPU hotplug.
   Now X2APIC_ENABLE is cleared.
   APIC_ID is still in x2APIC format.

                                       kvm_recalculate_apic_map()
                                       -> kvm_for_each_vcpu()
                                          -> xapic_id_mismatch
                                set APICV_INHIBIT_REASON_APIC_ID_MODIFIED

-> kvm_apic_set_xapic_id(apic, vcpu->vcpu_id);
   set APIC_ID to legacy format

Any further kvm_recalculate_apic_map() can clear
APICV_INHIBIT_REASON_APIC_ID_MODIFIED.
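
For illustration only, the window above can be sketched as a user-space toy
model (this is not KVM code; every toy_* name is hypothetical, and the mismatch
check is a simplified assumption about what kvm_recalculate_apic_map() tests).
It relies on the architectural fact that the xAPIC ID lives in bits 31:24 of
the APIC_ID register, while the x2APIC ID uses the full 32 bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical toy model of one vCPU's LAPIC state; not the kernel's struct. */
struct toy_lapic {
	uint32_t vcpu_id;
	uint32_t apic_id_reg;	/* raw APIC_ID register value */
	bool x2apic_enabled;	/* models X2APIC_ENABLE in apic_base */
};

/* In xAPIC mode the ID is held in bits 31:24 of the APIC_ID register. */
static uint32_t toy_xapic_id(const struct toy_lapic *apic)
{
	return apic->apic_id_reg >> 24;
}

/* Simplified check run by the other thread: xAPIC mode, but ID != vcpu_id. */
static bool toy_xapic_id_mismatch(const struct toy_lapic *apic)
{
	return !apic->x2apic_enabled && toy_xapic_id(apic) != apic->vcpu_id;
}

/* First step on vCPU X: X2APIC_ENABLE is cleared, APIC_ID is untouched. */
static void toy_clear_x2apic_enable(struct toy_lapic *apic)
{
	apic->x2apic_enabled = false;
}

/* Second step on vCPU X: APIC_ID is rewritten in the legacy format. */
static void toy_set_xapic_id(struct toy_lapic *apic)
{
	assert(apic->vcpu_id <= 0xff);	/* an xAPIC ID is only 8 bits wide */
	apic->apic_id_reg = apic->vcpu_id << 24;
}
```

With vcpu_id = 5 and APIC_ID still holding the x2APIC value 5, the check
reports a mismatch between the two steps (5 >> 24 is 0, not 5) and no mismatch
once the second step has run. vCPU 0 never mismatches in this model.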

The race window is more likely to be hit without the commit below:

KVM: x86: Reinitialize xAPIC ID when userspace forces x2APIC => xAPIC
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=052c3b99cbc8d227f8cb8edf1519197808d1d653


I can also reproduce the issue by modifying QEMU to edit APIC_ID and apic_base
on purpose.

To facilitate diagnosis, I exposed the inhibit reasons via a writable debugfs
file, so that a bash loop script can enable/disable APICv on purpose.

> 
> <snip>
> 
>>
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index b4b5d2d09634..a20cca69f2ed 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -10873,6 +10873,9 @@ void __kvm_vcpu_update_apicv(struct kvm_vcpu *vcpu)
>> 	kvm_apic_update_apicv(vcpu);
>> 	kvm_x86_call(refresh_apicv_exec_ctrl)(vcpu);
>>
>> +	if (apic->apicv_active && !is_guest_mode(vcpu))
>> +		kvm_apic_update_hwapic_isr(vcpu);
>> +
> 
> Why is the nested case exempted here? IIUC, kvm_apic_update_hwapic_isr()
> guarantees an update to VMCS01's SVI even if the vCPU is in guest mode.
> 
> And there is already a check against apicv_active right below. So, to be
> concise, how about:
> 
> 	if (!apic->apicv_active)
> 		kvm_make_request(KVM_REQ_EVENT, vcpu);
> 	else
> 		kvm_apic_update_hwapic_isr(vcpu);

Thank you very much for the reminder.

I missed the scenario where the vCPU is in L2. __nested_vmx_vmexit() will not
call kvm_apic_update_hwapic_isr() unless 'update_vmcs01_hwapic_isr' is set to true.

However, can I remove the WARN_ON_ONCE below, introduced by commit
04bc93cf49d1 ("KVM: nVMX: Defer SVI update to vmcs01 on EOI when L2 is active
w/o VID")?

With this change, vmx_hwapic_isr_update() can also be called while the vCPU is
running L2 with VID enabled in vmcs12.

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index f87c216d976d..d263dbf0b917 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6878,15 +6878,6 @@ void vmx_hwapic_isr_update(struct kvm_vcpu *vcpu, int max_isr)
         * VM-Exit, otherwise L1 with run with a stale SVI.
         */
        if (is_guest_mode(vcpu)) {
-               /*
-                * KVM is supposed to forward intercepted L2 EOIs to L1 if VID
-                * is enabled in vmcs12; as above, the EOIs affect L2's vAPIC.
-                * Note, userspace can stuff state while L2 is active; assert
-                * that VID is disabled if and only if the vCPU is in KVM_RUN
-                * to avoid false positives if userspace is setting APIC state.
-                */
-               WARN_ON_ONCE(vcpu->wants_to_run &&
-                            nested_cpu_has_vid(get_vmcs12(vcpu)));
                to_vmx(vcpu)->nested.update_vmcs01_hwapic_isr = true;
                return;
        }


Otherwise, we may encounter the WARNING below.

[ 2510.134035] WARNING: CPU: 16 PID: 43475 at arch/x86/kvm/vmx/vmx.c:6889 vmx_hwapic_isr_update+0x1bf/0x270 [kvm_intel]
... ...
[ 2510.293290] Call Trace:
[ 2510.296090]  <TASK>
[ 2510.298509]  __kvm_vcpu_update_apicv+0x1c4/0x230 [kvm]
[ 2510.304432]  vcpu_enter_guest+0x3a1f/0x48a0 [kvm]
[ 2510.309827]  ? __pfx_vcpu_enter_guest+0x10/0x10 [kvm]
[ 2510.315612]  ? vmx_get_rflags+0x21/0x180 [kvm_intel]
[ 2510.321313]  ? kvm_cpu_has_interrupt+0x7d/0xe0 [kvm]
[ 2510.327047]  kvm_arch_vcpu_ioctl_run+0x8ce/0x1d70 [kvm]
[ 2510.333060]  kvm_vcpu_ioctl+0xabb/0x1060 [kvm]
[ 2510.338156]  ? __pfx_kvm_vcpu_ioctl+0x10/0x10 [kvm]
[ 2510.343746]  ? __pfx_file_has_perm+0x10/0x10
[ 2510.348625]  ? futex_wake+0x14b/0x580
[ 2510.352800]  ? futex_wait+0xc4/0x150
[ 2510.356877]  ? __pfx_do_vfs_ioctl+0x10/0x10
[ 2510.361640]  ? lock_vma_under_rcu+0x282/0x5f0
[ 2510.366648]  ? __pfx_vfs_write+0x10/0x10
[ 2510.371126]  ? do_futex+0x16c/0x240
[ 2510.375099]  ? __pfx_ioctl_has_perm.constprop.76+0x10/0x10
[ 2510.381366]  ? fdget_pos+0x391/0x4c0
[ 2510.391262]  ? fput+0x24/0x70
[ 2510.400552]  __x64_sys_ioctl+0x130/0x1a0
[ 2510.410858]  do_syscall_64+0x50/0xfa0
[ 2510.420872]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
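
For illustration only, the deferral that the diff above leaves in place can be
sketched as a user-space toy model (not KVM code; the toy_* names and the
struct layout are hypothetical). While L2 is active the SVI write to vmcs01 is
only recorded as pending, and it is replayed at nested VM-exit.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model of the state involved; not the kernel's structs. */
struct toy_vcpu {
	bool guest_mode;		/* true while L2 is running */
	bool update_vmcs01_hwapic_isr;	/* deferred-update flag */
	int vmcs01_svi;			/* models the SVI field of vmcs01 */
};

/* Models vmx_hwapic_isr_update(): defer in guest mode, else write SVI. */
static void toy_hwapic_isr_update(struct toy_vcpu *v, int max_isr)
{
	assert(max_isr >= -1 && max_isr <= 255);

	if (v->guest_mode) {
		/* vmcs02 is loaded, so remember to fix up vmcs01 later. */
		v->update_vmcs01_hwapic_isr = true;
		return;
	}

	/* -1 means "no vector in service", which is stored as SVI = 0. */
	v->vmcs01_svi = max_isr < 0 ? 0 : max_isr;
}

/* Models the nested VM-exit path replaying a deferred SVI update. */
static void toy_nested_vmexit(struct toy_vcpu *v, int max_isr)
{
	v->guest_mode = false;
	if (v->update_vmcs01_hwapic_isr) {
		v->update_vmcs01_hwapic_isr = false;
		toy_hwapic_isr_update(v, max_isr);
	}
}
```

In this model, an update for vector 236 issued in guest mode leaves vmcs01's
SVI unchanged and only sets the flag; the value lands in vmcs01 when
toy_nested_vmexit() runs.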

> 
> And the comment below can be extended to state that when APICv gets enabled,
> updating SVI is necessary; otherwise, SVI won't reflect the highest bit in vISR
> and the next EOI from the guest won't be virtualized correctly, as the CPU will
> clear the SVI bit from vISR.

I will add the comment.
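
For illustration only, the effect of a stale SVI can be sketched as a
user-space toy model of EOI virtualization (not KVM code; the toy_* names are
hypothetical, the flat visr[] array ignores the real vAPIC page layout, and the
EOI steps follow my reading of the SDM: clear VISR[SVI], then reload SVI from
the highest remaining vISR bit).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical toy vAPIC: 256 vISR bits plus the SVI byte. */
struct toy_vapic {
	uint32_t visr[8];
	uint8_t svi;
};

static void toy_isr_set(struct toy_vapic *a, int vec)
{
	assert(vec >= 0 && vec <= 255);
	a->visr[vec / 32] |= 1u << (vec % 32);
}

static bool toy_isr_test(const struct toy_vapic *a, int vec)
{
	assert(vec >= 0 && vec <= 255);
	return (a->visr[vec / 32] & (1u << (vec % 32))) != 0;
}

/* Returns the highest vector set in vISR, or -1 if vISR is empty. */
static int toy_highest_isr(const struct toy_vapic *a)
{
	for (int vec = 255; vec >= 0; vec--)
		if (toy_isr_test(a, vec))
			return vec;
	return -1;
}

/* Simplified virtualized EOI: the CPU clears the bit named by SVI. */
static void toy_virtual_eoi(struct toy_vapic *a)
{
	int highest;

	a->visr[a->svi / 32] &= ~(1u << (a->svi % 32));
	highest = toy_highest_isr(a);
	a->svi = highest < 0 ? 0 : (uint8_t)highest;
}
```

With vector 236 in service but SVI left at 0, toy_virtual_eoi() clears bit 0
and vector 236 stays set in vISR. With SVI set to 236 beforehand, the same EOI
clears vector 236.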

Thank you very much!

Dongli Zhang

> 
>> 	/*
>> 	 * When APICv gets disabled, we may still have injected interrupts
>> 	 * pending. At the same time, KVM_REQ_EVENT may not be set as APICv was
>> -- 
>> 2.39.3
>>
>>