linux-kernel - Re: [PATCH v3 06/10] KVM: nVMX: Switch to vmcs01 to update SVI on-demand if L2 is active

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aVQ-MNmUa1fb83zH@google.com>
Date: Tue, 30 Dec 2025 13:03:44 -0800
From: Sean Christopherson <seanjc@...gle.com>
To: Chao Gao <chao.gao@...el.com>
Cc: Paolo Bonzini <pbonzini@...hat.com>, kvm@...r.kernel.org, linux-kernel@...r.kernel.org, 
	Dongli Zhang <dongli.zhang@...cle.com>
Subject: Re: [PATCH v3 06/10] KVM: nVMX: Switch to vmcs01 to update SVI
 on-demand if L2 is active

On Thu, Dec 25, 2025, Chao Gao wrote:
> On Fri, Dec 05, 2025 at 03:19:09PM -0800, Sean Christopherson wrote:
> >@@ -6963,21 +6963,16 @@ void vmx_hwapic_isr_update(struct kvm_vcpu *vcpu, int max_isr)
> > 	u16 status;
> > 	u8 old;
> > 
> >-	/*
> >-	 * If L2 is active, defer the SVI update until vmcs01 is loaded, as SVI
> >-	 * is only relevant for if and only if Virtual Interrupt Delivery is
> >-	 * enabled in vmcs12, and if VID is enabled then L2 EOIs affect L2's
> >-	 * vAPIC, not L1's vAPIC.  KVM must update vmcs01 on the next nested
> >-	 * VM-Exit, otherwise L1 with run with a stale SVI.
> >-	 */
> >-	if (is_guest_mode(vcpu)) {
> >-		to_vmx(vcpu)->nested.update_vmcs01_hwapic_isr = true;
> >-		return;
> >-	}
> >-
> > 	if (max_isr == -1)
> > 		max_isr = 0;
> > 
> >+	/*
> >+	 * Always update SVI in vmcs01, as SVI is only relevant for L2 if and
> >+	 * only if Virtual Interrupt Delivery is enabled in vmcs12, and if VID
> >+	 * is enabled then L2 EOIs affect L2's vAPIC, not L1's vAPIC.
> >+	 */
> >+	guard(vmx_vmcs01)(vcpu);
> 
> KVM calls this function when virtualizing EOI for L2, and in a previous
> discussion, you mentioned that the overhead of switching to VMCS01 is
> "non-trivial and unnecessary" (see [1]).
> 
> My testing shows that guard(vmx_vmcs01) takes about 140-250 cycles. I think
> this overhead is acceptable for nested scenarios, since it only affects
> EOI-induced VM-exits in specific/suboptimal configurations.
> 
> But I'm wondering whether KVM should update SVI on every VM-entry instead of
> doing it on-demand (i.e., when vISR gets changed). We've encountered two
> SVI-related bugs [1][2] that were difficult to debug. Preventing these issues
> entirely seems worthwhile, and the overhead of always updating SVI during
> VM-entry should be minimal since KVM already updates RVI (RVI and SVI are in
> the the same VMCS field) in vmx_sync_irr_to_pir() when APICv is enabled.

Hmm.  At first glance, I _really_ like this idea, but I'm leaning fairly strongly
towards keeping .hwapic_isr_update().

While small (~28 cycles on EMR), the runtime cost isn't zero, and it affects the
fastpath.  And number of useful updates is comically small.  E.g. without a nested
VM, AFAICT they basically never happen post-boot.  Even when running nested VMs,
the number of useful update when running L1 hovers around ~0.001%.

More importantly, KVM will carry most of the complexity related to vISR updates
regardless of how KVM handles SVI because of the ISR caching for non-APICv
systems.  So while I acknowledge that we've had some nasty bugs and 100% agree
that squashing them entirely is _very_ enticing, I think those bugs were due to
what were effectively two systemic flaws in KVM: (1) not aligning SVI with KVM's
ISR caching code, and (2) the whole "defer updates to nested VM-Exit" mess.

At the end of this series, both (1) and (2) are "solved".  Huh.  And now that I
look at (1) again, the last patch is wrong (benignly wrong, but still wrong).
The changelog says this:

  First, it adds a call during kvm_lapic_reset(), but that's a glorified nop as
  the ISR has already been zeroed.

but that's simply not true.  There's already a call in kvm_lapic_reset().  So
that patch can be amended with:

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 7be4d759884c..55a7a2be3a2e 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2907,10 +2907,8 @@ void kvm_lapic_reset(struct kvm_vcpu *vcpu, bool init_event)
 
        vcpu->arch.pv_eoi.msr_val = 0;
        apic_update_ppr(apic);
-       if (apic->apicv_active) {
+       if (apic->apicv_active)
                kvm_x86_call(apicv_post_state_restore)(vcpu);
-               kvm_x86_call(hwapic_isr_update)(vcpu, -1);
-       }
 
        vcpu->arch.apic_arb_prio = 0;
        vcpu->arch.apic_attention = 0;


At which point updates to highest_isr_cache and .hwapic_isr_update() are fully
symmetrical (ignoring that KVM simply invalidates highest_isr_cache instead of
scanning the vISR on EOI and APICv changes).

So yeah, the more I look at all of this, the more I'm in favor of keeping
.hwapic_isr_update(), e.g. if only to let it serve as a canary for finding issues
related to highest_isr_cache and/or isr_count.

> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index ef8d29c677b9..e7883bf7665f 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -6957,45 +6957,20 @@ void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
> 	read_unlock(&vcpu->kvm->mmu_lock);
>  }
>  
> -void vmx_hwapic_isr_update(struct kvm_vcpu *vcpu, int max_isr)
> +static void vmx_set_rvi_svi(int rvi, int svi)

If this ever goes anywhere, my vote would be to call this vmx_sync_guest_intr_status(),
and pass in only @rvi, e.g. 

  static void vmx_sync_guest_intr_status(struct kvm_vcpu *vcpu, int rvi)
  {
	int svi = kvm_lapic_find_highest_isr(vcpu);
	u16 status, new;

	...
  }

> 	status = vmcs_read16(GUEST_INTR_STATUS);
> +	new = (rvi & 0xff) | ((u8)svi << 8);

I think this is technically undefined behavior?  Due to a shift larger than type
(casting to an 8-bit value and then shifting by 8).  svi[31:8] should always be
'0', but to be paranoid we could do:

	new = (rvi & 0xff) | ((svi & 0xff) << 8);

> +	if (new != status)
> +		vmcs_write16(GUEST_INTR_STATUS, new);
>  }