Message-ID: <aVQ-MNmUa1fb83zH@google.com>
Date: Tue, 30 Dec 2025 13:03:44 -0800
From: Sean Christopherson <seanjc@...gle.com>
To: Chao Gao <chao.gao@...el.com>
Cc: Paolo Bonzini <pbonzini@...hat.com>, kvm@...r.kernel.org, linux-kernel@...r.kernel.org,
Dongli Zhang <dongli.zhang@...cle.com>
Subject: Re: [PATCH v3 06/10] KVM: nVMX: Switch to vmcs01 to update SVI
on-demand if L2 is active
On Thu, Dec 25, 2025, Chao Gao wrote:
> On Fri, Dec 05, 2025 at 03:19:09PM -0800, Sean Christopherson wrote:
> >@@ -6963,21 +6963,16 @@ void vmx_hwapic_isr_update(struct kvm_vcpu *vcpu, int max_isr)
> > u16 status;
> > u8 old;
> >
> >- /*
> >- * If L2 is active, defer the SVI update until vmcs01 is loaded, as SVI
> >- * is only relevant for if and only if Virtual Interrupt Delivery is
> >- * enabled in vmcs12, and if VID is enabled then L2 EOIs affect L2's
> >- * vAPIC, not L1's vAPIC. KVM must update vmcs01 on the next nested
> >- * VM-Exit, otherwise L1 with run with a stale SVI.
> >- */
> >- if (is_guest_mode(vcpu)) {
> >- to_vmx(vcpu)->nested.update_vmcs01_hwapic_isr = true;
> >- return;
> >- }
> >-
> > if (max_isr == -1)
> > max_isr = 0;
> >
> >+ /*
> >+ * Always update SVI in vmcs01, as SVI is only relevant for L2 if and
> >+ * only if Virtual Interrupt Delivery is enabled in vmcs12, and if VID
> >+ * is enabled then L2 EOIs affect L2's vAPIC, not L1's vAPIC.
> >+ */
> >+ guard(vmx_vmcs01)(vcpu);
>
> KVM calls this function when virtualizing EOI for L2, and in a previous
> discussion, you mentioned that the overhead of switching to VMCS01 is
> "non-trivial and unnecessary" (see [1]).
>
> My testing shows that guard(vmx_vmcs01) takes about 140-250 cycles. I think
> this overhead is acceptable for nested scenarios, since it only affects
> EOI-induced VM-exits in specific/suboptimal configurations.
>
> But I'm wondering whether KVM should update SVI on every VM-entry instead of
> doing it on-demand (i.e., when vISR gets changed). We've encountered two
> SVI-related bugs [1][2] that were difficult to debug. Preventing these issues
> entirely seems worthwhile, and the overhead of always updating SVI during
> VM-entry should be minimal since KVM already updates RVI (RVI and SVI are in
> the same VMCS field) in vmx_sync_irr_to_pir() when APICv is enabled.
Hmm. At first glance, I _really_ like this idea, but I'm leaning fairly strongly
towards keeping .hwapic_isr_update().
While small (~28 cycles on EMR), the runtime cost isn't zero, and it affects the
fastpath. And the number of useful updates is comically small. E.g. without a
nested VM, AFAICT they basically never happen post-boot. Even when running nested
VMs, the number of useful updates when running L1 hovers around ~0.001%.
More importantly, KVM will carry most of the complexity related to vISR updates
regardless of how KVM handles SVI because of the ISR caching for non-APICv
systems. So while I acknowledge that we've had some nasty bugs and 100% agree
that squashing them entirely is _very_ enticing, I think those bugs were due to
what were effectively two systemic flaws in KVM: (1) not aligning SVI with KVM's
ISR caching code, and (2) the whole "defer updates to nested VM-Exit" mess.
At the end of this series, both (1) and (2) are "solved". Huh. And now that I
look at (1) again, the last patch is wrong (benignly wrong, but still wrong).
The changelog says this:

  First, it adds a call during kvm_lapic_reset(), but that's a glorified nop as
  the ISR has already been zeroed.

but that's simply not true. There's already a call in kvm_lapic_reset(). So
that patch can be amended with:
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 7be4d759884c..55a7a2be3a2e 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2907,10 +2907,8 @@ void kvm_lapic_reset(struct kvm_vcpu *vcpu, bool init_event)
 	vcpu->arch.pv_eoi.msr_val = 0;
 	apic_update_ppr(apic);
-	if (apic->apicv_active) {
+	if (apic->apicv_active)
 		kvm_x86_call(apicv_post_state_restore)(vcpu);
-		kvm_x86_call(hwapic_isr_update)(vcpu, -1);
-	}
 	vcpu->arch.apic_arb_prio = 0;
 	vcpu->arch.apic_attention = 0;
At which point updates to highest_isr_cache and .hwapic_isr_update() are fully
symmetrical (ignoring that KVM simply invalidates highest_isr_cache instead of
scanning the vISR on EOI and APICv changes).
So yeah, the more I look at all of this, the more I'm in favor of keeping
.hwapic_isr_update(), e.g. if only to let it serve as a canary for finding issues
related to highest_isr_cache and/or isr_count.
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index ef8d29c677b9..e7883bf7665f 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -6957,45 +6957,20 @@ void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
> read_unlock(&vcpu->kvm->mmu_lock);
> }
>
> -void vmx_hwapic_isr_update(struct kvm_vcpu *vcpu, int max_isr)
> +static void vmx_set_rvi_svi(int rvi, int svi)
If this ever goes anywhere, my vote would be to call this vmx_sync_guest_intr_status(),
and pass in only @rvi, e.g.
static void vmx_sync_guest_intr_status(struct kvm_vcpu *vcpu, int rvi)
{
	int svi = kvm_lapic_find_highest_isr(vcpu);
	u16 status, new;

	...
}
> status = vmcs_read16(GUEST_INTR_STATUS);
> + new = (rvi & 0xff) | ((u8)svi << 8);
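My first reaction was that this is technically undefined behavior due to a shift
larger than the type (casting to an 8-bit value and then shifting by 8), but the
u8 is promoted to int before the shift, so it should be well-defined. svi[31:8]
should always be '0', but to be explicit (and paranoid) we could do: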
new = (rvi & 0xff) | ((svi & 0xff) << 8);
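
For anyone who wants to compare the two packing expressions outside the kernel,
here is a minimal user-space sketch. The helper names are hypothetical and exist
only for illustration, and uint8_t/uint16_t stand in for the kernel's u8/u16. As
far as I can tell, C's integer promotions convert the 8-bit operand to int before
the shift, so the cast form and the mask form should produce the same value; the
mask just makes the intent explicit.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: RVI in bits 7:0, SVI in bits 15:8, using an explicit
 * mask on both inputs.
 */
static inline uint16_t pack_intr_status(int rvi, int svi)
{
	return (uint16_t)((rvi & 0xff) | ((svi & 0xff) << 8));
}

/*
 * Hypothetical helper: same layout, but truncating SVI with a cast.  The
 * uint8_t value is promoted to int before the shift is applied.
 */
static inline uint16_t pack_intr_status_cast(int rvi, int svi)
{
	return (uint16_t)((rvi & 0xff) | ((uint8_t)svi << 8));
}

/* Spot-check that both forms agree, including on out-of-range inputs. */
static inline void pack_intr_status_selftest(void)
{
	assert(pack_intr_status(0x31, 0x51) == 0x5131);
	assert(pack_intr_status_cast(0x31, 0x51) == 0x5131);
	assert(pack_intr_status(0x1ff, 0x2aa) == pack_intr_status_cast(0x1ff, 0x2aa));
}
```

Calling pack_intr_status_selftest() from a test main() is enough to exercise
both forms.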
> + if (new != status)
> + vmcs_write16(GUEST_INTR_STATUS, new);
> }
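
To illustrate the read/compare/write pattern in the quoted hunk, a rough
user-space mock follows. Everything here is an assumption for illustration:
struct fake_vmcs and fake_sync_guest_intr_status() are hypothetical stand-ins,
not KVM code, and a plain struct member replaces vmcs_read16()/vmcs_write16().
Clamping a negative SVI to 0 mirrors the existing "if (max_isr == -1)" handling
in vmx_hwapic_isr_update(); whether a real helper would need it is not settled
in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the GUEST_INTR_STATUS field, plus a write counter. */
struct fake_vmcs {
	uint16_t guest_intr_status;
	unsigned int nr_writes;
};

/* Write the combined RVI/SVI value only if it differs from the current one. */
static inline void fake_sync_guest_intr_status(struct fake_vmcs *vmcs,
					       int rvi, int svi)
{
	uint16_t status = vmcs->guest_intr_status;
	uint16_t val;

	/* Assumption: "no ISR bit set" (-1) is treated as SVI == 0. */
	if (svi < 0)
		svi = 0;

	val = (uint16_t)((rvi & 0xff) | ((svi & 0xff) << 8));
	if (val != status) {
		vmcs->guest_intr_status = val;
		vmcs->nr_writes++;
	}
}

/* Small demo: a repeated sync with unchanged inputs does not write again. */
static inline void fake_sync_demo(void)
{
	struct fake_vmcs vmcs = { 0, 0 };

	fake_sync_guest_intr_status(&vmcs, 0x20, 0x30);
	fake_sync_guest_intr_status(&vmcs, 0x20, 0x30);
	assert(vmcs.guest_intr_status == 0x3020);
	assert(vmcs.nr_writes == 1);
}
```

The write counter is only there to show that the second call with identical
inputs is skipped, which is the point of the "new != status" check.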