Message-ID: <aXI1EAolDjVbp_9W@blrnaveerao1>
Date: Thu, 22 Jan 2026 20:19:30 +0530
From: Naveen N Rao <naveen@...nel.org>
To: Sean Christopherson <seanjc@...gle.com>
Cc: Paolo Bonzini <pbonzini@...hat.com>, kvm@...r.kernel.org,
linux-kernel@...r.kernel.org, Maxim Levitsky <mlevitsk@...hat.com>,
Vasant Hegde <vasant.hegde@....com>, Suravee Suthikulpanit <suravee.suthikulpanit@....com>
Subject: Re: [RFC PATCH 2/3] KVM: SVM: Fix IRQ window inhibit handling across
multiple vCPUs
On Wed, Jan 14, 2026 at 11:55:57AM -0800, Sean Christopherson wrote:
> Finally mustered up the brainpower to land this series :-)
Yay! :)
>
> On Fri, Jul 18, 2025, Naveen N Rao (AMD) wrote:
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index f19a76d3ca0e..b781b4f1d304 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1395,6 +1395,10 @@ struct kvm_arch {
> > struct kvm_pit *vpit;
> > #endif
> > atomic_t vapics_in_nmi_mode;
> > +
> > + /* Keep this in a cacheline separate from apicv_update_lock */
>
> A comment won't suffice. To isolate what we want to isolate, tag things with
> __aligned(). Ideally we would use __cacheline_aligned_in_smp, but AFAIK that
> can't be used within a struct as it uses .section tags, :-(
>
> And revisiting your analysis from
> https://lore.kernel.org/all/evszbck4u7afiu7lkafwcu3rs6a7io2zkv53rygrgz544op4ur@m2bugote2wdl:
>
> : Also, note that introducing apicv_irq_window after apicv_inhibit_reasons
> : is degrading performance in the AVIC disabled case too. So, it is likely
> : that some other cacheline below apicv_inhibit_reasons in kvm_arch may
> : also be contributing to this.
>
> I strongly suspect past-you were correct: the problem isn't that apicv_nr_irq_window_req
> is in the same cacheline with apicv_update_lock, it's that apicv_nr_irq_window_req
> landed in the same cacheline as _other_ stuff.
>
> Looking at the struct layout from kvm-x86-next-2025.01.14, putting apicv_irq_window
> after apicv_inhibit_reasons _did_ put it on a separate cacheline from
> apicv_update_lock:
I suppose you meant kvm-x86-next-2026.01.14 (2026 and not 2025). I'm
fairly certain that when I tested this, all three of apicv_update_lock,
apicv_inhibit_reasons and the irq_window count were ending up in the
same cacheline. I specifically tested moving each of those out to a
separate cacheline (including apicv_inhibit_reasons), but as far as I
remember, the only time I noticed a difference was when moving the
irq_window count elsewhere.
>
> /* --- cacheline 517 boundary (33088 bytes) was 24 bytes ago --- */
> struct kvm_apic_map * apic_map; /* 33112 8 */
> atomic_t apic_map_dirty; /* 33120 4 */
> bool apic_access_memslot_enabled; /* 33124 1 */
> bool apic_access_memslot_inhibited; /* 33125 1 */
>
> /* XXX 2 bytes hole, try to pack */
>
> struct rw_semaphore apicv_update_lock; /* 33128 152 */
>
> /* XXX last struct has 1 hole */
>
> /* --- cacheline 520 boundary (33280 bytes) --- */
> unsigned long apicv_inhibit_reasons; /* 33280 8 */
> atomic_t apicv_irq_window; /* 33288 4 */
>
> /* XXX 4 bytes hole, try to pack */
>
> gpa_t wall_clock; /* 33296 8 */
> bool mwait_in_guest; /* 33304 1 */
> bool hlt_in_guest; /* 33305 1 */
> bool pause_in_guest; /* 33306 1 */
> bool cstate_in_guest; /* 33307 1 */
>
> /* XXX 4 bytes hole, try to pack */
>
> unsigned long irq_sources_bitmap; /* 33312 8 */
> s64 kvmclock_offset; /* 33320 8 */
> raw_spinlock_t tsc_write_lock; /* 33328 64 */
> /* --- cacheline 521 boundary (33344 bytes) was 48 bytes ago --- */
>
>
> Which fits with my reaction that the irq_window counter being in the same cacheline
> as apicv_update_lock shouldn't be problematic, because the counter is only ever
> written while holding the lock. I.e. the counter is written only when the lock
> cacheline is likely already pulled in in an exclusive state.
Indeed.
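For reference, the cacheline indices in the pahole output above can be
re-derived from the byte offsets. A minimal sketch, assuming 64-byte
cachelines (which is what the boundary comments imply); the pahole_*
helpers and OFF_*/SZ_* names are made up for illustration and are not
kernel APIs:

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 64-byte cachelines, as implied by the pahole boundaries. */
#define PAHOLE_CACHE_BYTES 64

/* Offsets and sizes copied from the quoted pahole output. */
enum {
	OFF_APICV_UPDATE_LOCK     = 33128,
	SZ_APICV_UPDATE_LOCK      = 152,
	OFF_APICV_INHIBIT_REASONS = 33280,
	OFF_APICV_IRQ_WINDOW      = 33288,
	OFF_KVMCLOCK_OFFSET       = 33320,
};

/* Index of the cacheline that contains the byte at @offset. */
static inline size_t pahole_cacheline_of(size_t offset)
{
	return offset / PAHOLE_CACHE_BYTES;
}

/* Nonzero if the bytes at offsets @a and @b land in the same cacheline. */
static inline int pahole_same_cacheline(size_t a, size_t b)
{
	return pahole_cacheline_of(a) == pahole_cacheline_of(b);
}
```

With those numbers, apicv_update_lock covers cachelines 517 through 519,
while apicv_irq_window sits in cacheline 520 together with
apicv_inhibit_reasons and kvmclock_offset.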
>
> What appears to be problematic is that the counter is in the same cacheline as
> several relatively hot read-mostly fields:
>
> apicv_inhibit_reasons - read by every vCPU on every VM-Enter
> xxx_in_guest (now disabled_exits) - read on page faults, if a vCPU
> takes a PAUSE exit, if a vCPU is scheduled out, etc.
> kvmclock_offset - read every time a vCPU needs to refresh kvmclock
>
> So I actually think we want apicv_update_lock and apicv_nr_irq_window_req to
> _share_ a cacheline, and then isolate that cacheline from everything else. Because
> those two fields are effectively write-mostly, whereas most things in kvm_arch are
> read-mostly. I.e. end up with this:
>
> /*
> * Protects apicv_inhibit_reasons and apicv_nr_irq_window_req (with an
> * asterisk, see kvm_inc_or_dec_irq_window_inhibit() for details).
> *
> * Force apicv_update_lock and apicv_nr_irq_window_req to reside in a
> * dedicated cacheline. They are write-mostly, whereas most everything
> * else in kvm_arch is read-mostly.
> */
> struct rw_semaphore apicv_update_lock __aligned(L1_CACHE_BYTES);
> atomic_t apicv_nr_irq_window_req;
>
> /*
> * As above, isolate apicv_update_lock and apicv_nr_irq_window_req on
> * their own cacheline. Note that apicv_inhibit_reasons is read-mostly
> * even though it's protected by apicv_update_lock (toggling VM-wide
> * inhibits is rare; _checking_ for inhibits is common).
> */
> unsigned long apicv_inhibit_reasons __aligned(L1_CACHE_BYTES);
Nice, isolating those in a separate cacheline looks to be helping.
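To illustrate the resulting layout, here is a rough userspace sketch of
what the two __aligned() annotations do. struct demo_arch, struct
demo_lock, the field names, DEMO_CACHE_BYTES and demo_cacheline_of() are
all hypothetical stand-ins, not the real kvm_arch. The 152-byte blob only
mirrors the rw_semaphore size from the pahole output, and 64 bytes is
assumed in place of L1_CACHE_BYTES:

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 64 bytes, standing in for the kernel's L1_CACHE_BYTES. */
#define DEMO_CACHE_BYTES 64

/* Opaque stand-in for struct rw_semaphore (152 bytes per pahole). */
struct demo_lock {
	char opaque[152];
};

/* Hypothetical slice of a kvm_arch-like struct. */
struct demo_arch {
	int apic_map_dirty;		/* read-mostly neighbour */

	/* Write-mostly pair: the lock is forced to start a new cacheline. */
	struct demo_lock update_lock __attribute__((aligned(DEMO_CACHE_BYTES)));
	int nr_irq_window_req;

	/* Read-mostly again: forced onto the next cacheline boundary. */
	unsigned long inhibit_reasons __attribute__((aligned(DEMO_CACHE_BYTES)));
	unsigned long wall_clock;
};

/* Index of the cacheline that contains the byte at @offset. */
static inline size_t demo_cacheline_of(size_t offset)
{
	return offset / DEMO_CACHE_BYTES;
}

/* Compile-time checks that both annotated fields start a cacheline. */
static_assert(offsetof(struct demo_arch, update_lock) % DEMO_CACHE_BYTES == 0,
	      "update_lock must start a cacheline");
static_assert(offsetof(struct demo_arch, inhibit_reasons) % DEMO_CACHE_BYTES == 0,
	      "inhibit_reasons must start a cacheline");
```

In this sketch the counter ends up in the cacheline that holds the tail
of the lock (the lock is larger than one line), and nothing read-mostly
shares any of the lines touched by the lock or the counter.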
>
> I also want to land the optimization separately, so that it can be properly
> documented, justified, and analyzed by others.
>
> I pushed a rebased version (compile-tested only at this time) with the above change to:
>
> https://github.com/sean-jc/linux.git svm/avic_irq_window
>
> Can you run your perf tests to see if that approach also eliminates the degradation
> relative to avic=0 that you observed?
Yes, this definitely seems to be helping get rid of that odd performance
drop I was seeing earlier. I'll run a couple more tests and report back
by next week if I see anything off. Otherwise, this is looking good to
me and if you want to apply this to -next, I'm fine with that:
Tested-by: Naveen N Rao (AMD) <naveen@...nel.org>
Thanks to you (and Paolo) for all your help with this!
- Naveen