linux-kernel - Re: [RFC PATCH 2/3] KVM: SVM: Fix IRQ window inhibit handling across multiple vCPUs

Message-ID: <aXI1EAolDjVbp_9W@blrnaveerao1>
Date: Thu, 22 Jan 2026 20:19:30 +0530
From: Naveen N Rao <naveen@...nel.org>
To: Sean Christopherson <seanjc@...gle.com>
Cc: Paolo Bonzini <pbonzini@...hat.com>, kvm@...r.kernel.org, 
	linux-kernel@...r.kernel.org, Maxim Levitsky <mlevitsk@...hat.com>, 
	Vasant Hegde <vasant.hegde@....com>, Suravee Suthikulpanit <suravee.suthikulpanit@....com>
Subject: Re: [RFC PATCH 2/3] KVM: SVM: Fix IRQ window inhibit handling across
 multiple vCPUs

On Wed, Jan 14, 2026 at 11:55:57AM -0800, Sean Christopherson wrote:
> Finally mustered up the brainpower to land this series :-)

Yay! :)

> 
> On Fri, Jul 18, 2025, Naveen N Rao (AMD) wrote:
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index f19a76d3ca0e..b781b4f1d304 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1395,6 +1395,10 @@ struct kvm_arch {
> >  	struct kvm_pit *vpit;
> >  #endif
> >  	atomic_t vapics_in_nmi_mode;
> > +
> > +	/* Keep this in a cacheline separate from apicv_update_lock */
> 
> A comment won't suffice.  To isolate what we want to isolate, tag things with
> __aligned().  Ideally we would use __cacheline_aligned_in_smp, but AFAIK that
> can't be used within a struct as it uses .section tags, :-(
> 
> And revisiting your analysis from
> https://lore.kernel.org/all/evszbck4u7afiu7lkafwcu3rs6a7io2zkv53rygrgz544op4ur@m2bugote2wdl:
> 
>  : Also, note that introducing apicv_irq_window after apicv_inhibit_reasons 
>  : is degrading performance in the AVIC disabled case too. So, it is likely 
>  : that some other cacheline below apicv_inhibit_reasons in kvm_arch may 
>  : also be contributing to this.
> 
> I strongly suspect past-you were correct: the problem isn't that apicv_nr_irq_window_req
> is in the same cacheline with apicv_update_lock, it's that apicv_nr_irq_window_req
> landed in the same cacheline as _other_ stuff.
> 
> Looking at the struct layout from kvm-x86-next-2025.01.14, putting apicv_irq_window
> after apicv_inhibit_reasons _did_ put it on a separate cacheline from
> apicv_update_lock:

I suppose you meant kvm-x86-next-2026.01.14 (2026 and not 2025). I'm 
fairly certain that when I tested this, all three of apicv_update_lock, 
apicv_inhibit_reasons and the irq_window count were ending up in the 
same cacheline. I specifically tested moving each of those out to a 
separate cacheline (including apicv_inhibit_reasons), but as far as I 
remember, the only time I noticed a difference was when moving the 
irq_window count elsewhere.

> 
> 	/* --- cacheline 517 boundary (33088 bytes) was 24 bytes ago --- */
> 	struct kvm_apic_map *      apic_map;             /* 33112     8 */
> 	atomic_t                   apic_map_dirty;       /* 33120     4 */
> 	bool                       apic_access_memslot_enabled; /* 33124     1 */
> 	bool                       apic_access_memslot_inhibited; /* 33125     1 */
> 
> 	/* XXX 2 bytes hole, try to pack */
> 
> 	struct rw_semaphore        apicv_update_lock;    /* 33128   152 */
> 
> 	/* XXX last struct has 1 hole */
> 
> 	/* --- cacheline 520 boundary (33280 bytes) --- */
> 	unsigned long              apicv_inhibit_reasons; /* 33280     8 */
> 	atomic_t                   apicv_irq_window;     /* 33288     4 */
> 
> 	/* XXX 4 bytes hole, try to pack */
> 
> 	gpa_t                      wall_clock;           /* 33296     8 */
> 	bool                       mwait_in_guest;       /* 33304     1 */
> 	bool                       hlt_in_guest;         /* 33305     1 */
> 	bool                       pause_in_guest;       /* 33306     1 */
> 	bool                       cstate_in_guest;      /* 33307     1 */
> 
> 	/* XXX 4 bytes hole, try to pack */
> 
> 	unsigned long              irq_sources_bitmap;   /* 33312     8 */
> 	s64                        kvmclock_offset;      /* 33320     8 */
> 	raw_spinlock_t             tsc_write_lock;       /* 33328    64 */
> 	/* --- cacheline 521 boundary (33344 bytes) was 48 bytes ago --- */
> 
> 
> Which fits with my reaction that the irq_window counter being in the same cacheline
> as apicv_update_lock shouldn't be problematic, because the counter is only ever
> written while holding the lock.  I.e. the counter is written only when the lock
> cacheline is likely already pulled in in an exclusive state.

Indeed.

> 
> What appears to be problematic is that the counter is in the same cacheline as
> several relatively hot read-mostly fields:
> 
>   apicv_inhibit_reasons - read by every vCPU on every VM-Enter
>   xxx_in_guest (now disabled_exits) - read on page faults, if a vCPU takes a
>     PAUSE exit, if a vCPU is scheduled out, etc.
>   kvmclock_offset - read every time a vCPU needs to refresh kvmclock
> 
> So I actually think we want apicv_update_lock and apicv_nr_irq_window_req to
> _share_ a cacheline, and then isolate that cacheline from everything else.  Because
> those two fields are effectively write-mostly, whereas most things in kvm-arch are
> read-mostly.  I.e. end up with this:
> 
> 	/*
> 	 * Protects apicv_inhibit_reasons and apicv_nr_irq_window_req (with an
> 	 * asterisk, see kvm_inc_or_dec_irq_window_inhibit() for details).
> 	 *
> 	 * Force apicv_update_lock and apicv_nr_irq_window_req to reside in a
> 	 * dedicated cacheline.  They are write-mostly, whereas most everything
> 	 * else in kvm_arch is read-mostly.
> 	 */
> 	struct rw_semaphore apicv_update_lock __aligned(L1_CACHE_BYTES);
> 	atomic_t apicv_nr_irq_window_req;
> 
> 	/*
> 	 * As above, isolate apicv_update_lock and apicv_nr_irq_window_req on
> 	 * their own cacheline.  Note that apicv_inhibit_reasons is read-mostly
> 	 * even though it's protected by apicv_update_lock (toggling VM-wide
> 	 * inhibits is rare; _checking_ for inhibits is common).
> 	 */
> 	unsigned long apicv_inhibit_reasons __aligned(L1_CACHE_BYTES);

Nice, isolating those in a separate cacheline looks to be helping.

> 
> I also want to land the optimization separately, so that it can be properly
> documented, justified, and analyzed by others.
> 
> I pushed a rebased version (compile-tested only at this time) with the above change to:
> 
>   https://github.com/sean-jc/linux.git svm/avic_irq_window
> 
> Can you run your perf tests to see if that approach also eliminates the degradation
> relative to avic=0 that you observed?

Yes, this definitely seems to be helping get rid of that odd performance 
drop I was seeing earlier. I'll run a couple more tests and report back 
by next week if I see anything off. Otherwise, this is looking good to 
me and if you want to apply this to -next, I'm fine with that:
Tested-by: Naveen N Rao (AMD) <naveen@...nel.org>


Thanks for all your help with this (and thanks to Paolo as well)!


- Naveen