Message-ID: <604c0d57-ed91-44d2-80d7-4d3710b04142@redhat.com>
Date: Wed, 5 Feb 2025 12:36:21 +0100
From: Paolo Bonzini <pbonzini@...hat.com>
To: Sean Christopherson <seanjc@...gle.com>,
 Maxim Levitsky <mlevitsk@...hat.com>
Cc: "Naveen N Rao (AMD)" <naveen@...nel.org>, kvm@...r.kernel.org,
 linux-kernel@...r.kernel.org,
 Suravee Suthikulpanit <suravee.suthikulpanit@....com>,
 Vasant Hegde <vasant.hegde@....com>, Vitaly Kuznetsov <vkuznets@...hat.com>
Subject: Re: [PATCH 3/3] KVM: x86: Decouple APICv activation state from
 apicv_inhibit_reasons

On 2/4/25 20:18, Sean Christopherson wrote:
> On Mon, Feb 03, 2025, Maxim Levitsky wrote:
>> On Mon, 2025-02-03 at 15:46 -0800, Sean Christopherson wrote:
>>> On Mon, Feb 03, 2025, Paolo Bonzini wrote:
>>>> On 2/3/25 19:45, Sean Christopherson wrote:
>>>>> Unless there's a very, very good reason to support a use case that generates
>>>>> ExtInts during boot, but _only_ during boot, and otherwise doesn't have any APICv
>>>>> inhibits, I'm leaning towards making SVM's IRQ window inhibit sticky, i.e. never
>>>>> clear it.
>>>>
>>>> BIOS tends to use PIT, so that may be too much.  With respect to Naveen's report
>>>> of contention on apicv_update_lock, I would go with the sticky-bit idea but apply
>>>> it to APICV_INHIBIT_REASON_PIT_REINJ.
>>>
>>> That won't work, at least not without yet more changes, because KVM creates the
>>> in-kernel PIT with reinjection enabled by default.  The sticky-bit idea is that
>>> if a bit is set and can never be cleared, then there's no need to track new
>>> updates.  Since userspace needs to explicitly disable reinjection, the inhibit
>>> can't be sticky.
>> I confirmed this with a trace; this is indeed the case.
>>
>>>
>>> I assume we could fudge around that easily enough by deferring the inhibit until
>>> a vCPU is created (or run?), but piggybacking PIT_REINJ won't help the userspace
>>> I/O APIC case.
>>>
>>>> I don't love adding another inhibit reason but, together, these two should
>>>> remove the contention on apicv_update_lock.  Another idea could be to move
>>>> IRQWIN to a per-vCPU reason, but Maxim tells me that it's not so easy.
>>
>> I retract this statement; it was based on my knowledge from back when I
>> implemented it.
>>
>> Looking at the current code again, this should be possible and can be a nice
>> cleanup regardless.
>>
>> (Or I just might have forgotten the reason that made me think back then that
>> this is not worth it, because I do remember well that I wanted to make the
>> IRQWIN inhibit per-vCPU.)
> 
> The complication is the APIC page.  That's not a problem for vCPUs running in L2
> because they'll use a different MMU, i.e. a different set of SPTEs that never map
> the APIC backing page.  At least, that's how it's supposed to work[*].  ;-)
> 
> For IRQWIN, turning off APICv for the current vCPU will leave the APIC SPTEs in
> place and so KVM will fail to intercept accesses to the APIC page.  And making
> IRQWIN a per-vCPU inhibit won't help performance in the case where there is no
> other inhibit, because (a) toggling it on/off requires taking mmu_lock for writing
> and doing a remote TLB flush, and (b) unless the guest is doing something bizarre,
> only one vCPU will be receiving ExtInt IRQs.  I.e. I don't think trying to make
> IRQWIN a pure per-vCPU inhibit is necessary for performance.
> 
> After fiddling with a bunch of ideas, I think the best approach to address both
> issues is to add a counter for the IRQ window (addresses the per-vCPU aspect of
> IRQ windows), set/clear the IRQWIN inhibit according to the counter when *any*
> inhibit changes, and then force an immediate update if and only if the count hits
> a 0<=>1 transition *and* there is no other inhibit.  That would address the flaw
> Naveen found without needing to make PIT_REINJ sticky.
> 
> Guarding the count with apicv_update_lock held for read ensures that if there is
> a racing change to a different inhibit, that either kvm_inc_or_dec_irq_window_inhibit()
> will see no inhibits and go down the slow path, or __kvm_set_or_clear_apicv_inhibit()
> will set IRQWIN accordingly.
> 
> Compile tested only (and probably needs to be split into multiple patches).  I'll
> try to take it for a spin later today.
> 
> [*] https://lore.kernel.org/all/20250130010825.220346-1-seanjc@google.com
> 
> ---
>   arch/x86/include/asm/kvm_host.h | 13 ++++++++++
>   arch/x86/kvm/svm/svm.c          | 43 +++++++++------------------------
>   arch/x86/kvm/svm/svm.h          |  1 +
>   arch/x86/kvm/x86.c              | 36 ++++++++++++++++++++++++++-
>   4 files changed, 61 insertions(+), 32 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 5193c3dfbce1..9e3465e70a0a 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1365,6 +1365,7 @@ struct kvm_arch {
>   	/* Protects apicv_inhibit_reasons */
>   	struct rw_semaphore apicv_update_lock;
>   	unsigned long apicv_inhibit_reasons;
> +	atomic_t apicv_irq_window;
>   
>   	gpa_t wall_clock;
>   
> @@ -2203,6 +2204,18 @@ static inline void kvm_clear_apicv_inhibit(struct kvm *kvm,
>   	kvm_set_or_clear_apicv_inhibit(kvm, reason, false);
>   }
>   
> +void kvm_inc_or_dec_irq_window_inhibit(struct kvm *kvm, bool inc);
> +
> +static inline void kvm_inc_apicv_irq_window(struct kvm *kvm)
> +{
> +	kvm_inc_or_dec_irq_window_inhibit(kvm, true);
> +}
> +
> +static inline void kvm_dec_apicv_irq_window(struct kvm *kvm)
> +{
> +	kvm_inc_or_dec_irq_window_inhibit(kvm, false);
> +}
> +
>   unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
>   				      unsigned long a0, unsigned long a1,
>   				      unsigned long a2, unsigned long a3,
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 7640a84e554a..668db3bfff3d 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -1636,9 +1636,13 @@ static void svm_set_vintr(struct vcpu_svm *svm)
>   	struct vmcb_control_area *control;
>   
>   	/*
> -	 * The following fields are ignored when AVIC is enabled
> +	 * vIRQ is ignored by hardware when AVIC is enabled, and so AVIC must be
> +	 * inhibited to detect the interrupt window.
>   	 */
> -	WARN_ON(kvm_vcpu_apicv_activated(&svm->vcpu));
> +	if (enable_apicv && !is_guest_mode(&svm->vcpu)) {
> +		svm->avic_irq_window = true;
> +		kvm_inc_apicv_irq_window(svm->vcpu.kvm);
> +	}
>   
>   	svm_set_intercept(svm, INTERCEPT_VINTR);
>   
> @@ -1666,6 +1670,11 @@ static void svm_set_vintr(struct vcpu_svm *svm)
>   
>   static void svm_clear_vintr(struct vcpu_svm *svm)
>   {
> +	if (svm->avic_irq_window && !is_guest_mode(&svm->vcpu)) {
> +		svm->avic_irq_window = false;
> +		kvm_dec_apicv_irq_window(svm->vcpu.kvm);
> +	}
> +
>   	svm_clr_intercept(svm, INTERCEPT_VINTR);
>   
>   	/* Drop int_ctl fields related to VINTR injection.  */
> @@ -3219,20 +3228,6 @@ static int interrupt_window_interception(struct kvm_vcpu *vcpu)
>   	kvm_make_request(KVM_REQ_EVENT, vcpu);
>   	svm_clear_vintr(to_svm(vcpu));
>   
> -	/*
> -	 * If not running nested, for AVIC, the only reason to end up here is ExtINTs.
> -	 * In this case AVIC was temporarily disabled for
> -	 * requesting the IRQ window and we have to re-enable it.
> -	 *
> -	 * If running nested, still remove the VM wide AVIC inhibit to
> -	 * support case in which the interrupt window was requested when the
> -	 * vCPU was not running nested.
> -
> -	 * All vCPUs which run still run nested, will remain to have their
> -	 * AVIC still inhibited due to per-cpu AVIC inhibition.
> -	 */
> -	kvm_clear_apicv_inhibit(vcpu->kvm, APICV_INHIBIT_REASON_IRQWIN);
> -
>   	++vcpu->stat.irq_window_exits;
>   	return 1;
>   }
> @@ -3879,22 +3874,8 @@ static void svm_enable_irq_window(struct kvm_vcpu *vcpu)
>   	 * enabled, the STGI interception will not occur. Enable the irq
>   	 * window under the assumption that the hardware will set the GIF.
>   	 */
> -	if (vgif || gif_set(svm)) {
> -		/*
> -		 * IRQ window is not needed when AVIC is enabled,
> -		 * unless we have pending ExtINT since it cannot be injected
> -		 * via AVIC. In such case, KVM needs to temporarily disable AVIC,
> -		 * and fallback to injecting IRQ via V_IRQ.
> -		 *
> -		 * If running nested, AVIC is already locally inhibited
> -		 * on this vCPU, therefore there is no need to request
> -		 * the VM wide AVIC inhibition.
> -		 */
> -		if (!is_guest_mode(vcpu))
> -			kvm_set_apicv_inhibit(vcpu->kvm, APICV_INHIBIT_REASON_IRQWIN);
> -
> +	if (vgif || gif_set(svm))
>   		svm_set_vintr(svm);
> -	}
>   }
>   
>   static void svm_enable_nmi_window(struct kvm_vcpu *vcpu)
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index 9d7cdb8fbf87..8eefed0a865a 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -323,6 +323,7 @@ struct vcpu_svm {
>   
>   	bool guest_state_loaded;
>   
> +	bool avic_irq_window;
>   	bool x2avic_msrs_intercepted;
>   
>   	/* Guest GIF value, used when vGIF is not enabled */
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index b2d9a16fd4d3..7388f4cfe468 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10604,7 +10604,11 @@ void __kvm_set_or_clear_apicv_inhibit(struct kvm *kvm,
>   
>   	old = new = kvm->arch.apicv_inhibit_reasons;
>   
> -	set_or_clear_apicv_inhibit(&new, reason, set);
> +	if (reason != APICV_INHIBIT_REASON_IRQWIN)
> +		set_or_clear_apicv_inhibit(&new, reason, set);
> +
> +	set_or_clear_apicv_inhibit(&new, APICV_INHIBIT_REASON_IRQWIN,
> +				   atomic_read(&kvm->arch.apicv_irq_window));
>   
>   	if (!!old != !!new) {
>   		/*
> @@ -10645,6 +10649,36 @@ void kvm_set_or_clear_apicv_inhibit(struct kvm *kvm,
>   }
>   EXPORT_SYMBOL_GPL(kvm_set_or_clear_apicv_inhibit);
>   
> +void kvm_inc_or_dec_irq_window_inhibit(struct kvm *kvm, bool inc)
> +{
> +	bool toggle;
> +
> +	/*
> +	 * The IRQ window inhibit has a cyclical dependency of sorts, as KVM
> +	 * needs to manually inject IRQs and thus detect interrupt windows if
> +	 * APICv is disabled/inhibited.  To avoid thrashing if the IRQ window
> +	 * is being requested because APICv is already inhibited, toggle the
> +	 * actual inhibit (and take the lock for write) if and only if there's
> +	 * no other inhibit.  KVM evaluates the IRQ window count when _any_
> +	 * inhibit changes, i.e. the IRQ window inhibit can be lazily updated
> +	 * on the next inhibit change (if one ever occurs).
> +	 */
> +	down_read(&kvm->arch.apicv_update_lock);
> +
> +	if (inc)
> +		toggle = atomic_inc_return(&kvm->arch.apicv_irq_window) == 1;
> +	else
> +		toggle = atomic_dec_return(&kvm->arch.apicv_irq_window) == 0;
> +
> +	toggle = toggle && !(kvm->arch.apicv_inhibit_reasons & ~BIT(APICV_INHIBIT_REASON_IRQWIN));
> +
> +	up_read(&kvm->arch.apicv_update_lock);
> +
> +	if (toggle)
> +		kvm_set_or_clear_apicv_inhibit(kvm, APICV_INHIBIT_REASON_IRQWIN, inc);

I'm not super confident about breaking the critical section...  Another possibility:

void kvm_inc_or_dec_irq_window_inhibit(struct kvm *kvm, bool inc)
{
	int add = inc ? 1 : -1;

	if (!enable_apicv)
		return;

	/*
	 * IRQ windows happen either because of ExtINT injections, or because
	 * APICv is already disabled/inhibited for another reason.  While
	 * ExtINT injections are rare and should not happen while the vCPU is
	 * running its actual workload, it's worth avoiding thrashing if the
	 * IRQ window is being requested because APICv is already inhibited.
	 * So, toggle the actual inhibit (which requires taking the lock for
	 * write) if and only if there's no other inhibit.
	 * kvm_set_or_clear_apicv_inhibit() always evaluates the IRQ window
	 * count; thus the IRQ window inhibit _will_ be lazily updated on the
	 * next call, if one ever happens.
	 */
	if (READ_ONCE(kvm->arch.apicv_inhibit_reasons) & ~BIT(APICV_INHIBIT_REASON_IRQWIN)) {
		guard(rwsem_read)(&kvm->arch.apicv_update_lock);
		if (READ_ONCE(kvm->arch.apicv_inhibit_reasons) & ~BIT(APICV_INHIBIT_REASON_IRQWIN)) {
			atomic_add(add, &kvm->arch.apicv_irq_window);
			return;
		}
	}

	/*
	 * Strictly speaking the lock is only needed if going 0->1 or 1->0,
	 * a la atomic_dec_and_mutex_lock.  However, ExtINTs are rare and
	 * only target a single vCPU, so that is the common case; do not
	 * bother eliding the down_write()/up_write() pair.
	 */
	guard(rwsem_write)(&kvm->arch.apicv_update_lock);
	if (atomic_add_return(add, &kvm->arch.apicv_irq_window) == inc)
		__kvm_set_or_clear_apicv_inhibit(kvm, APICV_INHIBIT_REASON_IRQWIN, inc);
}
EXPORT_SYMBOL_GPL(kvm_inc_or_dec_irq_window_inhibit);
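
Aside, in case the scope-based locking guards are unfamiliar: guard(rwsem_read)(...)
and guard(rwsem_write)(...) take the rw_semaphore and release it automatically when
the enclosing scope is exited, so the early return inside the read-locked block above
does drop the lock.  As a rough open-coded sketch (same variables as in the function,
illustration only), the read-locked fast path is equivalent to:

	if (READ_ONCE(kvm->arch.apicv_inhibit_reasons) & ~BIT(APICV_INHIBIT_REASON_IRQWIN)) {
		down_read(&kvm->arch.apicv_update_lock);
		/* Re-check under the lock in case the last other inhibit was just cleared. */
		if (READ_ONCE(kvm->arch.apicv_inhibit_reasons) & ~BIT(APICV_INHIBIT_REASON_IRQWIN)) {
			atomic_add(add, &kvm->arch.apicv_irq_window);
			up_read(&kvm->arch.apicv_update_lock);
			return;
		}
		up_read(&kvm->arch.apicv_update_lock);
	}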

Paolo

