Message-ID: <604c0d57-ed91-44d2-80d7-4d3710b04142@redhat.com>
Date: Wed, 5 Feb 2025 12:36:21 +0100
From: Paolo Bonzini <pbonzini@...hat.com>
To: Sean Christopherson <seanjc@...gle.com>,
Maxim Levitsky <mlevitsk@...hat.com>
Cc: "Naveen N Rao (AMD)" <naveen@...nel.org>, kvm@...r.kernel.org,
linux-kernel@...r.kernel.org,
Suravee Suthikulpanit <suravee.suthikulpanit@....com>,
Vasant Hegde <vasant.hegde@....com>, Vitaly Kuznetsov <vkuznets@...hat.com>
Subject: Re: [PATCH 3/3] KVM: x86: Decouple APICv activation state from
 apicv_inhibit_reasons

On 2/4/25 20:18, Sean Christopherson wrote:
> On Mon, Feb 03, 2025, Maxim Levitsky wrote:
>> On Mon, 2025-02-03 at 15:46 -0800, Sean Christopherson wrote:
>>> On Mon, Feb 03, 2025, Paolo Bonzini wrote:
>>>> On 2/3/25 19:45, Sean Christopherson wrote:
>>>>> Unless there's a very, very good reason to support a use case that generates
>>>>> ExtInts during boot, but _only_ during boot, and otherwise doesn't have any APICv
>>>>> inhibits, I'm leaning towards making SVM's IRQ window inhibit sticky, i.e. never
>>>>> clear it.
>>>>
>>>> BIOS tends to use PIT, so that may be too much. With respect to Naveen's report
>>>> of contention on apicv_update_lock, I would go with the sticky-bit idea but apply
>>>> it to APICV_INHIBIT_REASON_PIT_REINJ.
>>>
>>> That won't work, at least not without yet more changes, because KVM creates
>>> the in-kernel PIT with reinjection enabled by default. The sticky-bit idea is
>>> that if a bit is set and can never be cleared, then there's no need to track
>>> new updates. Since userspace needs to explicitly disable reinjection, the
>>> inhibit can't be sticky.
>> I confirmed this with a trace; this is indeed the case.
>>
>>>
>>> I assume we could fudge around that easily enough by deferring the inhibit
>>> until a vCPU is created (or run?), but piggybacking PIT_REINJ won't help the
>>> userspace I/O APIC case.
>>>
>>>> I don't love adding another inhibit reason but, together, these two should
>>>> remove the contention on apicv_update_lock. Another idea could be to move
>>>> IRQWIN to a per-vCPU reason, but Maxim tells me that it's not so easy.
>>
>> I retract this statement; it was based on my knowledge from back when I
>> implemented it.
>>
>> Looking at the current code again, this should be possible and can be a nice
>> cleanup regardless.
>>
>> (Or I might just have forgotten the reason that made me think back then that
>> it was not worth it, because I do remember well that I wanted to make the
>> IRQWIN inhibit per-vCPU.)
>
> The complication is the APIC page. That's not a problem for vCPUs running in L2
> because they'll use a different MMU, i.e. a different set of SPTEs that never map
> the APIC backing page. At least, that's how it's supposed to work[*]. ;-)
>
> For IRQWIN, turning off APICv for the current vCPU will leave the APIC SPTEs in
> place and so KVM will fail to intercept accesses to the APIC page. And making
> IRQWIN a per-vCPU inhibit won't help performance in the case where there is no
> other inhibit, because (a) toggling it on/off requires taking mmu_lock for writing
> and doing a remote TLB flush, and (b) unless the guest is doing something bizarre,
> only one vCPU will be receiving ExtInt IRQs. I.e. I don't think trying to make
> IRQWIN a pure per-vCPU inhibit is necessary for performance.
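
For reference, the cost being described is what any 0<=>1 inhibit transition
already pays in __kvm_set_or_clear_apicv_inhibit(), roughly the following
(sketched from memory of the upstream function, where "old"/"new" are the
before/after inhibit masks, so details may differ):

	if (!!old != !!new) {
		/* Kick all vCPUs so they re-evaluate APICv activation. */
		kvm_make_all_cpus_request(kvm, KVM_REQ_APICV_UPDATE);
		kvm->arch.apicv_inhibit_reasons = new;

		if (new) {
			unsigned long gfn = gpa_to_gfn(APIC_DEFAULT_PHYS_BASE);
			int idx = srcu_read_lock(&kvm->srcu);

			/*
			 * Zap the APIC access page SPTEs; this takes mmu_lock
			 * for write and flushes remote TLBs, so that accesses
			 * to the APIC page are intercepted again.
			 */
			kvm_zap_gfn_range(kvm, gfn, gfn + 1);
			srcu_read_unlock(&kvm->srcu, idx);
		}
	}

i.e. a VM-wide kick plus an APIC-page zap, which a per-vCPU IRQWIN toggle
would incur on every ExtInt.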
>
> After fiddling with a bunch of ideas, I think the best approach to address both
> issues is to add a counter for the IRQ window (addresses the per-vCPU aspect of
> IRQ windows), set/clear the IRQWIN inhibit according to the counter when *any*
> inhibit changes, and then force an immediate update if and only if the count hits
> a 0<=>1 transition *and* there is no other inhibit. That would address the flaw
> Naveen found without needing to make PIT_REINJ sticky.
>
> Guarding the count with apicv_update_lock held for read ensures that if there is
> a racing change to a different inhibit, either kvm_inc_or_dec_irq_window_inhibit()
> will see no inhibits and go down the slow path, or __kvm_set_or_clear_apicv_inhibit()
> will set IRQWIN accordingly.
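
(Concretely: if one vCPU is clearing, say, PIT_REINJ while another vCPU does a
0<=>1 IRQ window transition, the latter either still observes PIT_REINJ set
under the read lock and only adjusts the count, or the PIT_REINJ update lands
first, in which case the window transition sees no other inhibit and toggles
IRQWIN itself under the lock held for write.)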
>
> Compile tested only (and probably needs to be split into multiple patches). I'll
> try to take it for a spin later today.
>
> [*] https://lore.kernel.org/all/20250130010825.220346-1-seanjc@google.com
>
> ---
>  arch/x86/include/asm/kvm_host.h | 13 ++++++++++
>  arch/x86/kvm/svm/svm.c          | 43 +++++++++------------------------
>  arch/x86/kvm/svm/svm.h          |  1 +
>  arch/x86/kvm/x86.c              | 36 ++++++++++++++++++++++++++-
>  4 files changed, 61 insertions(+), 32 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 5193c3dfbce1..9e3465e70a0a 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1365,6 +1365,7 @@ struct kvm_arch {
>  	/* Protects apicv_inhibit_reasons */
>  	struct rw_semaphore apicv_update_lock;
>  	unsigned long apicv_inhibit_reasons;
> +	atomic_t apicv_irq_window;
>  
>  	gpa_t wall_clock;
>  
> @@ -2203,6 +2204,18 @@ static inline void kvm_clear_apicv_inhibit(struct kvm *kvm,
>  	kvm_set_or_clear_apicv_inhibit(kvm, reason, false);
>  }
>  
> +void kvm_inc_or_dec_irq_window_inhibit(struct kvm *kvm, bool inc);
> +
> +static inline void kvm_inc_apicv_irq_window(struct kvm *kvm)
> +{
> +	kvm_inc_or_dec_irq_window_inhibit(kvm, true);
> +}
> +
> +static inline void kvm_dec_apicv_irq_window(struct kvm *kvm)
> +{
> +	kvm_inc_or_dec_irq_window_inhibit(kvm, false);
> +}
> +
>  unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
>  				      unsigned long a0, unsigned long a1,
>  				      unsigned long a2, unsigned long a3,
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 7640a84e554a..668db3bfff3d 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -1636,9 +1636,13 @@ static void svm_set_vintr(struct vcpu_svm *svm)
>  	struct vmcb_control_area *control;
>  
>  	/*
> -	 * The following fields are ignored when AVIC is enabled
> +	 * vIRQ is ignored by hardware when AVIC is enabled, and so AVIC must
> +	 * be inhibited to detect the interrupt window.
>  	 */
> -	WARN_ON(kvm_vcpu_apicv_activated(&svm->vcpu));
> +	if (enable_apicv && !is_guest_mode(&svm->vcpu)) {
> +		svm->avic_irq_window = true;
> +		kvm_inc_apicv_irq_window(svm->vcpu.kvm);
> +	}
>  
>  	svm_set_intercept(svm, INTERCEPT_VINTR);
>  
> @@ -1666,6 +1670,11 @@ static void svm_set_vintr(struct vcpu_svm *svm)
>  
>  static void svm_clear_vintr(struct vcpu_svm *svm)
>  {
> +	if (svm->avic_irq_window && !is_guest_mode(&svm->vcpu)) {
> +		svm->avic_irq_window = false;
> +		kvm_dec_apicv_irq_window(svm->vcpu.kvm);
> +	}
> +
>  	svm_clr_intercept(svm, INTERCEPT_VINTR);
>  
>  	/* Drop int_ctl fields related to VINTR injection. */
> @@ -3219,20 +3228,6 @@ static int interrupt_window_interception(struct kvm_vcpu *vcpu)
>  	kvm_make_request(KVM_REQ_EVENT, vcpu);
>  	svm_clear_vintr(to_svm(vcpu));
>  
> -	/*
> -	 * If not running nested, for AVIC, the only reason to end up here is ExtINTs.
> -	 * In this case AVIC was temporarily disabled for
> -	 * requesting the IRQ window and we have to re-enable it.
> -	 *
> -	 * If running nested, still remove the VM wide AVIC inhibit to
> -	 * support case in which the interrupt window was requested when the
> -	 * vCPU was not running nested.
> -
> -	 * All vCPUs which run still run nested, will remain to have their
> -	 * AVIC still inhibited due to per-cpu AVIC inhibition.
> -	 */
> -	kvm_clear_apicv_inhibit(vcpu->kvm, APICV_INHIBIT_REASON_IRQWIN);
> -
>  	++vcpu->stat.irq_window_exits;
>  	return 1;
>  }
> @@ -3879,22 +3874,8 @@ static void svm_enable_irq_window(struct kvm_vcpu *vcpu)
>  	 * enabled, the STGI interception will not occur. Enable the irq
>  	 * window under the assumption that the hardware will set the GIF.
>  	 */
> -	if (vgif || gif_set(svm)) {
> -		/*
> -		 * IRQ window is not needed when AVIC is enabled,
> -		 * unless we have pending ExtINT since it cannot be injected
> -		 * via AVIC. In such case, KVM needs to temporarily disable AVIC,
> -		 * and fallback to injecting IRQ via V_IRQ.
> -		 *
> -		 * If running nested, AVIC is already locally inhibited
> -		 * on this vCPU, therefore there is no need to request
> -		 * the VM wide AVIC inhibition.
> -		 */
> -		if (!is_guest_mode(vcpu))
> -			kvm_set_apicv_inhibit(vcpu->kvm, APICV_INHIBIT_REASON_IRQWIN);
> -
> +	if (vgif || gif_set(svm))
>  		svm_set_vintr(svm);
> -	}
>  }
>  
>  static void svm_enable_nmi_window(struct kvm_vcpu *vcpu)
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index 9d7cdb8fbf87..8eefed0a865a 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -323,6 +323,7 @@ struct vcpu_svm {
>  
>  	bool guest_state_loaded;
>  
> +	bool avic_irq_window;
>  	bool x2avic_msrs_intercepted;
>  
>  	/* Guest GIF value, used when vGIF is not enabled */
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index b2d9a16fd4d3..7388f4cfe468 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10604,7 +10604,11 @@ void __kvm_set_or_clear_apicv_inhibit(struct kvm *kvm,
>  
>  	old = new = kvm->arch.apicv_inhibit_reasons;
>  
> -	set_or_clear_apicv_inhibit(&new, reason, set);
> +	if (reason != APICV_INHIBIT_REASON_IRQWIN)
> +		set_or_clear_apicv_inhibit(&new, reason, set);
> +
> +	set_or_clear_apicv_inhibit(&new, APICV_INHIBIT_REASON_IRQWIN,
> +				   atomic_read(&kvm->arch.apicv_irq_window));
>  
>  	if (!!old != !!new) {
>  		/*
> @@ -10645,6 +10649,36 @@ void kvm_set_or_clear_apicv_inhibit(struct kvm *kvm,
>  }
>  EXPORT_SYMBOL_GPL(kvm_set_or_clear_apicv_inhibit);
>
> +void kvm_inc_or_dec_irq_window_inhibit(struct kvm *kvm, bool inc)
> +{
> +	bool toggle;
> +
> +	/*
> +	 * The IRQ window inhibit has a cyclical dependency of sorts, as KVM
> +	 * needs to manually inject IRQs and thus detect interrupt windows if
> +	 * APICv is disabled/inhibited.  To avoid thrashing if the IRQ window
> +	 * is being requested because APICv is already inhibited, toggle the
> +	 * actual inhibit (and take the lock for write) if and only if there's
> +	 * no other inhibit.  KVM evaluates the IRQ window count when _any_
> +	 * inhibit changes, i.e. the IRQ window inhibit can be lazily updated
> +	 * on the next inhibit change (if one ever occurs).
> +	 */
> +	down_read(&kvm->arch.apicv_update_lock);
> +
> +	if (inc)
> +		toggle = atomic_inc_return(&kvm->arch.apicv_irq_window) == 1;
> +	else
> +		toggle = atomic_dec_return(&kvm->arch.apicv_irq_window) == 0;
> +
> +	toggle = toggle && !(kvm->arch.apicv_inhibit_reasons & ~BIT(APICV_INHIBIT_REASON_IRQWIN));
> +
> +	up_read(&kvm->arch.apicv_update_lock);
> +
> +	if (toggle)
> +		kvm_set_or_clear_apicv_inhibit(kvm, APICV_INHIBIT_REASON_IRQWIN, inc);

I'm not super confident about breaking up the critical section... Another
possibility:

void kvm_inc_or_dec_irq_window_inhibit(struct kvm *kvm, bool inc)
{
	int add = inc ? 1 : -1;

	if (!enable_apicv)
		return;

	/*
	 * IRQ windows happen either because of ExtINT injections, or because
	 * APICv is already disabled/inhibited for another reason.  While
	 * ExtINT injections are rare and should not happen while the vCPU is
	 * running its actual workload, it's worth avoiding thrashing if the
	 * IRQ window is being requested because APICv is already inhibited.
	 * So, toggle the actual inhibit (which requires taking the lock for
	 * write) if and only if there's no other inhibit.
	 * kvm_set_or_clear_apicv_inhibit() always evaluates the IRQ window
	 * count, thus the IRQ window inhibit _will_ be lazily updated on the
	 * next call, if it ever happens.
	 */
	if (READ_ONCE(kvm->arch.apicv_inhibit_reasons) & ~BIT(APICV_INHIBIT_REASON_IRQWIN)) {
		guard(rwsem_read)(&kvm->arch.apicv_update_lock);

		if (READ_ONCE(kvm->arch.apicv_inhibit_reasons) & ~BIT(APICV_INHIBIT_REASON_IRQWIN)) {
			atomic_add(add, &kvm->arch.apicv_irq_window);
			return;
		}
	}

	/*
	 * Strictly speaking the lock is only needed if going 0->1 or 1->0,
	 * a la atomic_dec_and_mutex_lock.  However, ExtINTs are rare and
	 * only target a single CPU, so that is the common case; do not
	 * bother eliding the down_write()/up_write() pair.
	 */
	guard(rwsem_write)(&kvm->arch.apicv_update_lock);

	if (atomic_add_return(add, &kvm->arch.apicv_irq_window) == inc)
		__kvm_set_or_clear_apicv_inhibit(kvm, APICV_INHIBIT_REASON_IRQWIN, inc);
}
EXPORT_SYMBOL_GPL(kvm_inc_or_dec_irq_window_inhibit);
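
(For what it's worth, guard(rwsem_read)/guard(rwsem_write) above are the
scope-based lock guards from <linux/cleanup.h>, so the rwsem is dropped
automatically on every return path, including the early return under the
read lock; that's what keeps the double-checked pattern tidy.)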
Paolo