Date:   Fri, 14 Jul 2023 14:05:27 +0100
From:   Joao Martins <joao.m.martins@...cle.com>
To:     Sean Christopherson <seanjc@...gle.com>
Cc:     Maxim Levitsky <mlevitsk@...hat.com>,
        "dengqiao.joey" <dengqiao.joey@...edance.com>, kvm@...r.kernel.org,
        linux-kernel@...r.kernel.org, pbonzini@...hat.com,
        Alejandro Jimenez <alejandro.j.jimenez@...cle.com>,
        Suravee Suthikulpanit <suravee.suthikulpanit@....com>
Subject: Re: [PATCH] KVM: SVM: Update destination when updating pi irte

+Suravee, +Alejandro

On 29/06/2023 23:35, Sean Christopherson wrote:
> On Thu, May 18, 2023, Joao Martins wrote:
>> On 18/05/2023 09:19, Maxim Levitsky wrote:
>>> I think that we do need a flag indicating if the vCPU is currently
>>> running, and if yes, then use svm->vcpu.cpu (or put -1 in it when it's
>>> not running, or something) (currently vcpu->cpu remains set when the
>>> vCPU is put).
>>>
>>> In other words, if a vCPU is running, then avic_pi_update_irte should
>>> put the correct pCPU number, and if it raced with vCPU put/load, then the
>>> later update should win and put the correct value.  This can be done
>>> either with a lock or barriers.
>>>
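
(For concreteness, my reading of the suggestion, as a hypothetical sketch
rather than anything from the patch below: vcpu_put would publish -1,
vcpu_load the real pCPU, and the IRTE update path would consume it under a
common lock so that whichever update runs later wins.)

/*
 * Hypothetical sketch only; the example_ helper does not exist and this
 * is not what the patch below ends up doing.
 */
static int example_published_pcpu(struct vcpu_svm *svm)
{
	lockdep_assert_held(&svm->ir_list_lock);

	/* -1 would mean "not running"; written by vcpu_put/vcpu_load */
	return READ_ONCE(svm->vcpu.cpu);
}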
>> If this could be done, it could remove cost from other places and avoid this
>> little dance with the GA log (and avoid its usage, as it's not the greatest
>> design aspect of the IOMMU). We already need to do IRT flushes whenever we
>> update any piece of the IRTE, but we need some care there too to avoid
>> invalidating too much (which is just as expensive, and per-vCPU).
> 
> ...
> 
>> But it is still quite expensive (as many IPIs as vCPUs updated), though it
>> works as intended and the guest immediately sees the right vCPU affinity. I
>> honestly prefer going towards your suggestion (via vcpu.cpu), if we can have
>> some assurance that vcpu.cpu is safe to use in pi_update_irte when protected
>> against preemption/blocking of the vCPU.
> 
> I think we have all the necessary info, and even a handy dandy spinlock to ensure
> ordering.  Disclaimers: compile tested only, I know almost nothing about the IOMMU
> side of things, and I don't know if I understood the needs for the !IsRunning cases.
> 
I was avoiding grabbing that lock, but now that I think about it, it shouldn't
do much harm.

My only concern has mostly been whether we could mark the IRTE IsRunning=1 for
a vCPU that is about to block, as the doorbell rung by the IOMMU then wouldn't
do anything to the guest. But IIUC the read-once of the physical ID cache
should cover that: svm_ir_list_add() and avic_vcpu_put() both hold
ir_list_lock, so either the add sees IsRunning already cleared, or
avic_vcpu_put() runs afterwards and walks ir_list to undo the stale target.

> Disclaimers aside, this should point the IOMMU at the right pCPU when the target
> vCPU changes and the new vCPU is actively running.
> 
Yeap, it should. I am going to see if we can test this next week (if not me,
then someone from my team).

> ---
>  arch/x86/kvm/svm/avic.c | 44 +++++++++++++++++++++++++++++++++--------
>  1 file changed, 36 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
> index cfc8ab773025..703ad9af73eb 100644
> --- a/arch/x86/kvm/svm/avic.c
> +++ b/arch/x86/kvm/svm/avic.c
> @@ -791,6 +791,7 @@ static int svm_ir_list_add(struct vcpu_svm *svm, struct amd_iommu_pi_data *pi)
>  	int ret = 0;
>  	unsigned long flags;
>  	struct amd_svm_iommu_ir *ir;
> +	u64 entry;
>  
>  	/**
>  	 * In some cases, the existing irte is updated and re-set,
> @@ -824,6 +825,18 @@ static int svm_ir_list_add(struct vcpu_svm *svm, struct amd_iommu_pi_data *pi)
>  	ir->data = pi->ir_data;
>  
>  	spin_lock_irqsave(&svm->ir_list_lock, flags);
> +
> +	/*
> +	 * Update the target pCPU for IOMMU doorbells if the vCPU is running.
> +	 * If the vCPU is NOT running, i.e. is blocking or scheduled out, KVM
> +	 * will update the pCPU info when the vCPU is awakened and/or scheduled in.
> +	 * See also avic_vcpu_load().
> +	 */
> +	entry = READ_ONCE(*(svm->avic_physical_id_cache));
> +	if (entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK)
> +		amd_iommu_update_ga(entry & AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK,
> +				    true, pi->ir_data);
> +

Ah! Totally forgot about the physical ID cache from AVIC. And it's already
paired with the barriers Maxim was alluding to. Much better than trying to get
a safe read of vcpu::cpu.
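
To spell out why the single read is enough (my understanding, with a
hypothetical example_ helper; the names just mirror the patch): the cached
entry packs the host APIC ID and the IsRunning bit into one u64, so a single
READ_ONCE() yields a consistent (pCPU, running) pair, whereas vcpu::cpu plus
a separate running flag could be observed torn.

/* Hypothetical helper, not part of the patch. */
static bool example_read_avic_target(struct vcpu_svm *svm, u32 *host_apic_id)
{
	u64 entry = READ_ONCE(*(svm->avic_physical_id_cache));

	*host_apic_id = entry & AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK;
	return entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
}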

>  	list_add(&ir->node, &svm->ir_list);
>  	spin_unlock_irqrestore(&svm->ir_list_lock, flags);
>  out:
> @@ -986,10 +999,11 @@ static inline int
>  avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu, bool r)
>  {
>  	int ret = 0;
> -	unsigned long flags;
>  	struct amd_svm_iommu_ir *ir;
>  	struct vcpu_svm *svm = to_svm(vcpu);
>  
> +	lockdep_assert_held(&svm->ir_list_lock);
> +
>  	if (!kvm_arch_has_assigned_device(vcpu->kvm))
>  		return 0;
>  
> @@ -997,19 +1011,15 @@ avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu, bool r)
>  	 * Here, we go through the per-vcpu ir_list to update all existing
>  	 * interrupt remapping table entry targeting this vcpu.
>  	 */
> -	spin_lock_irqsave(&svm->ir_list_lock, flags);
> -
>  	if (list_empty(&svm->ir_list))
> -		goto out;
> +		return 0;
>  
>  	list_for_each_entry(ir, &svm->ir_list, node) {
>  		ret = amd_iommu_update_ga(cpu, r, ir->data);
>  		if (ret)
> -			break;
> +			return ret;
>  	}
> -out:
> -	spin_unlock_irqrestore(&svm->ir_list_lock, flags);
> -	return ret;
> +	return 0;
>  }
>  
>  void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> @@ -1017,6 +1027,7 @@ void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>  	u64 entry;
>  	int h_physical_id = kvm_cpu_get_apicid(cpu);
>  	struct vcpu_svm *svm = to_svm(vcpu);
> +	unsigned long flags;
>  
>  	lockdep_assert_preemption_disabled();
>  
> @@ -1033,6 +1044,15 @@ void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>  	if (kvm_vcpu_is_blocking(vcpu))
>  		return;
>  
> +	/*
> +	 * Grab the per-vCPU interrupt remapping lock even if the VM doesn't
> +	 * _currently_ have assigned devices, as that can change.  Holding
> +	 * ir_list_lock ensures that either svm_ir_list_add() will consume
> +	 * up-to-date entry information, or that this task will wait until
> +	 * svm_ir_list_add() completes to set the new target pCPU.
> +	 */
> +	spin_lock_irqsave(&svm->ir_list_lock, flags);
> +
>  	entry = READ_ONCE(*(svm->avic_physical_id_cache));
>  	WARN_ON_ONCE(entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK);
>  
> @@ -1042,12 +1062,15 @@ void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>  
>  	WRITE_ONCE(*(svm->avic_physical_id_cache), entry);
>  	avic_update_iommu_vcpu_affinity(vcpu, h_physical_id, true);
> +
> +	spin_unlock_irqrestore(&svm->ir_list_lock, flags);
>  }
>  
>  void avic_vcpu_put(struct kvm_vcpu *vcpu)
>  {
>  	u64 entry;
>  	struct vcpu_svm *svm = to_svm(vcpu);
> +	unsigned long flags;
>  
>  	lockdep_assert_preemption_disabled();
>  
> @@ -1057,10 +1080,15 @@ void avic_vcpu_put(struct kvm_vcpu *vcpu)
>  	if (!(entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK))
>  		return;
>  
> +	spin_lock_irqsave(&svm->ir_list_lock, flags);
> +
>  	avic_update_iommu_vcpu_affinity(vcpu, -1, 0);
>  
>  	entry &= ~AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
>  	WRITE_ONCE(*(svm->avic_physical_id_cache), entry);
> +
> +	spin_unlock_irqrestore(&svm->ir_list_lock, flags);
> +
>  }

Other than that, it looks good. To some extent we are doing an update_ga()
just like Dengqiao was, but with the right protection both from the IR
perspective and with a safe read of the vCPU's pCPU.
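
For anyone following along, here's how I read the resulting protocol,
condensed into a tiny userspace model (my sketch, not kernel code; the names
only mirror the patch, a pthread mutex stands in for ir_list_lock, and the
ir_list walks are elided):

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define IS_RUNNING_BIT	(1ULL << 62)	/* models the IsRunning bit */
#define HOST_ID_MASK	0xffULL		/* models the host APIC ID field */

static _Atomic uint64_t entry;		/* models *avic_physical_id_cache */
static pthread_mutex_t ir_lock = PTHREAD_MUTEX_INITIALIZER;
static int irte_target = -1;		/* models the IRTE destination */

static void update_ga(int cpu, bool run)	/* models amd_iommu_update_ga() */
{
	irte_target = run ? cpu : -1;
}

static void vcpu_load(int cpu)		/* models avic_vcpu_load() */
{
	pthread_mutex_lock(&ir_lock);
	atomic_store(&entry, IS_RUNNING_BIT | (uint64_t)cpu);
	update_ga(cpu, true);
	pthread_mutex_unlock(&ir_lock);
}

static void vcpu_put(void)		/* models avic_vcpu_put() */
{
	pthread_mutex_lock(&ir_lock);
	update_ga(-1, false);
	atomic_fetch_and(&entry, ~IS_RUNNING_BIT);
	pthread_mutex_unlock(&ir_lock);
}

static void ir_list_add(void)		/* models svm_ir_list_add() */
{
	uint64_t e;

	pthread_mutex_lock(&ir_lock);
	e = atomic_load(&entry);	/* the READ_ONCE() in the patch */
	if (e & IS_RUNNING_BIT)
		update_ga((int)(e & HOST_ID_MASK), true);
	pthread_mutex_unlock(&ir_lock);
}

int main(void)
{
	vcpu_load(3);	/* vCPU running on pCPU 3 */
	ir_list_add();	/* sees IsRunning, points the IRTE at pCPU 3 */
	vcpu_put();	/* undoes it: irte_target goes back to -1 */
	return irte_target == -1 ? 0 : 1;
}

Because every writer holds the lock, the only possible interleavings are the
two benign ones discussed above: ir_list_add() either observes IsRunning
clear and leaves the IRTE alone, or sets the target and is then corrected by
vcpu_put()/vcpu_load().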
