Date:   Fri, 21 Jul 2023 14:44:29 -0400
From:   Alejandro Jimenez <alejandro.j.jimenez@...cle.com>
To:     Sean Christopherson <seanjc@...gle.com>,
        Joao Martins <joao.m.martins@...cle.com>
Cc:     Maxim Levitsky <mlevitsk@...hat.com>,
        "dengqiao.joey" <dengqiao.joey@...edance.com>, kvm@...r.kernel.org,
        linux-kernel@...r.kernel.org, pbonzini@...hat.com,
        Suravee Suthikulpanit <suravee.suthikulpanit@....com>
Subject: Re: [PATCH] KVM: SVM: Update destination when updating pi irte


On 7/19/23 11:57, Sean Christopherson wrote:
> On Fri, Jul 14, 2023, Joao Martins wrote:
>> +Suravee, +Alejandro
>>
>> On 29/06/2023 23:35, Sean Christopherson wrote:
>>> On Thu, May 18, 2023, Joao Martins wrote:
>>>> On 18/05/2023 09:19, Maxim Levitsky wrote:
>>>>> I think that we do need a flag indicating whether the vCPU is currently
>>>>> running, and if yes, then use svm->vcpu.cpu (or put -1 in it when it is not
>>>>> running, or something) (currently vcpu->cpu remains set when the vCPU is
>>>>> put)
>>>>>
>>>>> In other words, if a vCPU is running, then avic_pi_update_irte should put
>>>>> the correct pCPU number, and if it races with vCPU put/load, then the later
>>>>> update should win and put the correct value.  This can be done either with
>>>>> a lock or barriers.
>>>>>
>>>> If this could be done, it could remove cost from other places and avoid this
>>>> little dance of the galog (and avoid its usage, as it's not the greatest design
>>>> aspect of the IOMMU). We anyway already need to do IRT flushes in all these
>>>> paths when updating any piece of the IRTE, but we need some care
>>>> there too to avoid invalidating too much (which is just as expensive and per-vCPU).
>>> ...
>>>
>>>> But it's still quite expensive (as many IPIs as vCPUs updated), though it works as
>>>> intended and the guest will immediately see the right vCPU affinity. But I honestly
>>>> prefer going towards your suggestion (via vcpu.pcpu) if we can have some
>>>> assurance that vcpu.cpu is safe to use in pi_update_irte when protected against
>>>> preemption/blocking of the vCPU.
>>> I think we have all the necessary info, and even a handy dandy spinlock to ensure
>>> ordering.  Disclaimers: compile tested only, I know almost nothing about the IOMMU
>>> side of things, and I don't know if I understood the needs for the !IsRunning cases.
>>>
>> I was avoiding grabbing that lock, but now that I think about it, it shouldn't do
>> much harm.
>>
>> My main concern has been whether we mark the IRQ isRunning=1 on a vCPU
>> that is about to block, as then the doorbell rung by the IOMMU won't do anything
>> to the guest. But IIUC the physical ID cache read-once should cover that.
> Acquiring ir_list_lock in avic_vcpu_{load,put}() when modifying
> AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK is the key to avoiding ordering issues.
> E.g. without the spinlock, READ_ONCE() wouldn't prevent svm_ir_list_add() from
> racing with avic_vcpu_{load,put}() and ultimately shoving stale data into the IRTE.
>
> It *should* actually be safe to drop the READ_ONCE() since acquiring/releasing
> the spinlock will prevent multiple loads from observing different values.  I kept
> them mostly to keep the diff small, and to be conservative.
>
> The WRITE_ONCE() needs to stay to ensure that hardware doesn't see inconsistent
> information due to store tearing.
>
> If this patch works, I think it makes sense to follow-up with a cleanup patch to
> drop the READ_ONCE() and add comments explaining why KVM uses WRITE_ONCE() but
> not READ_ONCE().
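For context, my reading of the above: both avic_vcpu_{load,put}() and the
IRTE update path serialize on ir_list_lock, so whichever side runs last
writes the final <pCPU, IsRunning> pair, and WRITE_ONCE() keeps hardware
from ever observing a torn store of the physical ID entry. A minimal sketch
of the avic_vcpu_load() side of that pattern (illustrative only, not the
actual patch; field and macro names are my approximation of the existing
avic.c code):

static void avic_vcpu_load_sketch(struct kvm_vcpu *vcpu, int cpu)
{
	struct vcpu_svm *svm = to_svm(vcpu);
	u64 h_physical_id = kvm_cpu_get_apicid(cpu);
	unsigned long flags;
	u64 entry;

	/*
	 * Taking ir_list_lock orders this update against
	 * svm_ir_list_add()/avic_pi_update_irte(), so they cannot shove a
	 * stale <pCPU, IsRunning> pair into the IRTE.
	 */
	spin_lock_irqsave(&svm->ir_list_lock, flags);

	entry = READ_ONCE(*svm->avic_physical_id_cache);
	entry &= ~AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK;
	entry |= h_physical_id & AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK;
	entry |= AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;

	/* WRITE_ONCE() so hardware never sees a torn/partial entry. */
	WRITE_ONCE(*svm->avic_physical_id_cache, entry);

	/* Propagate the new destination to the IRTEs on this vCPU's ir_list. */
	avic_update_iommu_vcpu_affinity(vcpu, h_physical_id, true);

	spin_unlock_irqrestore(&svm->ir_list_lock, flags);
}
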
I tested the patch on top of v6.5-rc1, so I could also use the host kernel
parameter "amd_iommu=irtcachedis" from:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=66419036f68a

and found no issues running a 380 vCPU guest (using idle=poll), with an
mlx VF, on a Genoa host.

I didn't run an interrupt-intensive workload, but I stressed the affinity
changes in a tight loop on the guest, running:

--
rcpu=$(($RANDOM % $(nproc)))

# the mlx5_comp* IRQs are in the 35-49 range
rirq=$((35 + $RANDOM % (50-35)))

echo $rcpu > /proc/irq/$rirq/smp_affinity_list

--
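
The loop body above was driven for ~100k iterations; a minimal sketch of the
driver loop (assuming the 35-49 mlx5_comp* IRQ range from the comment above):

--
#!/bin/bash
# Sketch: repeatedly move a random mlx5_comp* IRQ to a random guest CPU.
for i in $(seq 1 100000); do
    rcpu=$(($RANDOM % $(nproc)))
    rirq=$((35 + $RANDOM % (50-35)))
    echo $rcpu > /proc/irq/$rirq/smp_affinity_list
done
--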

As suggested by Joao, I checked whether any 'spurious' GA Log events are
received while the target vCPU is running. The window for this to happen is
quite tight with the new changes, so after 100k affinity changes there were
only 2 reported GA Log events on the guest:

--
@galog_events: 2
@galog_on_running[303]: 1
@galog_on_running[222]: 1

@vcpuHist:
[128, 256)             1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[256, 512)             1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

--

whereas when running with an unpatched host kernel a significant number are
detected:

--
@galog_events: 2476

[...]

@vcpuHist:
[0]                    2 |                                                    |
[1]                    1 |                                                    |
[2, 4)                11 |                                                    |
[4, 8)                13 |                                                    |
[8, 16)               51 |@@@                                                 |
[16, 32)              99 |@@@@@                                               |
[32, 64)             213 |@@@@@@@@@@@@                                        |
[64, 128)            381 |@@@@@@@@@@@@@@@@@@@@@@                              |
[128, 256)           869 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[256, 512)           834 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@   |

--

The script I used makes assumptions about the strict ordering in which the
probes will be registered, but given that the observed behavior is as
predicted, the assumptions seem sound. It is pasted below, in case there
are concerns about the logic.

Thank you,

Alejandro

---

#!/usr/bin/bpftrace

BEGIN
{
     printf("Tracing GALog events between affinity updates... Hit Ctrl-C 
to end.\n");
     zero(@irteModified);
     zero(@galog_events);
     zero(@galog_on_running);
}

/*
 * amd_iommu_activate_guest_mode() is mostly called from
 * amd_ir_set_vcpu_affinity() to process vCPU affinity updates, but it also
 * gets called by avic_set_pi_irte_mode() for re-enabling AVIC after
 * inhibition, so this data is unreliable during boot time, where most of the
 * inhibition events take place.
 */
kprobe:amd_iommu_activate_guest_mode
{
     /*
      * $vcpu_gatag = (struct amd_ir_data *)arg0->gatag;
      * pahole -C amd_ir_data --hex drivers/iommu/amd/iommu.o
      * shows offset of u32 ga_tag field is 0x40
      * AVIC GATAG encodes vCPU ID in LSB 9 bits
      */
     $vcpu_gatag = (*(uint32 *)(arg0 + 0x40)) & 0x1FF;

     @irteModified[$vcpu_gatag] = 1;
}

tracepoint:kvm:kvm_avic_ga_log
{
     $vcpuid = args->vcpuid;

     @galog_events = count();

     /*
      * GALog reports an entry, and it's expected that the IRTE.isRunning bit
      * is 0. The question becomes whether it was cleared by an affinity update
      * and has not been restored by a subsequent call to amd_iommu_update_ga.
      */
     if (@irteModified[args->vcpuid] != 0) {
         @galog_on_running[args->vcpuid] = count();

         @vcpuHist = hist(args->vcpuid);
         //@vcpuHist = lhist(args->vcpuid, 0, 380, 1);
     }
}


kprobe:amd_iommu_update_ga
{
     /*
      * $vcpu_gatag = (struct amd_ir_data *)arg0->gatag;
      * pahole -C amd_ir_data --hex drivers/iommu/amd/iommu.o
      * shows offset of u32 ga_tag field is 0x40
      * AVIC GATAG encodes vCPU ID in LSB 9 bits
      */
     $vcpu_gatag = (*(uint32 *)(arg0 + 0x40)) & 0x1FF;

     /*
      * With the guest running with idle=poll, avic_vcpu_put() should not
      * be called, and any GA Log events detected must be spurious, i.e.
      * targeting a vCPU that is currently running. Only clear the flag
      * when setting IsRun to 1 (as in via avic_vcpu_load() or
      * svm_ir_list_add()), to capture the spurious GA Log events.
      * arg1 ==> isRun
      */
     if (arg1 != 0) {
         @irteModified[$vcpu_gatag] = 0;
     }
}

END
{
     clear(@irteModified);
     print(@galog_events);
     print(@galog_on_running);
     print(@vcpuHist);
}
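
(To reproduce: save the script above, e.g. as galog_check.bt (the filename is
arbitrary), and run it on the host as root with "bpftrace galog_check.bt"
while the affinity-change loop runs in the guest.)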
