Message-ID: <f1574274-efd8-eb56-436b-5a1dd7620f2c@huawei.com>
Date: Thu, 22 Aug 2024 02:23:30 +0800
From: Kunkun Jiang <jiangkunkun@...wei.com>
To: Marc Zyngier <maz@...nel.org>
CC: Thomas Gleixner <tglx@...utronix.de>, Oliver Upton
<oliver.upton@...ux.dev>, James Morse <james.morse@....com>, Suzuki K Poulose
<suzuki.poulose@....com>, Zenghui Yu <yuzenghui@...wei.com>, "open list:IRQ
SUBSYSTEM" <linux-kernel@...r.kernel.org>, "moderated list:ARM SMMU DRIVERS"
<linux-arm-kernel@...ts.infradead.org>, <kvmarm@...ts.linux.dev>,
"wanghaibin.wang@...wei.com" <wanghaibin.wang@...wei.com>,
<nizhiqiang1@...wei.com>, "tangnianyao@...wei.com" <tangnianyao@...wei.com>,
<wangzhou1@...ilicon.com>
Subject: Re: [bug report] GICv4.1: multiple vpus execute vgic_v4_load at the
same time will greatly increase the time consumption
Hi Marc,
On 2024/8/21 18:59, Marc Zyngier wrote:
> On Wed, 21 Aug 2024 10:51:27 +0100,
> Kunkun Jiang <jiangkunkun@...wei.com> wrote:
>>
>> Hi all,
>>
>> Recently I discovered a problem about GICv4.1, the scenario is as follows:
>> 1. Enable GICv4.1
>> 2. Create multiple VMs. For example, 50 VMs (4U8G)
s/4U8G/8U16G/, sorry..
> I don't know what 4U8G means. On how many physical CPUs are you
> running 50 VMs? Direct injection of interrupts and over-subscription
> are fundamentally incompatible.
Each VM is configured with 8 vcpus and 16G memory. The number of
physical CPUs is 320.
>
>> 3. The workload running in the VMs performs frequent MMIO accesses that
>> need to exit to QEMU for processing.
>> 4. Alternatively, modify the KVM code so that WFI always traps to KVM.
>> 5. The utilization of the pCPU where the vcpu runs then reaches 100%,
>> almost all of it in sys.
>
> What did you expect? If you trap all the time, your performance will
> suck. Don't do that.
>
>> 6. This problem does not exist in GICv3.
>
> Because GICv3 doesn't have the same constraints.
>
>>
>> According to analysis, this problem is due to the execution of vgic_v4_load.
>> vcpu_load or kvm_sched_in
>>     kvm_arch_vcpu_load
>>     ...
>>         vgic_v4_load
>>             irq_set_affinity
>>             ...
>>                 irq_do_set_affinity
>>                     raw_spin_lock(&tmp_mask_lock)
>>                     chip->irq_set_affinity
>>                     ...
>>                         its_vpe_set_affinity
>>
>> The tmp_mask_lock is the key. This is a global lock. I don't quite
>> understand why tmp_mask_lock is needed here. I think there are two
>> possible solutions here:
>> 1. Remove this tmp_mask_lock
>
> Maybe you could have a look at 33de0aa4bae98 (and 11ea68f553e24)? It
> would allow you to understand the nature of the problem.
>
> This can probably be replaced with a per-CPU cpumask, which would
> avoid the locking, but potentially result in a larger memory usage.
Thanks, I will try it.
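
To make sure I understand the per-CPU idea, here is a rough userspace sketch of it. This is not kernel code: a thread-local variable stands in for a per-CPU cpumask, a 64-bit word stands in for struct cpumask, and model_pick_target() and model_scratch_mask are made-up names. The point is only that each caller gets its own scratch mask, so no global lock is needed to protect it.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NR_CPUS 64

/*
 * Assumption: one scratch mask per thread models one scratch cpumask per
 * CPU. Because no other thread can see it, no lock protects it.
 */
static _Thread_local uint64_t model_scratch_mask;

/*
 * Hypothetical helper: intersect the requested affinity with the online
 * CPUs in the private scratch mask and return the lowest usable CPU,
 * or -1 if the intersection is empty.
 */
static int model_pick_target(uint64_t requested, uint64_t online)
{
    model_scratch_mask = requested & online;

    for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
        uint64_t bit = UINT64_C(1) << cpu;

        if (model_scratch_mask & bit) {
            /* A chosen CPU must always be an online one. */
            assert(online & bit);
            return cpu;
        }
    }
    return -1;
}
```

The cost is the one you mention: one scratch mask per CPU instead of a single shared one.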
>> 2. Modify the GICv4 driver so that it does not perform VMOVP via
>> irq_set_affinity.
>
> Sure. You could also not use KVM at all if you don't care about interrupts
> being delivered to your VM. We do not send a VMOVP just for fun. We
> send it because your vcpu has moved to a different CPU, and the ITS
> needs to know about that.
When a vcpu is moved to a different CPU, of course VMOVP has to be sent.
What I mean is: would it be possible to call its_vpe_set_affinity() to
send VMOVP by some other means (instead of going through the
irq_set_affinity() API), so that we can bypass tmp_mask_lock?
>
> You seem to be misunderstanding the use case for GICv4: a partitioned
> system, without any over-subscription, no vcpu migration between CPUs.
> If that's not your setup, then GICv4 will always be a net loss
> compared to SW injection with GICv3 (additional HW interaction,
> doorbell interrupts).
Thanks for the explanation. The key to the problem is not vcpu migration
between CPUs. The key point is that many vcpus execute vgic_v4_load() at
the same time. Even if a vcpu is not migrated to another CPU, a large
number of vcpus may still execute vgic_v4_load() at the same time. For
example, the service running in the VMs performs a large number of MMIO
accesses and needs to return to userspace for emulation. Because of the
contention on tmp_mask_lock, performance deteriorates.
When the target CPU is the same CPU as the last run, there seems to be
no need to call irq_set_affinity() in this case. I did a test and it was
indeed able to alleviate the problem described above.
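
The kind of check I have in mind looks roughly like the userspace model below. It is only a sketch, not the change I tested: struct model_vpe, model_set_affinity() and model_vpe_load() are made-up names, and model_set_affinity() merely stands in for the irq_set_affinity() path that takes tmp_mask_lock and ends in VMOVP.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a vPE; these are not the kernel's structures. */
struct model_vpe {
    int resident_cpu;        /* CPU the vPE was last mapped to, -1 if none */
    unsigned long nr_moves;  /* number of times the slow path was taken */
};

/* Stand-in for the expensive path (global lock + VMOVP in the real code). */
static void model_set_affinity(struct model_vpe *vpe, int cpu)
{
    vpe->resident_cpu = cpu;
    vpe->nr_moves++;
}

/*
 * Load the vPE on @cpu. Returns true if the slow path was taken, false
 * if it was skipped because the vPE was already mapped to @cpu.
 */
static bool model_vpe_load(struct model_vpe *vpe, int cpu)
{
    assert(cpu >= 0);

    if (vpe->resident_cpu == cpu)
        return false;   /* same CPU as the last run: nothing to move */

    model_set_affinity(vpe, cpu);
    return true;
}
```

With this shape, a vcpu that keeps exiting to userspace and coming back on the same CPU never reaches the locked path; only a real move does.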
I feel it might be better to remove tmp_mask_lock or call
its_vpe_set_affinity() in another way. So I mentioned these two ideas
above.
Thanks,
Kunkun Jiang