linux-kernel - Re: [bug report] GICv4.1: multiple vpus execute vgic_v4_load at the same time will greatly increase the time consumption

Message-ID: <867cc9x8si.wl-maz@kernel.org>
Date: Thu, 22 Aug 2024 09:26:37 +0100
From: Marc Zyngier <maz@...nel.org>
To: Kunkun Jiang <jiangkunkun@...wei.com>
Cc: Thomas Gleixner <tglx@...utronix.de>,
	Oliver Upton <oliver.upton@...ux.dev>,
	James Morse <james.morse@....com>,
	Suzuki K Poulose <suzuki.poulose@....com>,
	Zenghui Yu <yuzenghui@...wei.com>,
	"open list:IRQ SUBSYSTEM" <linux-kernel@...r.kernel.org>,
	"moderated list:ARM SMMU DRIVERS" <linux-arm-kernel@...ts.infradead.org>,
	<kvmarm@...ts.linux.dev>,
	"wanghaibin.wang@...wei.com" <wanghaibin.wang@...wei.com>,
	<nizhiqiang1@...wei.com>,
	"tangnianyao@...wei.com" <tangnianyao@...wei.com>,
	<wangzhou1@...ilicon.com>
Subject: Re: [bug report] GICv4.1: multiple vpus execute vgic_v4_load at the same time will greatly increase the time consumption

On Wed, 21 Aug 2024 19:23:30 +0100,
Kunkun Jiang <jiangkunkun@...wei.com> wrote:
> 
> Hi Marc,
> 
> On 2024/8/21 18:59, Marc Zyngier wrote:
> > On Wed, 21 Aug 2024 10:51:27 +0100,
> > Kunkun Jiang <jiangkunkun@...wei.com> wrote:
> >> 
> >> Hi all,
> >> 
> >> Recently I discovered a problem about GICv4.1, the scenario is as follows:
> >> 1. Enable GICv4.1
> >> 2. Create multiple VMs. For example, 50 VMs (4U8G)
> 
> s/4U8G/8U16G/, sorry..
> 
> > I don't know what 4U8G means. On how many physical CPUs are you
> > running 50 VMs? Direct injection of interrupts and over-subscription
> > are fundamentally incompatible.
> 
> Each VM is configured with 8 vcpus and 16G memory. The number of
> physical CPUs is 320.

So you spawn 400 vcpus in one go. Fun.

> 
> > 
> >> 3. The workload running in the VMs makes frequent MMIO accesses and needs
> >>    to exit to QEMU for processing.
> >> 4. Or modify the KVM code so that WFI must trap to KVM
> >> 5. Then the utilization of the pCPU where the vCPU is located will be 100%,
> >>    and basically all in sys.
> > 
> > What did you expect? If you trap all the time, your performance will
> > suck.  Don't do that.
> > 
> >> 6. This problem does not exist in GICv3.
> > 
> > Because GICv3 doesn't have the same constraints.
> > 
> >> 
> >> According to analysis, this problem is due to the execution of vgic_v4_load.
> >> vcpu_load or kvm_sched_in
> >>      kvm_arch_vcpu_load
> >>      ...
> >>          vgic_v4_load
> >>              irq_set_affinity
> >>              ...
> >>                  irq_do_set_affinity
> >>                      raw_spin_lock(&tmp_mask_lock)
> >>                      chip->irq_set_affinity
> >>                      ...
> >>                        its_vpe_set_affinity
> >> 
> >> The tmp_mask_lock is the key. This is a global lock. I don't quite
> >> understand why tmp_mask_lock is needed here. I think there are two
> >> possible solutions here:
> >> 1. Remove this tmp_mask_lock
> > 
> > Maybe you could have a look at 33de0aa4bae98 (and 11ea68f553e24)? It
> > would allow you to understand the nature of the problem.
> > 
> > This can probably be replaced with a per-CPU cpumask, which would
> > avoid the locking, but potentially result in a larger memory usage.
> 
> Thanks, I will try it.

A simple alternative would be this:

diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index dd53298ef1a5..0d11b74af38c 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -224,15 +224,12 @@ int irq_do_set_affinity(struct irq_data *data, const struct cpumask *mask,
 	struct irq_desc *desc = irq_data_to_desc(data);
 	struct irq_chip *chip = irq_data_get_irq_chip(data);
 	const struct cpumask  *prog_mask;
+	struct cpumask tmp_mask = {};
 	int ret;
 
-	static DEFINE_RAW_SPINLOCK(tmp_mask_lock);
-	static struct cpumask tmp_mask;
-
 	if (!chip || !chip->irq_set_affinity)
 		return -EINVAL;
 
-	raw_spin_lock(&tmp_mask_lock);
 	/*
 	 * If this is a managed interrupt and housekeeping is enabled on
 	 * it check whether the requested affinity mask intersects with
@@ -280,8 +277,6 @@ int irq_do_set_affinity(struct irq_data *data, const struct cpumask *mask,
 	else
 		ret = -EINVAL;
 
-	raw_spin_unlock(&tmp_mask_lock);
-
 	switch (ret) {
 	case IRQ_SET_MASK_OK:
 	case IRQ_SET_MASK_OK_DONE:

but that will eat a significant portion of your stack if your kernel is
configured for a large number of CPUs.

> 
> >> 2. Modify the GICv4 driver, do not perform VMOVP via
> >> irq_set_affinity.
> > 
> > Sure. You could also not use KVM at all if you don't care about interrupts
> > being delivered to your VM. We do not send a VMOVP just for fun. We
> > send it because your vcpu has moved to a different CPU, and the ITS
> > needs to know about that.
> 
> When a vcpu is moved to a different CPU, of course VMOVP has to be sent.
> I mean is it possible to call its_vpe_set_affinity() to send VMOVP by
> other means (instead of by calling the irq_set_affinity() API). So we
> can bypass this tmp_mask_lock.

The whole point of this infrastructure is that the VPE doorbell is the
control point for the VPE. If the VPE moves, then the change of
affinity *must* be done using irq_set_affinity(). All the locking is
constructed around that. Please read the abundant documentation that
exists in both the GIC code and KVM describing why this is done like
that.

> 
> > 
> > You seem to be misunderstanding the use case for GICv4: a partitioned
> > system, without any over-subscription, no vcpu migration between CPUs.
> > If that's not your setup, then GICv4 will always be a net loss
> > compared to SW injection with GICv3 (additional HW interaction,
> > doorbell interrupts).
> 
> Thanks for the explanation. The key to the problem is not vcpu migration
> between CPUs. The key point is that many vcpus execute vgic_v4_load() at
> the same time. Even if it is not migrated to another CPU, there may be a
> large number of vcpus executing vgic_v4_load() at the same time. For
> example, the service running in VMs has a large number of MMIO accesses
> and needs to return to userspace for emulation. Due to contention on
> tmp_mask_lock, performance will deteriorate.

That's only a symptom. And that doesn't affect only pathological VM
workloads, but all interrupts being moved around for any reason.

> 
> When the target CPU is the same CPU as the last run, there seems to be
> no need to call irq_set_affinity() in this case. I did a test and it was
> indeed able to alleviate the problem described above.

The premise is that irq_set_affinity() should be cheap when there
isn't much to do, and you are papering over the problem.

> 
> I feel it might be better to remove tmp_mask_lock or call
> its_vpe_set_affinity() in another way. So I mentioned these two ideas
> above.

The removal of this global lock is the only option in my opinion.
Either the cpumask becomes a stack variable, or it becomes a static
per-CPU variable. Both have drawbacks, but they are not a bottleneck
anymore.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.