linux-kernel - Re: [bug report] GICv4.1: multiple vpus execute vgic_v4_load at the same time will greatly increase the time consumption


Message-ID: <86zfp3wrmy.wl-maz@kernel.org>
Date: Fri, 23 Aug 2024 09:49:25 +0100
From: Marc Zyngier <maz@...nel.org>
To: Thomas Gleixner <tglx@...utronix.de>, Kunkun Jiang <jiangkunkun@...wei.com>
Cc: Oliver Upton <oliver.upton@...ux.dev>,
	James Morse <james.morse@....com>,
	Suzuki K Poulose <suzuki.poulose@....com>,
	Zenghui Yu <yuzenghui@...wei.com>,
	"open list:IRQ SUBSYSTEM" <linux-kernel@...r.kernel.org>,
	"moderated list:ARM SMMU DRIVERS" <linux-arm-kernel@...ts.infradead.org>,
	kvmarm@...ts.linux.dev,
	"wanghaibin.wang@...wei.com" <wanghaibin.wang@...wei.com>,
	nizhiqiang1@...wei.com,
	"tangnianyao@...wei.com" <tangnianyao@...wei.com>,
	wangzhou1@...ilicon.com
Subject: Re: [bug report] GICv4.1: multiple vpus execute vgic_v4_load at the same time will greatly increase the time consumption

On Thu, 22 Aug 2024 22:20:43 +0100,
Thomas Gleixner <tglx@...utronix.de> wrote:
> 
> On Thu, Aug 22 2024 at 13:47, Marc Zyngier wrote:
> > On Thu, 22 Aug 2024 11:59:50 +0100,
> > Kunkun Jiang <jiangkunkun@...wei.com> wrote:
> >> > but that will eat a significant portion of your stack if your kernel is
> >> > configured for a large number of CPUs.
> >> > 
> >> 
> >> Currently CONFIG_NR_CPUS=4096, each `struct cpumask` occupies 512 bytes.
> >
> > This seems crazy. Why would you build a kernel with something *that*
> > big, especially considering that you have a lot less than 1k CPUs?
> 
> That's why CONFIG_CPUMASK_OFFSTACK exists, but that does not help in
> that context. :)
>
> >> > The removal of this global lock is the only option in my opinion.
> >> > Either the cpumask becomes a stack variable, or it becomes a static
> >> > per-CPU variable. Both have drawbacks, but they are not a bottleneck
> >> > anymore.
> >> 
> >> I also prefer to remove the global lock. Which variable do you think is
> >> better?
> >
> > Given the number of CPUs your system is configured for, there is no
> > good answer. An on-stack variable is dangerously large, and a per-CPU
> > cpumask results in 2MB being allocated, which I find insane.
> 
> Only if there are actually 4096 CPUs enumerated. The per CPU magic is
> smart enough to limit the damage to the actual number of possible CPUs
> which are enumerated at boot time. It still will over-allocate due to
> NR_CPUS being insanely large but on a 4 CPU machine this boils down to
> 2k of memory waste unless Aaarg64 is stupid enough to allocate for
> NR_CPUS instead of num_possible_cpus()...

No difference between arm64 and xyz85.999 here.
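
For anyone wanting to check the figures quoted above, here is a rough
userspace sketch of the arithmetic. It is not kernel code: the helper names
(cpumask_bytes, percpu_cpumask_bytes, check_thread_figures) are made up for
illustration, and it assumes a cpumask is a plain bitmap of NR_CPUS bits
rounded up to whole unsigned longs, with no per-CPU area overhead counted.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Bits in one unsigned long, the unit the bitmap is assumed to be built from. */
#define BITS_PER_ULONG (sizeof(unsigned long) * CHAR_BIT)

/* Bytes needed for a bitmap holding one bit per configured CPU. */
static size_t cpumask_bytes(size_t nr_cpus)
{
	size_t longs = (nr_cpus + BITS_PER_ULONG - 1) / BITS_PER_ULONG;

	return longs * sizeof(unsigned long);
}

/* Total for one such mask per possible CPU (hypothetical helper). */
static size_t percpu_cpumask_bytes(size_t nr_cpus, size_t possible_cpus)
{
	return cpumask_bytes(nr_cpus) * possible_cpus;
}

/* Re-derives the three numbers mentioned in the thread. */
static void check_thread_figures(void)
{
	assert(cpumask_bytes(4096) == 512);                     /* 512 bytes per mask */
	assert(percpu_cpumask_bytes(4096, 4096) == 2097152);    /* 2MB on 4096 CPUs */
	assert(percpu_cpumask_bytes(4096, 4) == 2048);          /* 2k on 4 CPUs */
}
```

Calling check_thread_figures() from a small main() is enough to confirm
that the 512 byte, 2MB and 2k numbers are consistent with each other.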

> 
> That said, on a real 4k CPU system 2M of memory should be the least of
> your worries.

Don't underestimate the general level of insanity!

> 
> > You'll have to pick your own poison and convince Thomas of the
> > validity of your approach.
> 
> As this is an operation which is really not suitable for on demand
> or large stack allocations the per CPU approach makes sense.

Right, so let's shoot for that. Kunkun, can you please give the
following hack a go with your workload?

Thanks,

	M.

diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index dd53298ef1a5..b6aa259ac749 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -224,15 +224,16 @@ int irq_do_set_affinity(struct irq_data *data, const struct cpumask *mask,
 	struct irq_desc *desc = irq_data_to_desc(data);
 	struct irq_chip *chip = irq_data_get_irq_chip(data);
 	const struct cpumask  *prog_mask;
+	struct cpumask *tmp_mask;
 	int ret;
 
-	static DEFINE_RAW_SPINLOCK(tmp_mask_lock);
-	static struct cpumask tmp_mask;
+	static DEFINE_PER_CPU(struct cpumask, __tmp_mask);
 
 	if (!chip || !chip->irq_set_affinity)
 		return -EINVAL;
 
-	raw_spin_lock(&tmp_mask_lock);
+	tmp_mask = this_cpu_ptr(&__tmp_mask);
+
 	/*
 	 * If this is a managed interrupt and housekeeping is enabled on
 	 * it check whether the requested affinity mask intersects with
@@ -258,11 +259,11 @@ int irq_do_set_affinity(struct irq_data *data, const struct cpumask *mask,
 
 		hk_mask = housekeeping_cpumask(HK_TYPE_MANAGED_IRQ);
 
-		cpumask_and(&tmp_mask, mask, hk_mask);
-		if (!cpumask_intersects(&tmp_mask, cpu_online_mask))
+		cpumask_and(tmp_mask, mask, hk_mask);
+		if (!cpumask_intersects(tmp_mask, cpu_online_mask))
 			prog_mask = mask;
 		else
-			prog_mask = &tmp_mask;
+			prog_mask = tmp_mask;
 	} else {
 		prog_mask = mask;
 	}
@@ -272,16 +273,14 @@ int irq_do_set_affinity(struct irq_data *data, const struct cpumask *mask,
 	 * unless we are being asked to force the affinity (in which
 	 * case we do as we are told).
 	 */
-	cpumask_and(&tmp_mask, prog_mask, cpu_online_mask);
-	if (!force && !cpumask_empty(&tmp_mask))
-		ret = chip->irq_set_affinity(data, &tmp_mask, force);
+	cpumask_and(tmp_mask, prog_mask, cpu_online_mask);
+	if (!force && !cpumask_empty(tmp_mask))
+		ret = chip->irq_set_affinity(data, tmp_mask, force);
 	else if (force)
 		ret = chip->irq_set_affinity(data, mask, force);
 	else
 		ret = -EINVAL;
 
-	raw_spin_unlock(&tmp_mask_lock);
-
 	switch (ret) {
 	case IRQ_SET_MASK_OK:
 	case IRQ_SET_MASK_OK_DONE:
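
The idea behind the hack, stripped of the kernel specifics, can be sketched
in userspace as below. This is only an analogy under stated assumptions:
_Thread_local stands in for DEFINE_PER_CPU(), and struct mask, MASK_WORDS,
mask_and(), mask_empty(), pick_effective() and scratch_demo() are
hypothetical names, not kernel APIs. In the kernel the scratch mask is only
safe to use while the caller cannot migrate to another CPU, which this
sketch does not model.

```c
#include <assert.h>
#include <stdint.h>

#define MASK_WORDS 8	/* hypothetical: a 512-bit mask */

struct mask {
	uint64_t bits[MASK_WORDS];
};

/* One scratch mask per thread, so no lock is needed to use it. */
static _Thread_local struct mask scratch_mask;

static void mask_and(struct mask *dst, const struct mask *a,
		     const struct mask *b)
{
	for (int i = 0; i < MASK_WORDS; i++)
		dst->bits[i] = a->bits[i] & b->bits[i];
}

static int mask_empty(const struct mask *m)
{
	for (int i = 0; i < MASK_WORDS; i++)
		if (m->bits[i])
			return 0;
	return 1;
}

/*
 * Intersect the requested mask with the online mask using the private
 * scratch buffer. Returns 0 and fills *out on success, -1 if the
 * intersection is empty.
 */
static int pick_effective(const struct mask *requested,
			  const struct mask *online, struct mask *out)
{
	struct mask *tmp = &scratch_mask;	/* private to this thread */

	mask_and(tmp, requested, online);
	if (mask_empty(tmp))
		return -1;

	*out = *tmp;
	return 0;
}

/* Small self-check: CPUs {1,2} requested, CPUs {0,1} online. */
static void scratch_demo(void)
{
	struct mask req = { { 0x6 } }, online = { { 0x3 } }, out = { { 0 } };

	assert(pick_effective(&req, &online, &out) == 0);
	assert(out.bits[0] == 0x2);
}
```

The trade-off is the one discussed in the thread: the shared lock goes away
at the cost of one scratch mask per execution context.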

-- 
Without deviation from the norm, progress is not possible.