linux-kernel - RE: [PATCH 3/4 net-next] net: mana: add a function to spread IRQs per CPUs

Message-ID:
 <SN6PR02MB4157CB3CB55A17255AE61BF6D46A2@SN6PR02MB4157.namprd02.prod.outlook.com>
Date: Tue, 9 Jan 2024 19:22:38 +0000
From: Michael Kelley <mhklinux@...look.com>
To: Souradeep Chakrabarti <schakrabarti@...ux.microsoft.com>,
	"kys@...rosoft.com" <kys@...rosoft.com>, "haiyangz@...rosoft.com"
	<haiyangz@...rosoft.com>, "wei.liu@...nel.org" <wei.liu@...nel.org>,
	"decui@...rosoft.com" <decui@...rosoft.com>, "davem@...emloft.net"
	<davem@...emloft.net>, "edumazet@...gle.com" <edumazet@...gle.com>,
	"kuba@...nel.org" <kuba@...nel.org>, "pabeni@...hat.com" <pabeni@...hat.com>,
	"longli@...rosoft.com" <longli@...rosoft.com>, "yury.norov@...il.com"
	<yury.norov@...il.com>, "leon@...nel.org" <leon@...nel.org>,
	"cai.huoqing@...ux.dev" <cai.huoqing@...ux.dev>,
	"ssengar@...ux.microsoft.com" <ssengar@...ux.microsoft.com>,
	"vkuznets@...hat.com" <vkuznets@...hat.com>, "tglx@...utronix.de"
	<tglx@...utronix.de>, "linux-hyperv@...r.kernel.org"
	<linux-hyperv@...r.kernel.org>, "netdev@...r.kernel.org"
	<netdev@...r.kernel.org>, "linux-kernel@...r.kernel.org"
	<linux-kernel@...r.kernel.org>, "linux-rdma@...r.kernel.org"
	<linux-rdma@...r.kernel.org>
CC: "schakrabarti@...rosoft.com" <schakrabarti@...rosoft.com>,
	"paulros@...rosoft.com" <paulros@...rosoft.com>
Subject: RE: [PATCH 3/4 net-next] net: mana: add a function to spread IRQs per
 CPUs

From: Souradeep Chakrabarti <schakrabarti@...ux.microsoft.com> Sent: Tuesday, January 9, 2024 2:51 AM
> 
> From: Yury Norov <yury.norov@...il.com>
> 
> Souradeep found that the driver performs faster if IRQs are
> spread across CPUs with the following heuristics:
> 
> 1. No more than one IRQ per CPU, if possible;
> 2. NUMA locality is the second priority;
> 3. Sibling dislocality is the last priority.
> 
> Let's consider this topology:
> 
> Node            0               1
> Core        0       1       2       3
> CPU       0   1   2   3   4   5   6   7
> 
> The most performant IRQ distribution based on the above topology
> and heuristics may look like this:
> 
> IRQ     Nodes   Cores   CPUs
> 0       1       0       0-1
> 1       1       1       2-3
> 2       1       0       0-1
> 3       1       1       2-3
> 4       2       2       4-5
> 5       2       3       6-7
> 6       2       2       4-5
> 7       2       3       6-7

I didn't pay attention to the detailed discussion of this issue
over the past 2 to 3 weeks during the holidays in the U.S., but
the above doesn't align with the original problem as I understood
it.  I thought the original problem was to avoid putting IRQs on
both hyper-threads in the same core, and that the perf
improvements are based on that configuration.  At least that's
what the commit message for Patch 4/4 in this series says.

The above chart results in 8 IRQs being assigned to the 8 CPUs,
probably with 1 IRQ per CPU.  At least on x86, if the affinity
mask for an IRQ contains multiple CPUs, matrix_find_best_cpu()
should balance the IRQ assignments between the CPUs in the mask.
So the original problem is still present because both hyper-threads
in a core are likely to have an IRQ assigned.
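
To make the balancing argument concrete, here is a minimal userspace
sketch. It is not the kernel's matrix_find_best_cpu(); it assumes a
simplified stand-in policy that picks the CPU in the affinity mask
currently carrying the fewest IRQs. The names pick_least_loaded_cpu(),
place_irqs() and table_masks are hypothetical and exist only for this
illustration. The masks are the sibling pairs from the chart above.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 8

/*
 * Hypothetical, simplified stand-in for the balancing step: pick the CPU
 * in 'mask' that currently carries the fewest IRQs. The lowest-numbered
 * CPU wins ties. This is an assumption, not the kernel's actual logic.
 */
static int pick_least_loaded_cpu(uint32_t mask, const int load[NR_CPUS])
{
	int best = -1;

	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (!(mask & (1u << cpu)))
			continue;
		if (best < 0 || load[cpu] < load[best])
			best = cpu;
	}
	return best;
}

/*
 * Place 'nr_irqs' IRQs, one per affinity mask, and record the chosen CPU
 * for each IRQ in 'target'. Returns the number of CPUs that ended up
 * with at least one IRQ.
 */
static int place_irqs(const uint32_t *masks, int nr_irqs, int *target)
{
	int load[NR_CPUS] = { 0 };
	int busy = 0;

	for (int i = 0; i < nr_irqs; i++) {
		int cpu = pick_least_loaded_cpu(masks[i], load);

		assert(cpu >= 0);	/* every mask must contain a CPU */
		if (load[cpu]++ == 0)
			busy++;
		target[i] = cpu;
	}
	return busy;
}

/* Affinity masks from the chart: sibling pairs 0-1, 2-3, 4-5, 6-7. */
static const uint32_t table_masks[8] = {
	0x03, 0x0c, 0x03, 0x0c, 0x30, 0xc0, 0x30, 0xc0,
};
```

Under this model, place_irqs(table_masks, 8, target) reports 8 busy
CPUs: IRQs 0 and 2 share the mask 0-1 and land on CPU 0 and CPU 1
respectively, and the same happens for every other core.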

Of course, this example has 8 IRQs and 8 CPUs, so assigning an
IRQ to every hyper-thread may be the only choice.  If that's the
case, maybe this just isn't a good example to illustrate the
original problem and solution.  But even with a better example
where the # of IRQs is <= half the # of CPUs in a NUMA node,
I don't think the code below accomplishes the original intent.

Maybe I've missed something along the way in getting to this
version of the patch.  Please feel free to set me straight. :-)

Michael

> 
> The irq_setup() routine introduced in this patch leverages the
> for_each_numa_hop_mask() iterator and assigns IRQs to sibling groups
> as described above.
> 
> According to [1], for NUMA-aware but sibling-ignorant IRQ distribution
> based on cpumask_local_spread(), performance test results look like this:
> 
> /ntttcp -r -m 16
> NTTTCP for Linux 1.4.0
> ---------------------------------------------------------
> 08:05:20 INFO: 17 threads created
> 08:05:28 INFO: Network activity progressing...
> 08:06:28 INFO: Test run completed.
> 08:06:28 INFO: Test cycle finished.
> 08:06:28 INFO: #####  Totals:  #####
> 08:06:28 INFO: test duration    :60.00 seconds
> 08:06:28 INFO: total bytes      :630292053310
> 08:06:28 INFO:   throughput     :84.04Gbps
> 08:06:28 INFO:   retrans segs   :4
> 08:06:28 INFO: cpu cores        :192
> 08:06:28 INFO:   cpu speed      :3799.725MHz
> 08:06:28 INFO:   user           :0.05%
> 08:06:28 INFO:   system         :1.60%
> 08:06:28 INFO:   idle           :96.41%
> 08:06:28 INFO:   iowait         :0.00%
> 08:06:28 INFO:   softirq        :1.94%
> 08:06:28 INFO:   cycles/byte    :2.50
> 08:06:28 INFO: cpu busy (all)   :534.41%
> 
> For NUMA- and sibling-aware IRQ distribution, the same test works
> 15% faster:
> 
> /ntttcp -r -m 16
> NTTTCP for Linux 1.4.0
> ---------------------------------------------------------
> 08:08:51 INFO: 17 threads created
> 08:08:56 INFO: Network activity progressing...
> 08:09:56 INFO: Test run completed.
> 08:09:56 INFO: Test cycle finished.
> 08:09:56 INFO: #####  Totals:  #####
> 08:09:56 INFO: test duration    :60.00 seconds
> 08:09:56 INFO: total bytes      :741966608384
> 08:09:56 INFO:   throughput     :98.93Gbps
> 08:09:56 INFO:   retrans segs   :6
> 08:09:56 INFO: cpu cores        :192
> 08:09:56 INFO:   cpu speed      :3799.791MHz
> 08:09:56 INFO:   user           :0.06%
> 08:09:56 INFO:   system         :1.81%
> 08:09:56 INFO:   idle           :96.18%
> 08:09:56 INFO:   iowait         :0.00%
> 08:09:56 INFO:   softirq        :1.95%
> 08:09:56 INFO:   cycles/byte    :2.25
> 08:09:56 INFO: cpu busy (all)   :569.22%
> 
> [1] https://lore.kernel.org/all/20231211063726.GA4977@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net/
> 
> Signed-off-by: Yury Norov <yury.norov@...il.com>
> Co-developed-by: Souradeep Chakrabarti <schakrabarti@...ux.microsoft.com>
> ---
>  .../net/ethernet/microsoft/mana/gdma_main.c   | 29 +++++++++++++++++++
>  1 file changed, 29 insertions(+)
> 
> diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> index 6367de0c2c2e..6a967d6be01e 100644
> --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> @@ -1243,6 +1243,35 @@ void mana_gd_free_res_map(struct gdma_resource *r)
>  	r->size = 0;
>  }
> 
> +static __maybe_unused int irq_setup(unsigned int *irqs, unsigned int len, int node)
> +{
> +	const struct cpumask *next, *prev = cpu_none_mask;
> +	cpumask_var_t cpus __free(free_cpumask_var);
> +	int cpu, weight;
> +
> +	if (!alloc_cpumask_var(&cpus, GFP_KERNEL))
> +		return -ENOMEM;
> +
> +	rcu_read_lock();
> +	for_each_numa_hop_mask(next, node) {
> +		weight = cpumask_weight_andnot(next, prev);
> +		while (weight > 0) {
> +			cpumask_andnot(cpus, next, prev);
> +			for_each_cpu(cpu, cpus) {
> +				if (len-- == 0)
> +					goto done;
> +				irq_set_affinity_and_hint(*irqs++, topology_sibling_cpumask(cpu));
> +				cpumask_andnot(cpus, cpus, topology_sibling_cpumask(cpu));
> +				--weight;
> +			}
> +		}
> +		prev = next;
> +	}
> +done:
> +	rcu_read_unlock();
> +	return 0;
> +}
> +
>  static int mana_gd_setup_irqs(struct pci_dev *pdev)
>  {
>  	unsigned int max_queues_per_port = num_online_cpus();
> --
> 2.34.1
>
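
For readers who want to trace the loop in the quoted irq_setup() by
hand, the following is a rough userspace model of it, assuming the
8-CPU example topology from the commit message (2 nodes, 2 cores per
node, 2 hyper-threads per core). Plain bitmasks replace struct cpumask,
and hop_mask() and sibling_mask() are hypothetical stand-ins for
for_each_numa_hop_mask() and topology_sibling_cpumask(). It records the
affinity mask each IRQ would receive and does not touch any real IRQ.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS		8
#define NR_NODES	2

/* Assumed topology: CPUs 2n and 2n+1 are hyper-threads of the same core. */
static uint32_t sibling_mask(int cpu)
{
	return 0x3u << (cpu & ~1);
}

/*
 * Cumulative hop masks as seen from 'node': hop 0 is the local node
 * (4 CPUs), hop 1 adds the remote node. Hypothetical stand-in for the
 * masks that for_each_numa_hop_mask() walks.
 */
static uint32_t hop_mask(int node, int hop)
{
	uint32_t local = 0xfu << (4 * node);

	return hop == 0 ? local : 0xffu;
}

/* Number of set bits in 'mask'. */
static int weight(uint32_t mask)
{
	int n = 0;

	for (; mask; mask &= mask - 1)
		n++;
	return n;
}

/*
 * Mirrors the loop structure of the quoted irq_setup(): one affinity
 * mask is written to 'irq_masks' per IRQ, up to 'len' entries.
 */
static void model_irq_setup(uint32_t *irq_masks, unsigned int len, int node)
{
	uint32_t prev = 0;

	assert(node >= 0 && node < NR_NODES);
	for (int hop = 0; hop < NR_NODES; hop++) {
		uint32_t next = hop_mask(node, hop);
		int w = weight(next & ~prev);

		while (w > 0) {
			uint32_t cpus = next & ~prev;

			/* The mask is re-read on every step, as for_each_cpu() does. */
			for (int cpu = 0; cpu < NR_CPUS; cpu++) {
				if (!(cpus & (1u << cpu)))
					continue;
				if (len-- == 0)
					return;
				*irq_masks++ = sibling_mask(cpu);
				cpus &= ~sibling_mask(cpu);
				--w;
			}
		}
		prev = next;
	}
}
```

Calling model_irq_setup(masks, 8, 0) on this topology fills masks with
0x03, 0x0c, 0x03, 0x0c, 0x30, 0xc0, 0x30, 0xc0, which corresponds to
the CPUs column 0-1, 2-3, 0-1, 2-3, 4-5, 6-7, 4-5, 6-7 in the commit
message.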