Message-ID: <f250fc62-a4a6-6543-d688-e755729a7291@gmail.com>
Date: Mon, 24 Oct 2022 14:24:58 +0300
From: Tariq Toukan <ttoukan.linux@...il.com>
To: Valentin Schneider <vschneid@...hat.com>, netdev@...r.kernel.org,
linux-rdma@...r.kernel.org, linux-kernel@...r.kernel.org
Cc: Tariq Toukan <tariqt@...dia.com>,
Saeed Mahameed <saeedm@...dia.com>,
Leon Romanovsky <leon@...nel.org>,
"David S. Miller" <davem@...emloft.net>,
Eric Dumazet <edumazet@...gle.com>,
Jakub Kicinski <kuba@...nel.org>,
Paolo Abeni <pabeni@...hat.com>,
Yury Norov <yury.norov@...il.com>,
Andy Shevchenko <andriy.shevchenko@...ux.intel.com>,
Rasmus Villemoes <linux@...musvillemoes.dk>,
Ingo Molnar <mingo@...nel.org>,
Peter Zijlstra <peterz@...radead.org>,
Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>,
Mel Gorman <mgorman@...e.de>,
Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
Heiko Carstens <hca@...ux.ibm.com>,
Tony Luck <tony.luck@...el.com>,
Jonathan Cameron <Jonathan.Cameron@...wei.com>,
Gal Pressman <gal@...dia.com>,
Jesse Brandeburg <jesse.brandeburg@...el.com>
Subject: Re: [PATCH v5 3/3] net/mlx5e: Improve remote NUMA preferences used
for the IRQ affinity hints
On 10/21/2022 3:19 PM, Valentin Schneider wrote:
> From: Tariq Toukan <tariqt@...dia.com>
>
> In the IRQ affinity hints, replace the binary NUMA preference (local /
> remote) with the improved for_each_numa_hop_mask() API, which takes the
> actual inter-node distances into account, so that remote NUMA nodes at
> a short distance are preferred over farther ones.
>
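For reference, here is a minimal standalone userspace sketch of the
ordering this buys us (the 4-node distance table and the 2-CPUs-per-node
layout are hypothetical, purely for illustration): vectors are handed
CPUs of closer nodes before CPUs of farther ones.

    /* Userspace illustration only, not kernel code. */
    #include <stdio.h>

    #define NR_NODES      4
    #define CPUS_PER_NODE 2

    /* Hypothetical SLIT-style distance matrix: dist[i][j]. */
    static const int dist[NR_NODES][NR_NODES] = {
            { 10, 12, 20, 20 },
            { 12, 10, 20, 20 },
            { 20, 20, 10, 12 },
            { 20, 20, 12, 10 },
    };

    int main(void)
    {
            int home = 0, order[NR_NODES], picked[NR_NODES] = { 0 };
            int vec = 0;

            /* Selection sort of nodes by distance from the home node. */
            for (int i = 0; i < NR_NODES; i++) {
                    int best = -1;

                    for (int j = 0; j < NR_NODES; j++)
                            if (!picked[j] &&
                                (best < 0 || dist[home][j] < dist[home][best]))
                                    best = j;
                    picked[best] = 1;
                    order[i] = best;
            }

            /* Hand out CPUs in node order: the hint assignment order. */
            for (int i = 0; i < NR_NODES; i++)
                    for (int c = 0; c < CPUS_PER_NODE; c++)
                            printf("vector %d -> cpu %d (node %d, dist %d)\n",
                                   vec++, order[i] * CPUS_PER_NODE + c,
                                   order[i], dist[home][order[i]]);
            return 0;
    }
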
> This has significant performance implications when NUMA-aware allocated
> memory is in use (see [1] and its derivatives, for example).
>
> [1]
> drivers/net/ethernet/mellanox/mlx5/core/en_main.c :: mlx5e_open_channel()
> int cpu = cpumask_first(mlx5_comp_irq_get_affinity_mask(priv->mdev, ix));
>
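Side note: the payoff of a better hint is NUMA-local memory. A driver
keys its per-queue allocations off the hint roughly along these lines;
the helper below is a kernel-style sketch with an illustrative name,
not the exact mlx5e code.

    /* Kernel-style sketch (illustrative name, not the exact mlx5e code):
     * place a queue's memory on the NUMA node of the CPU named by the
     * IRQ affinity hint, so interrupts and memory land on the same node.
     */
    static void *alloc_queue_mem_on_hint(struct mlx5_core_dev *mdev,
                                         int ix, size_t sz)
    {
            int cpu = cpumask_first(mlx5_comp_irq_get_affinity_mask(mdev, ix));

            return kvzalloc_node(sz, GFP_KERNEL, cpu_to_node(cpu));
    }
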
> Performance tests:
>
> TCP multi-stream, using 16 iperf3 instances pinned to 16 cores (with aRFS on).
> Active cores: 64,65,72,73,80,81,88,89,96,97,104,105,112,113,120,121
>
> +-------------------------+-----------+------------------+------------------+
> | | BW (Gbps) | TX side CPU util | RX side CPU util |
> +-------------------------+-----------+------------------+------------------+
> | Baseline | 52.3 | 6.4 % | 17.9 % |
> +-------------------------+-----------+------------------+------------------+
> | Applied on TX side only | 52.6 | 5.2 % | 18.5 % |
> +-------------------------+-----------+------------------+------------------+
> | Applied on RX side only | 94.9 | 11.9 % | 27.2 % |
> +-------------------------+-----------+------------------+------------------+
> | Applied on both sides | 95.1 | 8.4 % | 27.3 % |
> +-------------------------+-----------+------------------+------------------+
>
> The RX-side bottleneck is released; line rate is reached (~1.8x speedup).
> TX-side CPU utilization drops by ~30%.
>
> * CPU utilization is measured on the active cores only.
>
> Setups details (similar for both sides):
>
> NIC: ConnectX-6 Dx dual port, 100 Gbps each.
> Single port used in the tests.
>
> $ lscpu
> Architecture: x86_64
> CPU op-mode(s): 32-bit, 64-bit
> Byte Order: Little Endian
> CPU(s): 256
> On-line CPU(s) list: 0-255
> Thread(s) per core: 2
> Core(s) per socket: 64
> Socket(s): 2
> NUMA node(s): 16
> Vendor ID: AuthenticAMD
> CPU family: 25
> Model: 1
> Model name: AMD EPYC 7763 64-Core Processor
> Stepping: 1
> CPU MHz: 2594.804
> BogoMIPS: 4890.73
> Virtualization: AMD-V
> L1d cache: 32K
> L1i cache: 32K
> L2 cache: 512K
> L3 cache: 32768K
> NUMA node0 CPU(s): 0-7,128-135
> NUMA node1 CPU(s): 8-15,136-143
> NUMA node2 CPU(s): 16-23,144-151
> NUMA node3 CPU(s): 24-31,152-159
> NUMA node4 CPU(s): 32-39,160-167
> NUMA node5 CPU(s): 40-47,168-175
> NUMA node6 CPU(s): 48-55,176-183
> NUMA node7 CPU(s): 56-63,184-191
> NUMA node8 CPU(s): 64-71,192-199
> NUMA node9 CPU(s): 72-79,200-207
> NUMA node10 CPU(s): 80-87,208-215
> NUMA node11 CPU(s): 88-95,216-223
> NUMA node12 CPU(s): 96-103,224-231
> NUMA node13 CPU(s): 104-111,232-239
> NUMA node14 CPU(s): 112-119,240-247
> NUMA node15 CPU(s): 120-127,248-255
> ..
...
>
> Signed-off-by: Tariq Toukan <tariqt@...dia.com>
> [Tweaked API use]
Thanks for the modification; it looks good to me.
Signed-off-by: Tariq Toukan <tariqt@...dia.com>
> Signed-off-by: Valentin Schneider <vschneid@...hat.com>
> ---
> drivers/net/ethernet/mellanox/mlx5/core/eq.c | 18 ++++++++++++++++--
> 1 file changed, 16 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> index a0242dc15741c..7acbeb3d51846 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> @@ -812,9 +812,12 @@ static void comp_irqs_release(struct mlx5_core_dev *dev)
> static int comp_irqs_request(struct mlx5_core_dev *dev)
> {
> struct mlx5_eq_table *table = dev->priv.eq_table;
> + const struct cpumask *prev = cpu_none_mask;
> + const struct cpumask *mask;
> int ncomp_eqs = table->num_comp_eqs;
> u16 *cpus;
> int ret;
> + int cpu;
> int i;
>
> ncomp_eqs = table->num_comp_eqs;
> @@ -833,8 +836,19 @@ static int comp_irqs_request(struct mlx5_core_dev *dev)
> ret = -ENOMEM;
> goto free_irqs;
> }
> - for (i = 0; i < ncomp_eqs; i++)
> - cpus[i] = cpumask_local_spread(i, dev->priv.numa_node);
> +
> + i = 0;
> + rcu_read_lock();
> + for_each_numa_hop_mask(mask, dev->priv.numa_node) {
> + for_each_cpu_andnot(cpu, mask, prev) {
> + cpus[i] = cpu;
> + if (++i == ncomp_eqs)
> + goto spread_done;
> + }
> + prev = mask;
> + }
> +spread_done:
> + rcu_read_unlock();
> ret = mlx5_irqs_request_vectors(dev, cpus, ncomp_eqs, table->comp_irqs);
> kfree(cpus);
> if (ret < 0)
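One detail worth calling out for readers of the hunk above: each
successive hop mask is a superset of the previous one, so iterating
over (mask & ~prev) visits only the CPUs added at the current distance,
and every CPU exactly once. Below is a minimal userspace model of that
step, with plain 64-bit masks standing in for cpumasks (the cumulative
masks are made up for illustration).

    #include <stdio.h>

    int main(void)
    {
            /* Cumulative hop masks: each one is a superset of the last. */
            unsigned long long hops[] = { 0x0fULL, 0xffULL, 0xffffULL };
            unsigned long long prev = 0;    /* cpu_none_mask equivalent */

            for (int h = 0; h < 3; h++) {
                    /* for_each_cpu_andnot(): in hops[h] but not in prev. */
                    unsigned long long added = hops[h] & ~prev;

                    for (int cpu = 0; cpu < 64; cpu++)
                            if (added & (1ULL << cpu))
                                    printf("hop %d: cpu %d\n", h, cpu);
                    prev = hops[h];
            }
            return 0;
    }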