Message-ID: <20241204140230.23858-1-wintera@linux.ibm.com>
Date: Wed, 4 Dec 2024 15:02:30 +0100
From: Alexandra Winter <wintera@...ux.ibm.com>
To: Rahul Rameshbabu <rrameshbabu@...dia.com>,
Saeed Mahameed <saeedm@...dia.com>, Tariq Toukan <tariqt@...dia.com>,
Leon Romanovsky <leon@...nel.org>, David Miller <davem@...emloft.net>,
Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
Eric Dumazet <edumazet@...gle.com>,
Andrew Lunn <andrew+netdev@...n.ch>
Cc: Nils Hoppmann <niho@...ux.ibm.com>, netdev@...r.kernel.org,
linux-s390@...r.kernel.org, Heiko Carstens <hca@...ux.ibm.com>,
Vasily Gorbik <gor@...ux.ibm.com>,
Alexander Gordeev <agordeev@...ux.ibm.com>,
Christian Borntraeger <borntraeger@...ux.ibm.com>,
Sven Schnelle <svens@...ux.ibm.com>,
Thorsten Winkler <twinkler@...ux.ibm.com>,
Simon Horman <horms@...nel.org>
Subject: [PATCH net-next] net/mlx5e: Transmit small messages in linear skb
Linearize the skb if the device uses an IOMMU and the data buffer can fit
into one page, so that the message is transferred to the card in one
transfer instead of two.
Performance issue:
------------------
Since commit 472c2e07eef0 ("tcp: add one skb cache for tx")
TCP skbs are always non-linear. Especially on platforms with an IOMMU,
mapping and unmapping two pages instead of one per transfer can make a
noticeable difference. On s390 we saw a 13% degradation in throughput
when running uperf with a request-response pattern with 1k payload and
250 parallel connections. See [0] for a discussion.
This patch mitigates these effects with a work-around in the mlx5 driver.
Notes on implementation:
------------------------
TCP skbs never contain any tailroom, so skb_linearize() will allocate a
new data buffer.
There is no need to handle the return code of skb_linearize(): if it
fails, we simply continue with the unchanged skb.
As mentioned in the discussion, an alternative, but more invasive,
approach would be to premap a coherent piece of memory into which small
skbs can be copied.
Measurement results:
--------------------
We see an improvement in throughput of up to 16% compared to kernel v6.12.
We measured throughput and CPU consumption of uperf benchmarks with
ConnectX-6 cards on s390 architecture and compared results of kernel v6.12
with and without this patch.
+------------------------------------------+
| Transactions per Second - Deviation in % |
+-------------------+----------------------+
| Workload          |                      |
| rr1c-1x1--50      |                 4.75 |
| rr1c-1x1-250      |                14.53 |
| rr1c-200x1000--50 |                 2.22 |
| rr1c-200x1000-250 |                12.24 |
+-------------------+----------------------+
| Server CPU Consumption - Deviation in %  |
+-------------------+----------------------+
| Workload          |                      |
| rr1c-1x1--50      |                -1.66 |
| rr1c-1x1-250      |               -10.00 |
| rr1c-200x1000--50 |                -0.83 |
| rr1c-200x1000-250 |                -8.71 |
+-------------------+----------------------+
Note:
- CPU consumption: less is better
- Client CPU consumption is similar
- Workload:
rr1c-<bytes sent>x<bytes received>-<parallel connections>
Highly transactional small data sizes (rr1c-1x1)
This is a Request & Response (RR) test that sends a 1-byte request
from the client and receives a 1-byte response from the server. This
is the smallest possible transactional workload test and is smaller
than most customer workloads. This test represents the RR overhead
costs.
Highly transactional medium data sizes (rr1c-200x1000)
Request & Response (RR) test that sends a 200-byte request from the
client and receives a 1000-byte response from the server. This test
should be representative of a typical user's interaction with a remote
web site.
Link: https://lore.kernel.org/netdev/20220907122505.26953-1-wintera@linux.ibm.com/#t [0]
Suggested-by: Rahul Rameshbabu <rrameshbabu@...dia.com>
Signed-off-by: Alexandra Winter <wintera@...ux.ibm.com>
Co-developed-by: Nils Hoppmann <niho@...ux.ibm.com>
Signed-off-by: Nils Hoppmann <niho@...ux.ibm.com>
---
drivers/net/ethernet/mellanox/mlx5/core/en_tx.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index f8c7912abe0e..421ba6798ca7 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -32,6 +32,7 @@
 #include <linux/tcp.h>
 #include <linux/if_vlan.h>
+#include <linux/iommu-dma.h>
 #include <net/geneve.h>
 #include <net/dsfield.h>
 #include "en.h"
@@ -269,6 +270,10 @@ static void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb,
 {
 	struct mlx5e_sq_stats *stats = sq->stats;
 
+	/* Don't require 2 IOMMU TLB entries, if one is sufficient */
+	if (use_dma_iommu(sq->pdev) && skb->truesize <= PAGE_SIZE)
+		skb_linearize(skb);
+
 	if (skb_is_gso(skb)) {
 		int hopbyhop;
 		u16 ihs = mlx5e_tx_get_gso_ihs(sq, skb, &hopbyhop);
--
2.45.2