Message-ID: <CANn89iJZfKntPrZdC=oc0_8j89a7was90+6Fh=fCf4hR7LZYSQ@mail.gmail.com>
Date: Wed, 4 Dec 2024 15:36:03 +0100
From: Eric Dumazet <edumazet@...gle.com>
To: Alexandra Winter <wintera@...ux.ibm.com>
Cc: Rahul Rameshbabu <rrameshbabu@...dia.com>, Saeed Mahameed <saeedm@...dia.com>,
Tariq Toukan <tariqt@...dia.com>, Leon Romanovsky <leon@...nel.org>, David Miller <davem@...emloft.net>,
Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>, Andrew Lunn <andrew+netdev@...n.ch>,
Nils Hoppmann <niho@...ux.ibm.com>, netdev@...r.kernel.org, linux-s390@...r.kernel.org,
Heiko Carstens <hca@...ux.ibm.com>, Vasily Gorbik <gor@...ux.ibm.com>,
Alexander Gordeev <agordeev@...ux.ibm.com>, Christian Borntraeger <borntraeger@...ux.ibm.com>,
Sven Schnelle <svens@...ux.ibm.com>, Thorsten Winkler <twinkler@...ux.ibm.com>,
Simon Horman <horms@...nel.org>
Subject: Re: [PATCH net-next] net/mlx5e: Transmit small messages in linear skb
On Wed, Dec 4, 2024 at 3:16 PM Eric Dumazet <edumazet@...gle.com> wrote:
>
> On Wed, Dec 4, 2024 at 3:02 PM Alexandra Winter <wintera@...ux.ibm.com> wrote:
> >
> > Linearize the skb if the device uses IOMMU and the data buffer can fit
> > into one page. So messages can be transferred in one transfer to the card
> > instead of two.
> >
> > Performance issue:
> > ------------------
> > Since commit 472c2e07eef0 ("tcp: add one skb cache for tx")
> > tcp skbs are always non-linear. Especially on platforms with IOMMU,
> > mapping and unmapping two pages instead of one per transfer can make a
> > noticeable difference. On s390 we saw a 13% degradation in throughput,
> > when running uperf with a request-response pattern with 1k payload and
> > 250 connections parallel. See [0] for a discussion.
> >
> > This patch mitigates these effects using a work-around in the mlx5 driver.
> >
> > Notes on implementation:
> > ------------------------
> > TCP skbs never contain any tailroom, so skb_linearize() will allocate a
> > new data buffer.
> > No need to handle rc of skb_linearize(). If it fails, we continue with the
> > unchanged skb.
> >
> > As mentioned in the discussion, an alternative, but more invasive approach
> > would be: premapping a coherent piece of memory in which you can copy
> > small skbs.
> >
> > Measurement results:
> > --------------------
> > We see an improvement in throughput of up to 16% compared to kernel v6.12.
> > We measured throughput and CPU consumption of uperf benchmarks with
> > ConnectX-6 cards on s390 architecture and compared results of kernel v6.12
> > with and without this patch.
> >
> > +------------------------------------------+
> > | Transactions per Second - Deviation in % |
> > +-------------------+----------------------+
> > | Workload | |
> > | rr1c-1x1--50 | 4.75 |
> > | rr1c-1x1-250 | 14.53 |
> > | rr1c-200x1000--50 | 2.22 |
> > | rr1c-200x1000-250 | 12.24 |
> > +-------------------+----------------------+
> > | Server CPU Consumption - Deviation in % |
> > +-------------------+----------------------+
> > | Workload | |
> > | rr1c-1x1--50 | -1.66 |
> > | rr1c-1x1-250 | -10.00 |
> > | rr1c-200x1000--50 | -0.83 |
> > | rr1c-200x1000-250 | -8.71 |
> > +-------------------+----------------------+
> >
> > Note:
> > - CPU consumption: less is better
> > - Client CPU consumption is similar
> > - Workload:
> > rr1c-<bytes sent>x<bytes received>-<parallel connections>
> >
> > Highly transactional small data sizes (rr1c-1x1)
> > This is a Request & Response (RR) test that sends a 1-byte request
> > from the client and receives a 1-byte response from the server. This
> > is the smallest possible transactional workload test and is smaller
> > than most customer workloads. This test represents the RR overhead
> > costs.
> > Highly transactional medium data sizes (rr1c-200x1000)
> > Request & Response (RR) test that sends a 200-byte request from the
> > client and receives a 1000-byte response from the server. This test
> > should be representative of a typical user's interaction with a remote
> > web site.
> >
> > Link: https://lore.kernel.org/netdev/20220907122505.26953-1-wintera@linux.ibm.com/#t [0]
> > Suggested-by: Rahul Rameshbabu <rrameshbabu@...dia.com>
> > Signed-off-by: Alexandra Winter <wintera@...ux.ibm.com>
> > Co-developed-by: Nils Hoppmann <niho@...ux.ibm.com>
> > Signed-off-by: Nils Hoppmann <niho@...ux.ibm.com>
> > ---
> > drivers/net/ethernet/mellanox/mlx5/core/en_tx.c | 5 +++++
> > 1 file changed, 5 insertions(+)
> >
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
> > index f8c7912abe0e..421ba6798ca7 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
> > @@ -32,6 +32,7 @@
> >
> > #include <linux/tcp.h>
> > #include <linux/if_vlan.h>
> > +#include <linux/iommu-dma.h>
> > #include <net/geneve.h>
> > #include <net/dsfield.h>
> > #include "en.h"
> > @@ -269,6 +270,10 @@ static void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb,
> > {
> > struct mlx5e_sq_stats *stats = sq->stats;
> >
> > + /* Don't require 2 IOMMU TLB entries, if one is sufficient */
> > + if (use_dma_iommu(sq->pdev) && skb->truesize <= PAGE_SIZE)
> > + skb_linearize(skb);
> > +
> > if (skb_is_gso(skb)) {
> > int hopbyhop;
> > u16 ihs = mlx5e_tx_get_gso_ihs(sq, skb, &hopbyhop);
> > --
> > 2.45.2
>
>
> Was this tested on x86_64 or any other arch than s390, especially ones
> with PAGE_SIZE = 65536 ?

I would suggest the opposite: copy the headers (typically less than
128 bytes) into a piece of coherent memory.
As a bonus, if skb->len is smaller than 256 bytes, copy the whole skb.
Users of include/net/tso.h and net/core/tso.c do this.
Sure, the patch is going to be more invasive, but all arches will win.
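
To make the copy policy above concrete, here is a minimal user-space sketch, not driver code. It assumes a hypothetical per-descriptor slot (struct tx_slot) standing in for a piece of pre-mapped coherent memory, and a simplified packet model (struct pkt) standing in for an skb. The names tx_slot, pkt, tx_copy_small, HDR_COPY_MAX and SMALL_PKT_MAX are all made up for illustration; only the 128-byte and 256-byte thresholds come from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define HDR_COPY_MAX  128 /* headers are "typically less than 128 bytes" */
#define SMALL_PKT_MAX 256 /* packets shorter than this are copied whole */

/* Hypothetical stand-in for one slot of pre-mapped coherent memory. */
struct tx_slot {
	unsigned char buf[SMALL_PKT_MAX];
	size_t used;
};

/* Hypothetical, simplified packet: headers followed by payload. */
struct pkt {
	const unsigned char *data;
	size_t hdr_len; /* length of the protocol headers */
	size_t len;     /* total length, headers plus payload */
};

/*
 * Copy what the policy allows into the slot and return how many bytes
 * are left over, i.e. how many would still need a per-packet DMA mapping.
 */
static size_t tx_copy_small(struct tx_slot *slot, const struct pkt *p)
{
	size_t n;

	assert(slot && p && p->data && p->hdr_len <= p->len);

	if (p->len < SMALL_PKT_MAX)
		n = p->len;      /* small packet: copy everything */
	else if (p->hdr_len <= HDR_COPY_MAX)
		n = p->hdr_len;  /* larger packet: copy the headers only */
	else
		n = 0;           /* unusually large headers: copy nothing */

	memcpy(slot->buf, p->data, n);
	slot->used = n;
	return p->len - n;
}
```

In this model a 200-byte packet leaves nothing to map, while a 1066-byte packet with 66 bytes of headers leaves only the 1000-byte payload to map.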