Message-ID: <a8e529b2-1454-4c3f-aa49-b3d989e1014a@intel.com>
Date: Wed, 4 Dec 2024 15:32:59 +0100
From: Alexander Lobakin <aleksander.lobakin@...el.com>
To: Alexandra Winter <wintera@...ux.ibm.com>
CC: Rahul Rameshbabu <rrameshbabu@...dia.com>, Saeed Mahameed
<saeedm@...dia.com>, Tariq Toukan <tariqt@...dia.com>, Leon Romanovsky
<leon@...nel.org>, David Miller <davem@...emloft.net>, Jakub Kicinski
<kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>, Eric Dumazet
<edumazet@...gle.com>, Andrew Lunn <andrew+netdev@...n.ch>, Nils Hoppmann
<niho@...ux.ibm.com>, <netdev@...r.kernel.org>, <linux-s390@...r.kernel.org>,
Heiko Carstens <hca@...ux.ibm.com>, Vasily Gorbik <gor@...ux.ibm.com>,
Alexander Gordeev <agordeev@...ux.ibm.com>, Christian Borntraeger
<borntraeger@...ux.ibm.com>, Sven Schnelle <svens@...ux.ibm.com>, "Thorsten
Winkler" <twinkler@...ux.ibm.com>, Simon Horman <horms@...nel.org>
Subject: Re: [PATCH net-next] net/mlx5e: Transmit small messages in linear skb
From: Alexandra Winter <wintera@...ux.ibm.com>
Date: Wed, 4 Dec 2024 15:02:30 +0100
> Linearize the skb if the device uses an IOMMU and the data buffer can
> fit into one page, so that messages can be sent to the card in one
> transfer instead of two.
I'd expect this to be done at the generic level rather than duplicated
across drivers, no?
Not sure about PAGE_SIZE, but I've never seen a NIC/driver/platform
where copying, let's say, 256 bytes would be slower than two dma_map
calls (even with direct DMA).
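Something along these lines in the core xmit path is what I have in
mind. Just a sketch: SKB_LINEAR_COPY_THRESHOLD is a made-up name for a
benchmark-derived constant, and the exact hook point is up for
discussion:

	/* Copy small non-GSO skbs into one linear buffer, so that the
	 * driver only has to map a single segment. Best effort: on
	 * failure, we just keep the original non-linear skb.
	 */
	if (skb_is_nonlinear(skb) && !skb_is_gso(skb) &&
	    skb->len <= SKB_LINEAR_COPY_THRESHOLD)
		skb_linearize(skb);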
>
> Performance issue:
> ------------------
> Since commit 472c2e07eef0 ("tcp: add one skb cache for tx"),
> TCP skbs are always non-linear. Especially on platforms with an IOMMU,
> mapping and unmapping two pages instead of one per transfer can make a
> noticeable difference. On s390 we saw a 13% degradation in throughput
> when running uperf with a request-response pattern with 1k payload and
> 250 parallel connections. See [0] for a discussion.
>
> This patch mitigates these effects using a work-around in the mlx5 driver.
>
> Notes on implementation:
> ------------------------
> TCP skbs never contain any tailroom, so skb_linearize() will allocate a
> new data buffer.
> There is no need to handle the return code of skb_linearize(): if it
> fails, we simply continue with the unchanged skb.
>
> As mentioned in the discussion, an alternative, but more invasive,
> approach would be to premap a coherent piece of memory into which
> small skbs can be copied.
Yes, that one would be better.
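Very roughly along these lines (a sketch only: the bounce fields,
BOUNCE_SZ, and the descriptor setup are made up, and lkey, ring
wrap-around, and completion handling are all glossed over):

	/* at SQ init: one premapped, coherent bounce slot per WQE */
	sq->bounce = dma_alloc_coherent(sq->pdev, sq->wq_sz * BOUNCE_SZ,
					&sq->bounce_dma, GFP_KERNEL);

	/* in the xmit path: copy small skbs into the premapped area
	 * instead of dma_map'ing each fragment separately
	 */
	if (skb->len <= BOUNCE_SZ) {
		void *buf = sq->bounce + pi * BOUNCE_SZ;

		skb_copy_bits(skb, 0, buf, skb->len);
		dseg->addr       = cpu_to_be64(sq->bounce_dma + pi * BOUNCE_SZ);
		dseg->byte_count = cpu_to_be32(skb->len);
	}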
[...]
> @@ -269,6 +270,10 @@ static void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb,
> {
> 	struct mlx5e_sq_stats *stats = sq->stats;
>
> +	/* Don't require 2 IOMMU TLB entries if one is sufficient */
> +	if (use_dma_iommu(sq->pdev) && skb->truesize <= PAGE_SIZE)
1. What about direct DMA? I believe it would benefit, too?
2. Why truesize, and not something like

	if (skb->len <= some_sane_value_maybe_1k)

3. As Eric mentioned, PAGE_SIZE can be up to 256 KB; I don't think
   it's a good idea to rely on it.
   A test-based hardcoded value would be enough (i.e. the threshold
   at which DMA mapping starts performing better than copying).
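Taking the points above together, I'd rather expect something like
this, with SOME_MEASURED_THRESHOLD being a hypothetical constant
derived from benchmarks, not tied to PAGE_SIZE or to the IOMMU:

	/* Copying a tiny skb is cheaper than mapping two fragments,
	 * with direct DMA just as with an IOMMU.
	 */
	if (skb->len <= SOME_MEASURED_THRESHOLD)
		skb_linearize(skb);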
> +		skb_linearize(skb);
> +
> 	if (skb_is_gso(skb)) {
BTW, can't there be a case where the skb is GSO, but its truesize is
<= PAGE_SIZE, so that linearizing it would be way too slow? (Not sure
it's possible, just guessing.)
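If it is possible, an explicit guard in the new check would avoid it,
e.g. (with the same hypothetical threshold as above):

	if (!skb_is_gso(skb) && skb->len <= SOME_MEASURED_THRESHOLD)
		skb_linearize(skb);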
> 		int hopbyhop;
> 		u16 ihs = mlx5e_tx_get_gso_ihs(sq, skb, &hopbyhop);
Thanks,
Olek