Message-ID: <89c67aa3-4b84-45c1-9e7f-a608957d5aeb@nvidia.com>
Date: Wed, 11 Dec 2024 18:28:19 +0100
From: Dragos Tatulea <dtatulea@...dia.com>
To: Alexandra Winter <wintera@...ux.ibm.com>,
Alexander Lobakin <aleksander.lobakin@...el.com>
Cc: Rahul Rameshbabu <rrameshbabu@...dia.com>,
Saeed Mahameed <saeedm@...dia.com>, Tariq Toukan <tariqt@...dia.com>,
Leon Romanovsky <leon@...nel.org>, David Miller <davem@...emloft.net>,
Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
Eric Dumazet <edumazet@...gle.com>, Andrew Lunn <andrew+netdev@...n.ch>,
Nils Hoppmann <niho@...ux.ibm.com>, netdev@...r.kernel.org,
linux-s390@...r.kernel.org, Heiko Carstens <hca@...ux.ibm.com>,
Vasily Gorbik <gor@...ux.ibm.com>, Alexander Gordeev
<agordeev@...ux.ibm.com>, Christian Borntraeger <borntraeger@...ux.ibm.com>,
Sven Schnelle <svens@...ux.ibm.com>,
Thorsten Winkler <twinkler@...ux.ibm.com>, Simon Horman <horms@...nel.org>,
Niklas Schnelle <schnelle@...ux.ibm.com>
Subject: Re: [PATCH net-next] net/mlx5e: Transmit small messages in linear skb

Hi Alexandra,

On 11.12.24 14:35, Alexandra Winter wrote:
>
>
> On 10.12.24 14:54, Alexander Lobakin wrote:
>> From: Dragos Tatulea <dtatulea@...dia.com>
>> Date: Tue, 10 Dec 2024 12:44:04 +0100
>>
>>>
>>>
>>> On 06.12.24 16:20, Alexandra Winter wrote:
>>>>
>>>>
>>>> On 04.12.24 15:32, Alexander Lobakin wrote:
>>>>>> @@ -269,6 +270,10 @@ static void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb,
>>>>>>  {
>>>>>>  	struct mlx5e_sq_stats *stats = sq->stats;
>>>>>>  
>>>>>> +	/* Don't require 2 IOMMU TLB entries, if one is sufficient */
>>>>>> +	if (use_dma_iommu(sq->pdev) && skb->truesize <= PAGE_SIZE)
>>>>>> +		skb_linearize(skb);
>>>>> 1. What's with the direct DMA? I believe it would benefit, too?
>>>>
>>>>
>>>> Removing the use_dma_iommu check is fine with us (s390). It is just a proposal to reduce the impact.
>>>> Any opinions from the NVidia people?
>>>>
>>> Agreed.
>>>
>>>>
>>>>> 2. Why truesize, not something like
>>>>>
>>>>> if (skb->len <= some_sane_value_maybe_1k)
>>>>
>>>>
>>>> With (skb->truesize <= PAGE_SIZE) the whole "head" buffer fits into 1 page.
>>>> When we set the threshold at a smaller value, skb->len makes more sense.
>>>>
>>>>
>>>>>
>>>>> 3. As Eric mentioned, PAGE_SIZE can be up to 256 KB, so I don't
>>>>> think it's a good idea to rely on it.
>>>>> Some test-based hardcode would be enough (i.e. threshold on which
>>>>> DMA mapping starts performing better).
>>>>
>>>>
>>>> A threshold of 4k is absolutely fine with us (s390).
>>>> A threshold of 1k would definitely improve our situation and bring back the performance for some important scenarios.
>>>>
>>>>
>>>> NVidia people, do you have any opinion on a good threshold?
>>>>
>
> On 09.12.24 12:36, Tariq Toukan wrote:
>> Hi,
>>
>> Many approaches in the past few years have gone in the opposite direction, trying to avoid copies ("zero-copy").
>>
>> In many cases, copying up to PAGE_SIZE means copying everything.
>> At high NIC speeds this is not realistic.
>>
>> Anyway, based on past experience, the threshold should not exceed the "max header size" (128/256B).
>
>>> 1KB is still too large. As Tariq mentioned, the threshold should not
>>> exceed 128/256B. I am currently testing this with 256B on x86. So far no
>>> regressions but I need to play with it more.
I checked on an x86 system with CX7 and we seem to get ~4% degradation
when using this approach. The patch was modified a bit according to
previous discussions (diff at end of mail).
Here's how I tested:
- uperf client side has many queues.
- uperf server side has a single queue with interrupts pinned to a single
CPU. This is to better isolate CPU behaviour. The idea is to have the
CPU on the server side saturated or close to saturation.
- Used the uperf 1B read x 1B write scenario with 25, 50 and 100 threads
(profile attached).
Both the on-cpu and off-cpu cases were checked.
- Code change was done only on server side.
The results:
```
Case                                       Throughput   Operations
----------------------------------------------------------------------
app cpu == irq cpu, nthreads= 25, baseline 3.86Mb/s 30036552
app cpu == irq cpu, nthreads= 25, skb_linear 3.52Mb/s 26969315
app cpu == irq cpu, nthreads= 50, baseline 4.26Mb/s 33122458
app cpu == irq cpu, nthreads= 50, skb_linear 4.06Mb/s 31606290
app cpu == irq cpu, nthreads=100, baseline 4.08Mb/s 31775434
app cpu == irq cpu, nthreads=100, skb_linear 3.86Mb/s 30105582
app cpu != irq cpu, nthreads= 25, baseline 3.57Mb/s 27785488
app cpu != irq cpu, nthreads= 25, skb_linear 3.56Mb/s 27730133
app cpu != irq cpu, nthreads= 50, baseline 3.97Mb/s 30876264
app cpu != irq cpu, nthreads= 50, skb_linear 3.82Mb/s 29737654
app cpu != irq cpu, nthreads=100, baseline 4.06Mb/s 31621140
app cpu != irq cpu, nthreads=100, skb_linear 3.90Mb/s 30364382
```
So not encouraging. I restricted the tests to 1B payloads, as I
expected even worse results for larger payloads.
>>
>> On different setups, a copybreak of 192 or 256 bytes was usually the
>> most efficient as well.
>>
>>>
>
>
> Thank you very much, everybody, for looking into this.
>
> Unfortunately we are seeing these performance regressions with ConnectX-6 cards on s390,
> and with this patch we get up to 12% more transactions/sec even for 1k messages.
>
> As we're always using an IOMMU and are operating with large multi-socket systems,
> DMA costs far outweigh the costs of small to medium memcpy()s on our system.
> We realize that the recommendation is to just run without IOMMU when performance is important,
> but this is not an option in our environment.
>
> I understand that the simple patch of calling skb_linearize() in mlx5 for small messages is not a
> strategic direction; it is more a simple mitigation of our problem.
>
> Whether it should be restricted to use_dma_iommu() or not is up to you and your measurements.
> We could even restrict it to S390 arch, if you want.
>
Maybe Tariq can comment on this.
> A threshold of 256B would cover some amount of our typical request-response workloads
> (think database queries/updates), but we would prefer a higher number (e.g. 1k or 2k).
> Could we maybe define an architecture-dependent threshold?
>
> My preferred scenario for the next steps would be the following:
> 1) It would be great if we could get a simple mitigation patch upstream that the distros could
> easily backport. This would avoid our customers experiencing performance regression when they
> upgrade their distro versions (e.g. from RHEL8 to RHEL9 or RHEL10).
>
Stupid question on my part: why can't this patch be taken as a distro
patch for s390 and carried across releases? This way the kernel
upgrade pain would be avoided.
> 2) Then we could work on a more invasive solution (see proposals by Eric, Dragos/Saeed).
> This would then replace the simple mitigation patch upstream, and future releases would contain
> that solution. If somebody else wants to work on this improved version, fine with me; otherwise
> I could give it a try.
>
For the inline WQE solution I don't think we have a large amount of
space to copy so much data into. Eric's side buffer suggestion would
be even more invasive and would touch many more code paths...
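
To put rough numbers on the inline space constraint, here is a small
back-of-the-envelope calculation. Illustrative only, not driver code:
the 16B/64B units mirror the driver's MLX5_SEND_WQE_DS and
MLX5_SEND_WQE_BB sizes, while the two-segment ctrl+eth overhead and
the 16-WQEBB-per-WQE cap are assumptions made for the sketch.

```
#include <stdio.h>

#define WQE_DS 16 /* bytes per data segment (cf. MLX5_SEND_WQE_DS) */
#define WQE_BB 64 /* bytes per basic block  (cf. MLX5_SEND_WQE_BB) */

/* Rough count of 64B basic blocks needed to inline `len` payload
 * bytes, assuming one ctrl and one eth segment of 16B each in front
 * of the inline data.
 */
static unsigned int inline_wqebbs(unsigned int len)
{
	unsigned int ds = 2 + (len + WQE_DS - 1) / WQE_DS;

	return (ds * WQE_DS + WQE_BB - 1) / WQE_BB;
}

int main(void)
{
	unsigned int len;

	for (len = 64; len <= 1024; len *= 2)
		printf("inline %4u bytes -> %u WQEBBs\n",
		       len, inline_wqebbs(len));
	return 0;
}
```

By this estimate a 1k inline copy already needs 17 basic blocks, more
than a single WQE can carry if the usual 16-WQEBB limit applies, so
only small thresholds look realistic for the inline approach.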
> What do you think of this approach?
>
>
Sounds tricky. Let's see what Tariq has to say.

Thanks,
Dragos

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index f8c7912abe0e..cc947daa538b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -269,6 +269,9 @@ static void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb,
 {
 	struct mlx5e_sq_stats *stats = sq->stats;
 
+	if (skb->len < 256)
+		skb_linearize(skb);
+
 	if (skb_is_gso(skb)) {
 		int hopbyhop;
 		u16 ihs = mlx5e_tx_get_gso_ihs(sq, skb, &hopbyhop);
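
For completeness, the gated variant discussed above (copy only behind
an IOMMU, with the cutoff in a single define so a per-architecture
override stays trivial) would look roughly like the fragment below.
MLX5E_TX_LINEARIZE_THRESH is a made-up name for this sketch, not an
existing driver constant.

```
#define MLX5E_TX_LINEARIZE_THRESH 256 /* hypothetical, for the sketch */

	/* Only pay the copy when the device really maps through an
	 * IOMMU. skb_linearize() can fail under memory pressure; the
	 * skb is then simply sent non-linear, as before.
	 */
	if (use_dma_iommu(sq->pdev) && skb->len < MLX5E_TX_LINEARIZE_THRESH)
		skb_linearize(skb);
```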
[Attachment: "rr1c-1x1.xml" (text/xml, 522 bytes)]