Message-ID: <89c67aa3-4b84-45c1-9e7f-a608957d5aeb@nvidia.com>
Date: Wed, 11 Dec 2024 18:28:19 +0100
From: Dragos Tatulea <dtatulea@...dia.com>
To: Alexandra Winter <wintera@...ux.ibm.com>,
Alexander Lobakin <aleksander.lobakin@...el.com>
Cc: Rahul Rameshbabu <rrameshbabu@...dia.com>,
Saeed Mahameed <saeedm@...dia.com>, Tariq Toukan <tariqt@...dia.com>,
Leon Romanovsky <leon@...nel.org>, David Miller <davem@...emloft.net>,
Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
Eric Dumazet <edumazet@...gle.com>, Andrew Lunn <andrew+netdev@...n.ch>,
Nils Hoppmann <niho@...ux.ibm.com>, netdev@...r.kernel.org,
linux-s390@...r.kernel.org, Heiko Carstens <hca@...ux.ibm.com>,
Vasily Gorbik <gor@...ux.ibm.com>, Alexander Gordeev
<agordeev@...ux.ibm.com>, Christian Borntraeger <borntraeger@...ux.ibm.com>,
Sven Schnelle <svens@...ux.ibm.com>,
Thorsten Winkler <twinkler@...ux.ibm.com>, Simon Horman <horms@...nel.org>,
Niklas Schnelle <schnelle@...ux.ibm.com>
Subject: Re: [PATCH net-next] net/mlx5e: Transmit small messages in linear skb

Hi Alexandra,

On 11.12.24 14:35, Alexandra Winter wrote:
>
>
> On 10.12.24 14:54, Alexander Lobakin wrote:
>> From: Dragos Tatulea <dtatulea@...dia.com>
>> Date: Tue, 10 Dec 2024 12:44:04 +0100
>>
>>>
>>>
>>> On 06.12.24 16:20, Alexandra Winter wrote:
>>>>
>>>>
>>>> On 04.12.24 15:32, Alexander Lobakin wrote:
>>>>>> @@ -269,6 +270,10 @@ static void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb,
>>>>>>  {
>>>>>>  	struct mlx5e_sq_stats *stats = sq->stats;
>>>>>>  
>>>>>> +	/* Don't require 2 IOMMU TLB entries, if one is sufficient */
>>>>>> +	if (use_dma_iommu(sq->pdev) && skb->truesize <= PAGE_SIZE)
>>>>>> +		skb_linearize(skb);
>>>>> 1. What's with the direct DMA? I believe it would benefit, too?
>>>>
>>>>
>>>> Removing the use_dma_iommu check is fine with us (s390). It is just a proposal to reduce the impact.
>>>> Any opinions from the NVidia people?
>>>>
>>> Agreed.
>>>
>>>>
>>>>> 2. Why truesize, not something like
>>>>>
>>>>> if (skb->len <= some_sane_value_maybe_1k)
>>>>
>>>>
>>>> With (skb->truesize <= PAGE_SIZE) the whole "head" buffer fits into 1 page.
>>>> When we set the threshold at a smaller value, skb->len makes more sense.
>>>>
>>>>
>>>>>
>>>>> 3. As Eric mentioned, PAGE_SIZE can be up to 256 KB, so I don't
>>>>> think it's a good idea to rely on it.
>>>>> Some test-based hardcode would be enough (i.e. threshold on which
>>>>> DMA mapping starts performing better).
>>>>
>>>>
>>>> A threshold of 4k is absolutely fine with us (s390).
>>>> A threshold of 1k would definitely improve our situation and bring back the performance for some important scenarios.
>>>>
>>>>
>>>> NVidia people, do you have any opinion on a good threshold?
>>>>
>
> On 09.12.24 12:36, Tariq Toukan wrote:
>> Hi,
>>
>> Many approaches in the past few years have gone in the opposite direction, trying to avoid copies ("zero-copy").
>>
>> In many cases, copying up to PAGE_SIZE means copying everything.
>> At high NIC speeds this is not realistic.
>>
>> Anyway, based on past experience, the threshold should not exceed the "max header size" (128/256B).
>
>>> 1KB is still too large. As Tariq mentioned, the threshold should not
>>> exceed 128/256B. I am currently testing this with 256B on x86. So far no
>>> regressions but I need to play with it more.
I checked on an x86 system with CX7 and we seem to get ~4% degradation
when using this approach. The patch was modified a bit according to
previous discussions (diff at end of mail).
Here's how I tested:
- uperf client side has many queues.
- uperf server side has a single queue with interrupts pinned to a single
CPU. This is to better isolate CPU behaviour. The idea is to have the
CPU on the server side saturated or close to saturation.
- Used the uperf 1B read x 1B write scenario with 25, 50 and 100 threads
(profile attached).
Both the on-cpu and off-cpu cases were checked.
- Code change was done only on server side.
The results:
```
Case                                       Throughput   Operations
----------------------------------------------------------------------
app cpu == irq cpu, nthreads= 25, baseline 3.86Mb/s 30036552
app cpu == irq cpu, nthreads= 25, skb_linear 3.52Mb/s 26969315
app cpu == irq cpu, nthreads= 50, baseline 4.26Mb/s 33122458
app cpu == irq cpu, nthreads= 50, skb_linear 4.06Mb/s 31606290
app cpu == irq cpu, nthreads=100, baseline 4.08Mb/s 31775434
app cpu == irq cpu, nthreads=100, skb_linear 3.86Mb/s 30105582
app cpu != irq cpu, nthreads= 25, baseline 3.57Mb/s 27785488
app cpu != irq cpu, nthreads= 25, skb_linear 3.56Mb/s 27730133
app cpu != irq cpu, nthreads= 50, baseline 3.97Mb/s 30876264
app cpu != irq cpu, nthreads= 50, skb_linear 3.82Mb/s 29737654
app cpu != irq cpu, nthreads=100, baseline 4.06Mb/s 31621140
app cpu != irq cpu, nthreads=100, skb_linear 3.90Mb/s 30364382
```
So not encouraging. I restricted the tests to 1B payloads, as I
expected even worse results for larger payloads.
>>
>> On different setups, a copybreak of 192 or 256 bytes was usually the
>> most efficient as well.
>>
>>>
>
>
> Thank you very much, everybody, for looking into this.
>
> Unfortunately we are seeing these performance regressions with ConnectX-6 cards on s390,
> and with this patch we get up to 12% more transactions/sec even for 1k messages.
>
> As we're always using an IOMMU and are operating with large multi-socket systems,
> DMA costs far outweigh the costs of small to medium memcpy()s on our system.
> We realize that the recommendation is to just run without IOMMU when performance is important,
> but this is not an option in our environment.
>
> I understand that the simple patch of calling skb_linearize() in mlx5 for small messages is not a
> strategic direction; it is more a simple mitigation of our problem.
>
> Whether it should be restricted to use_dma_iommu() or not is up to you and your measurements.
> We could even restrict it to S390 arch, if you want.
>
Maybe Tariq can comment on this.
> A threshold of 256B would cover some amount of our typical request-response workloads
> (think database queries/updates), but we would prefer a higher number (e.g. 1k or 2k).
> Could we maybe define an architecture-dependent threshold?
>
> My preferred scenario for the next steps would be the following:
> 1) It would be great if we could get a simple mitigation patch upstream that the distros could
> easily backport. This would avoid our customers experiencing performance regression when they
> upgrade their distro versions (e.g. from RHEL8 to RHEL9 or RHEL10).
>
Stupid question on my part: why can't this patch be taken as a distro
patch for s390 and carried across releases? This way the kernel
upgrade pain would be avoided.
> 2) Then we could work on a more invasive solution (see proposals by Eric, Dragos/Saeed).
> This would then replace the simple mitigation patch upstream, and future releases would contain
> that solution. If somebody else wants to work on this improved version, fine with me; otherwise
> I could give it a try.
>
For the inline WQE solution I don't think we have a large amount of
space to copy so much data into. Eric's side buffer suggestion would
be even more invasive and would touch many more code paths...
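
To put rough numbers on the inline space constraint, here is a small
back-of-the-envelope calculation. Illustrative only, not driver code:
the 16B/64B units mirror the driver's MLX5_SEND_WQE_DS and
MLX5_SEND_WQE_BB sizes, while the two-segment ctrl+eth overhead and
the 16-WQEBB-per-WQE cap are assumptions made for the sketch.

```
#include <stdio.h>

#define WQE_DS 16 /* bytes per data segment (cf. MLX5_SEND_WQE_DS) */
#define WQE_BB 64 /* bytes per basic block  (cf. MLX5_SEND_WQE_BB) */

/* Rough count of 64B basic blocks needed to inline `len` payload
 * bytes, assuming one ctrl and one eth segment of 16B each in front
 * of the inline data.
 */
static unsigned int inline_wqebbs(unsigned int len)
{
	unsigned int ds = 2 + (len + WQE_DS - 1) / WQE_DS;

	return (ds * WQE_DS + WQE_BB - 1) / WQE_BB;
}

int main(void)
{
	unsigned int len;

	for (len = 64; len <= 1024; len *= 2)
		printf("inline %4u bytes -> %u WQEBBs\n",
		       len, inline_wqebbs(len));
	return 0;
}
```

By this estimate a 1k inline copy already needs 17 basic blocks, more
than a single WQE can carry if the usual 16-WQEBB limit applies, so
only small thresholds look realistic for the inline approach.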
> What do you think of this approach?
>
>
Sounds tricky. Let's see what Tariq has to say.

Thanks,
Dragos

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index f8c7912abe0e..cc947daa538b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -269,6 +269,9 @@ static void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb,
 {
 	struct mlx5e_sq_stats *stats = sq->stats;
 
+	if (skb->len < 256)
+		skb_linearize(skb);
+
 	if (skb_is_gso(skb)) {
 		int hopbyhop;
 		u16 ihs = mlx5e_tx_get_gso_ihs(sq, skb, &hopbyhop);
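
For completeness, the gated variant discussed above (copy only behind
an IOMMU, with the cutoff in a single define so a per-architecture
override stays trivial) would look roughly like the fragment below.
MLX5E_TX_LINEARIZE_THRESH is a made-up name for this sketch, not an
existing driver constant.

```
#define MLX5E_TX_LINEARIZE_THRESH 256 /* hypothetical, for the sketch */

	/* Only pay the copy when the device really maps through an
	 * IOMMU. skb_linearize() can fail under memory pressure; the
	 * skb is then simply sent non-linear, as before.
	 */
	if (use_dma_iommu(sq->pdev) && skb->len < MLX5E_TX_LINEARIZE_THRESH)
		skb_linearize(skb);
```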
[Attachment: "rr1c-1x1.xml" (text/xml, 522 bytes)]