Message-ID: <886c690b-cc35-39a0-8397-834e70fb329b@linux.ibm.com>
Date: Mon, 26 Sep 2022 12:06:59 +0200
From: Alexandra Winter <wintera@...ux.ibm.com>
To: Saeed Mahameed <saeedm@...dia.com>
Cc: David Miller <davem@...emloft.net>,
Jakub Kicinski <kuba@...nel.org>,
Niklas Schnelle <schnelle@...ux.ibm.com>,
netdev <netdev@...r.kernel.org>, linux-s390@...r.kernel.org,
Heiko Carstens <hca@...ux.ibm.com>,
Christian Borntraeger <borntraeger@...ux.ibm.com>,
Eric Dumazet <edumazet@...gle.com>
Subject: Re: [RFC net] net/mlx5: Fix performance regression for
request-response workloads
On 08.09.22 14:41, Eric Dumazet wrote:
> On Thu, Sep 8, 2022 at 2:40 AM Christian Borntraeger
> <borntraeger@...ux.ibm.com> wrote:
>>
>> Am 07.09.22 um 18:06 schrieb Eric Dumazet:
>>> On Wed, Sep 7, 2022 at 5:26 AM Alexandra Winter <wintera@...ux.ibm.com> wrote:
>>>>
>>>> Since linear payload was removed even for single small messages,
>>>> an additional page is required and we are measuring performance impact.
>>>>
>>>> 3613b3dbd1ad ("tcp: prepare skbs for better sack shifting")
>>>> explicitly allowed "payload in skb->head for first skb put in the queue,
>>>> to not impact RPC workloads."
>>>> 472c2e07eef0 ("tcp: add one skb cache for tx")
>>>> made that obsolete and removed it.
>>>> When
>>>> d8b81175e412 ("tcp: remove sk_{tr}x_skb_cache")
>>>> reverted it, this piece was not reverted and not added back in.
>>>>
>>>> When running uperf with a request-response pattern with 1k payload
>>>> and 250 connections parallel, we measure 13% difference in throughput
>>>> for our PCI based network interfaces since 472c2e07eef0.
>>>> (our IO MMU is sensitive to the number of mapped pages)
>>>
>>>
>>>
>>>>
>>>> Could you please consider allowing linear payload for the first
>>>> skb in queue again? A patch proposal is appended below.
>>>
>>> No.
>>>
>>> Please add a work around in your driver.
>>>
>>> You can increase throughput by 20% by premapping a coherent piece of
>>> memory in which
>>> you can copy small skbs (skb->head included)
>>>
>>> Something like 256 bytes per slot in the TX ring.
>>>
>>
>> FWIW this regression was with the standard mellanox driver (nothing s390 specific).
>
> I did not claim this was s390 specific.
>
> Only IOMMU mode.
>
> I would rather not add back something which makes TCP stack slower
> (more tests in fast path)
> for the majority of us _not_ using IOMMU.
>
> In our own tests, this trick of using linear skbs was only helping
> benchmarks, not real workloads.
>
> Many drivers have to map skb->head a second time if they contain TCP payload,
> thus adding yet another corner case in their fast path.
>
> - Typical RPC workloads are playing with TCP_NODELAY
> - Typical bulk flows never have empty write queues...
>
> Really, I do not want this optimization back, this is not worth it.
>
> Again, a driver knows better if it is using IOMMU and if pathological
> layouts can be optimized
> to non SG ones, and using a pre-dma-map zone will also benefit pure
> TCP ACK packets (which do not have any payload)
>
> Here is the changelog of a patch I did for our GQ NIC (not yet
> upstreamed, but will be soon)
>
[...]
Saeed,
As discussed at LPC, could you please consider adding a workaround to the
Mellanox driver to use non-SG (linear) skbs for small messages? As mentioned
above, we are seeing a 13% throughput degradation when 2 pages need to be
mapped instead of 1.
While Eric's ideas sound very promising, just using non-SG skbs in these cases
should be enough to mitigate the performance regression we see.
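For illustration only, here is a rough userspace sketch of the kind of
per-packet decision discussed in this thread. It is not mlx5 code and does not
use any kernel API: struct pkt, struct tx_slot, tx_enqueue() and the
2048-byte TX_LINEAR_MAX threshold are all made up for the example. Only the
256 bytes per TX ring slot comes from Eric's suggestion quoted above. The
sketch merely counts how many per-packet mappings each strategy would need.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TX_RING_SLOTS  8     /* hypothetical ring size */
#define TX_COPY_BYTES  256   /* per-slot copy area, as suggested in the thread */
#define TX_LINEAR_MAX  2048  /* hypothetical "small message" threshold */

/* Simplified stand-in for an skb: linear head plus at most one page fragment. */
struct pkt {
    const unsigned char *head;
    size_t head_len;
    const unsigned char *frag;   /* may be NULL when frag_len == 0 */
    size_t frag_len;
};

enum tx_mode { TX_COPY_PREMAPPED, TX_LINEAR_ONE_MAP, TX_SG };

struct tx_slot {
    unsigned char copy[TX_COPY_BYTES]; /* stands in for premapped coherent memory */
    size_t copy_len;
    enum tx_mode mode;
    unsigned int dma_maps;             /* per-packet mappings this packet needed */
};

struct tx_ring {
    struct tx_slot slots[TX_RING_SLOTS];
    unsigned int next;
};

/* Pick a transmit strategy for one packet; returns the number of mappings. */
static unsigned int tx_enqueue(struct tx_ring *ring, const struct pkt *p)
{
    struct tx_slot *slot = &ring->slots[ring->next];
    size_t total = p->head_len + p->frag_len;

    assert(p->head != NULL);
    ring->next = (ring->next + 1) % TX_RING_SLOTS;
    slot->copy_len = 0;

    if (total <= TX_COPY_BYTES) {
        /* Tiny packet (e.g. a pure ACK): copy it into the slot, map nothing. */
        memcpy(slot->copy, p->head, p->head_len);
        if (p->frag_len)
            memcpy(slot->copy + p->head_len, p->frag, p->frag_len);
        slot->copy_len = total;
        slot->mode = TX_COPY_PREMAPPED;
        slot->dma_maps = 0;
    } else if (total <= TX_LINEAR_MAX) {
        /* Small message: a driver would send it as one linear (non-SG)
         * buffer, so only one mapping is needed. The copy itself is not
         * modeled here. */
        slot->mode = TX_LINEAR_ONE_MAP;
        slot->dma_maps = 1;
    } else {
        /* Large packet: scatter-gather, one mapping for the head and one
         * for the fragment if present. */
        slot->mode = TX_SG;
        slot->dma_maps = 1 + (p->frag_len ? 1 : 0);
    }
    return slot->dma_maps;
}
```

With these made-up thresholds, a 1k request-response message with its headers
lands in the middle branch and needs one mapping instead of two, which is the
case we care about.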
Thank you in advance.
Alexandra