Message-ID: <b1148fab-ecf3-46c1-9039-597cc80f3d28@nvidia.com>
Date: Tue, 23 Jul 2024 12:52:52 +0300
From: Carolina Jubran <cjubran@...dia.com>
To: Dragos Tatulea <dtatulea@...dia.com>, Tariq Toukan <tariqt@...dia.com>,
 "daniel@...earbox.net" <daniel@...earbox.net>,
 "sdobron@...hat.com" <sdobron@...hat.com>, "hawk@...nel.org"
 <hawk@...nel.org>, "mianosebastiano@...il.com" <mianosebastiano@...il.com>
Cc: "toke@...hat.com" <toke@...hat.com>, "pabeni@...hat.com"
 <pabeni@...hat.com>, "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
 "edumazet@...gle.com" <edumazet@...gle.com>,
 Saeed Mahameed <saeedm@...dia.com>, "bpf@...r.kernel.org"
 <bpf@...r.kernel.org>, "kuba@...nel.org" <kuba@...nel.org>
Subject: Re: XDP Performance Regression in recent kernel versions



On 22/07/2024 12:26, Dragos Tatulea wrote:
> On Sun, 2024-06-30 at 14:43 +0300, Tariq Toukan wrote:
>>
>> On 21/06/2024 15:35, Samuel Dobron wrote:
>>> Hey all,
>>>
>>> Yeah, we do tests for ELN kernels [1] on a regular basis. Since
>>> ~January of this year.
>>>
>>> As already mentioned, mlx5 is the only driver affected by this regression.
>>> Unfortunately, I think Jesper is actually hitting 2 regressions we noticed,
>>> the one already mentioned by Toke, another one [0] has been reported
>>> in early February.
>>> Btw. issue mentioned by Toke has been moved to Jira, see [5].
>>>
>>> Not sure all of you are able to see the content of [0], Jira says it's
>>> RH-confidential.
>>> So, I am not sure how much I can share without being fired :D. Anyway,
>>> affected kernels have been released a while ago, so anyone can find it
>>> on its own.
>>> Basically, we detected 5% regression on XDP_DROP+mlx5 (currently, we
>>> don't have data for any other XDP mode) in kernel-5.14 compared to
>>> previous builds.
>>>
>>>   From tests history, I can see (most likely) the same improvement
>>> on 6.10rc2 (from 15Mpps to 17-18Mpps), so I'd say 20% drop has been
>>> (partially) fixed?
>>>
>>> For earlier 6.10. kernels we don't have data due to [3] (there is regression on
>>> XDP_DROP as well, but I believe it's turbo-boost issue, as I mentioned
>>> in issue).
>>> So if you want to run tests on 6.10. please see [3].
>>>
>>> Summary XDP_DROP+mlx5@25G:
>>> kernel       pps          note
>>> <5.14        20.5M        baseline
>>> >=5.14       19M          [0]
>>> <6.4         19-20M       baseline for ELN kernels
>>> >=6.4        15M          [4 and 5] (mentioned by Toke)
>>
>> + @Dragos
>>
>> That's about when we added several changes to the RX datapath.
>> Most relevant are:
>> - Fully removing the in-driver RX page-cache.
>> - Refactoring to support XDP multi-buffer.
>>
>> We tested XDP performance before submission, I don't recall we noticed
>> such a degradation.
> 
> Adding Carolina to post her analysis on this.

Hey everyone,

After investigating the issue, it seems the performance degradation is 
linked to the commit "x86/bugs: Report Intel retbleed vulnerability"
(6ad0ad2bf8a67).

This commit addresses the Intel retbleed vulnerability and introduces
mitigation measures that impact performance, especially the Spectre v2
mitigations.


Disabling these mitigations in the kernel arguments
(spectre_v2=off ibrs=off) resolved the degradation in my tests.

Could you try adding the mentioned parameters to your kernel arguments
and check if you still see the degradation?
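
To confirm which mitigations are actually active before and after the
reboot, something like the rough sketch below might help. It assumes the
usual sysfs directory /sys/devices/system/cpu/vulnerabilities and
/proc/cmdline are readable. The helper names (parse_cmdline, read_status,
report) are made up for illustration, and showing "retbleed" and
"mitigations" alongside the two parameters above is my assumption about
what is worth checking.

```python
from pathlib import Path

# Assumed location of the per-vulnerability status files on Linux.
VULN_DIR = Path("/sys/devices/system/cpu/vulnerabilities")


def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into {name: value}.

    Bare flags map to an empty string. Quoted values containing
    spaces are not handled; this is only a sketch.
    """
    params = {}
    for token in cmdline.split():
        name, _, value = token.partition("=")
        params[name] = value
    return params


def read_status(name: str, base: Path = VULN_DIR) -> str:
    """Return the mitigation status string, or 'unavailable' if unreadable."""
    try:
        return (base / name).read_text().strip()
    except OSError:
        return "unavailable"


def report(cmdline_path: str = "/proc/cmdline") -> None:
    """Print the relevant boot parameters and the reported mitigation state."""
    try:
        cmdline = Path(cmdline_path).read_text()
    except OSError:
        cmdline = ""
    params = parse_cmdline(cmdline)
    # Parameters from this thread, plus two assumed to be of interest.
    for key in ("spectre_v2", "ibrs", "retbleed", "mitigations"):
        print(f"cmdline {key}: {params.get(key, '(not set)')}")
    for entry in ("spectre_v2", "retbleed"):
        print(f"sysfs {entry}: {read_status(entry)}")


if __name__ == "__main__":
    report()
```

Running it once on the regressed kernel and once with the parameters
added should show whether the mitigation state really changed between the
two measurements.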

Thank you,

Carolina.

> 
>>
>> I'll check with Dragos as he probably has these reports.
>>
> We only noticed a 6% degradation for XDP_DROP.
> 
> https://lore.kernel.org/netdev/b6fcfa8b-c2b3-8a92-fb6e-0760d5f6f5ff@redhat.com/T/
> 
>>> >=6.10       ???          [3]
>>> >=6.10rc2    17M-18M
>>>
>>>
>>>> It looks like this is known since March, was this ever reported to Nvidia back
>>>> then? :/
>>>
>>> Not sure if that's a question for me. I was told that filing an issue in
>>> Bugzilla/Jira is where
>>> our competence ends. Who is supposed to report it to them?
>>>
>>>> Given XDP is in the critical path for many in production, we should think about
>>>> regular performance reporting for the different vendors for each released kernel,
>>>> similar to here [0].
>>>
>>> I think this might be the part of upstream kernel testing with LNST?
>>> Maybe Jesper
>>> knows more about that? Until then, I think, I can let you know about
>>> new regressions we catch.
>>>
>>> Thanks,
>>> Sam.
>>>
>>> [0] https://issues.redhat.com/browse/RHEL-24054
>>> [1] https://koji.fedoraproject.org/koji/search?terms=kernel-%5Cd.*eln*&type=build&match=regexp
>>> [2] https://koji.fedoraproject.org/koji/buildinfo?buildID=2469107
>>> [3] https://bugzilla.redhat.com/show_bug.cgi?id=2282969
>>> [4] https://bugzilla.redhat.com/show_bug.cgi?id=2270408
>>> [5] https://issues.redhat.com/browse/RHEL-24054
>>>
> 

