netdev - Re: Potential regression/bug in net/mlx5 driver

Message-ID: <ZDnRkVNYlHk4QVqy@x130>
Date:   Fri, 14 Apr 2023 15:20:01 -0700
From:   Saeed Mahameed <saeedm@...dia.com>
To:     Jakub Kicinski <kuba@...nel.org>
Cc:     Paul Moore <paul@...l-moore.com>,
        Leon Romanovsky <leon@...nel.org>,
        Linux regressions mailing list <regressions@...ts.linux.dev>,
        Saeed Mahameed <saeed@...nel.org>,
        Shay Drory <shayd@...dia.com>, netdev@...r.kernel.org,
        selinux@...r.kernel.org, Tariq Toukan <tariqt@...dia.com>
Subject: Re: Potential regression/bug in net/mlx5 driver

On 13 Apr 20:26, Jakub Kicinski wrote:
>On Thu, 13 Apr 2023 20:03:18 -0700 Saeed Mahameed wrote:
>> On 13 Apr 15:51, Jakub Kicinski wrote:
>> >On Thu, 13 Apr 2023 15:34:21 -0700 Saeed Mahameed wrote:
>> >> But this management connection function has the same architecture as other
>> >> "Normal" mlx5 functions, from the driver pov. The same way mlx5
>> >> doesn't care if the underlying function is CX4/5/6 we don't care if it was
>> >> a "management function".
>> >
>> >Yes, and that's why every single IPU implementation thinks that it's
>> >a great idea. Because it's easy to implement. But what is it for
>> >architecturally? Running what is effectively FW commands over TCP?
>>
>> Where did you get this idea from? maybe we got the name wrong,
>> "management PF" is simply a minimalistic netdev PF to have eth connection
>> with the on board BMC ..
>>
>> I agree that the name "management PF" sounds scary, but it is not a control
>> function as you think, not at all. As the original commit message states:
>> "loopback PF designed for communication with BMC".
>
>Can you draw a small diagram with the bare metal guest, IPU, and BMC?
>What's talking to what? And what packets are exchanged?
>

Yes, working on that...

>> >> But let's discuss what's wrong with it, and what are your thoughts ?
>> >> the fact that it breaks a 6 years OLD FW, doesn't make it so horrible.
>> >
>> >Right, the breakage is a separate topic.
>> >
>> >You say 6 years old but the part is EOL, right? The part is old and
>> >stable, AFAIU the breakage stems from development work for parts which
>> >are 3 or so generations newer.
>>
>> Officially we test only 3 GA FWs back. The fact that mlx5 is a generic CX
>> driver makes it really hard to test all the possible combinations, so we
>> need to be strict with how far back we want to officially support and test old
>> generations.
>
>Would you be able to pull the datapoints for what 3 GA FWs means
>in case of CX4? Release number and date when it was released?
>

https://network.nvidia.com/files/related-docs/eol/LCR-000821.pdf

Since CX4 went EOL last year, this info is going to be hard to find, but
let me check my email archive...

12.28.2006   27-Sep-2020 - recommended version
12.26.xxxx   12-Dec-2019
12.24.1000   02-Dec-2018


>I understand the challenge of backward compat with a multi-gen
>driver. It's a trade off.
>
>> >The question is who's supposed to be paying the price of mlx5 being
>> >used for old and new parts? What is fair to expect from the user
>> >when the FW Paul has presumably works just fine for him?
>> >
>> Upgrade FW when possible, it is always easier than upgrading the kernel.
>> Anyway, this was a very rare FW/Arch bug. We should've exposed an
>> explicit cap for this new type of PF when we had the chance, now it's too
>> late since a proper fix will require FW and Driver upgrades and breaking
>> the current solution we have over other OSes as well.
>>
>> Yes I can craft an if condition to explicitly check for chip id and FW
>> version for this corner case, which has no precedent in mlx5, but I prefer
>> to ask to upgrade FW first, and if that's an acceptable solution, I would
>> like to keep the mlx5 clean and device agnostic as much as possible.
>
>IMO you either need a fully fleshed out FW update story, with advanced
>warnings for a few releases, distributing the FW via linux-firmware or
>fwupdmgr or such.  Or deal with the corner cases in the driver :(
>

Completely agree, I will start an internal discussion...

>We can get Paul to update, sure, but if he noticed so quickly the
>question remains how many people out in the wild will get affected
>and not know what the cause is?

Right, I will make sure this is addressed and will let you know how we
handle it. I will try to post a patch early next cycle, but I need to
work with Arch and release managers on this, so it will take a couple of
weeks to formalize a proper solution.
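
For illustration only, the explicit chip id / FW version check mentioned
above could look roughly like the sketch below. This is not mlx5 code:
struct fw_ver, fw_ver_cmp() and needs_old_fw_quirk() are hypothetical
names, and the affected chip id and the first fixed FW version are passed
in as parameters because neither is settled in this thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical firmware version triple in major.minor.sub form. */
struct fw_ver {
	uint16_t major;
	uint16_t minor;
	uint16_t sub;
};

/* Returns <0, 0 or >0 if a is older than, equal to or newer than b. */
static int fw_ver_cmp(struct fw_ver a, struct fw_ver b)
{
	if (a.major != b.major)
		return a.major < b.major ? -1 : 1;
	if (a.minor != b.minor)
		return a.minor < b.minor ? -1 : 1;
	if (a.sub != b.sub)
		return a.sub < b.sub ? -1 : 1;
	return 0;
}

/*
 * Hypothetical quirk check: true only when the device is the affected
 * chip AND its running FW is older than the first FW with the fix.
 * Both affected_chip_id and min_fixed are assumptions supplied by the
 * caller, not values taken from the driver.
 */
static bool needs_old_fw_quirk(uint16_t chip_id, uint16_t affected_chip_id,
			       struct fw_ver running, struct fw_ver min_fixed)
{
	return chip_id == affected_chip_id &&
	       fw_ver_cmp(running, min_fixed) < 0;
}
```

A check of this shape keeps the special case confined to one helper, at
the cost of making the driver aware of a specific chip and FW release,
which is exactly the device-specific knowledge I would rather avoid.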