Message-ID: <20230413202631.7e3bd713@kernel.org>
Date:   Thu, 13 Apr 2023 20:26:31 -0700
From:   Jakub Kicinski <kuba@...nel.org>
To:     Saeed Mahameed <saeedm@...dia.com>
Cc:     Paul Moore <paul@...l-moore.com>,
        Leon Romanovsky <leon@...nel.org>,
        Linux regressions mailing list <regressions@...ts.linux.dev>,
        Saeed Mahameed <saeed@...nel.org>,
        Shay Drory <shayd@...dia.com>, netdev@...r.kernel.org,
        selinux@...r.kernel.org, Tariq Toukan <tariqt@...dia.com>
Subject: Re: Potential regression/bug in net/mlx5 driver

On Thu, 13 Apr 2023 20:03:18 -0700 Saeed Mahameed wrote:
> On 13 Apr 15:51, Jakub Kicinski wrote:
> >On Thu, 13 Apr 2023 15:34:21 -0700 Saeed Mahameed wrote:  
> >> But this management connection function has the same architecture as other
> >> "Normal" mlx5 functions, from the driver pov. The same way mlx5
> >> doesn't care if the underlying function is CX4/5/6 we don't care if it was
> >> a "management function".  
> >
> >Yes, and that's why every single IPU implementation thinks that it's
> >a great idea. Because it's easy to implement. But what is it for
> >architecturally? Running what is effectively FW commands over TCP?  
> 
> Where did you get this idea from? maybe we got the name wrong, 
> "management PF" is simply a minimalistic netdev PF to have eth connection
> with the on board BMC .. 
> 
> I agree that the name "management PF" sounds scary, but it is not a control
> function as you think, not at all. As the original commit message states:
> "loopback PF designed for communication with BMC".

Can you draw a small diagram with the bare metal guest, IPU, and BMC?
What's talking to what? And what packets are exchanged?

> >> But let's discuss what's wrong with it, and what are your thoughts ?
> >> the fact that it breaks a 6 years OLD FW, doesn't make it so horrible.  
> >
> >Right, the breakage is a separate topic.
> >
> >You say 6 years old but the part is EOL, right? The part is old and
> >stable, AFAIU the breakage stems from development work for parts which
> >are 3 or so generations newer.
> 
> Officially we test only 3 GA FWs back. The fact that mlx5 is a generic CX
> driver makes it really hard to test all the possible combinations, so we
> need to be strict with how far back we want to officially support and test old
> generations.

Would you be able to pull the data points for what "3 GA FWs" means
in the case of CX4? Release number and date when it was released?

I understand the challenge of backward compat with a multi-gen
driver. It's a trade-off.

> >The question is who's supposed to be paying the price of mlx5 being
> >used for old and new parts? What is fair to expect from the user
> >when the FW Paul has presumably works just fine for him?
> >  
> Upgrade FW when possible, it is always easier than upgrading the kernel.
> Anyways this was a very rare FW/Arch bug, We should've exposed an
> explicit cap for this new type of PF when we had the chance, now it's too
> late since a proper fix will require FW and Driver upgrades and breaking
> the current solution we have over other OSes as well.
>
> Yes I can craft an if condition to explicitly check for chip id and FW
> version for this corner case, which has no precedent in mlx5, but I prefer
> to ask to upgrade FW first, and if that's an acceptable solution, I would
> like to keep the mlx5 clean and device agnostic as much as possible.

IMO you either need a fully fleshed-out FW update story, with advance
warnings for a few releases, distributing the FW via linux-firmware or
fwupdmgr or such.  Or deal with the corner cases in the driver :(

We can get Paul to update, sure, but if he noticed so quickly the
question remains how many people out in the wild will get affected 
and not know what the cause is?
