[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAHC9VhTvQLa=+Ykwmr_Uhgjrc6dfi24ou=NBsACkhwZN7X4EtQ@mail.gmail.com>
Date: Wed, 29 Mar 2023 21:27:45 -0400
From: Paul Moore <paul@...l-moore.com>
To: Saeed Mahameed <saeed@...nel.org>
Cc: Shay Drory <shayd@...dia.com>, Saeed Mahameed <saeedm@...dia.com>,
netdev@...r.kernel.org, regressions@...ts.linux.dev,
selinux@...r.kernel.org
Subject: Re: Potential regression/bug in net/mlx5 driver
On Wed, Mar 29, 2023 at 6:20 PM Saeed Mahameed <saeed@...nel.org> wrote:
> On 28 Mar 19:08, Paul Moore wrote:
> >Hello all,
> >
> >Starting with the v6.3-rcX kernel releases I noticed that my
> >InfiniBand devices were no longer present under /sys/class/infiniband,
> >causing some of my automated testing to fail. It took me a while to
> >find the time to bisect the issue, but I eventually identified the
> >problematic commit:
> >
> > commit fe998a3c77b9f989a30a2a01fb00d3729a6d53a4
> > Author: Shay Drory <shayd@...dia.com>
> > Date: Wed Jun 29 11:38:21 2022 +0300
> >
> > net/mlx5: Enable management PF initialization
> >
> > Enable initialization of DPU Management PF, which is a new loopback PF
> > designed for communication with BMC.
> > For now Management PF doesn't support nor require most upper layer
> > protocols so avoid them.
> >
> > Signed-off-by: Shay Drory <shayd@...dia.com>
> > Reviewed-by: Eran Ben Elisha <eranbe@...dia.com>
> > Reviewed-by: Moshe Shemesh <moshe@...dia.com>
> > Signed-off-by: Saeed Mahameed <saeedm@...dia.com>
> >
> >I'm not a mlx5 driver expert so I can't really offer much in the way
> >of a fix, but as a quick test I did remove the
> >'mlx5_core_is_management_pf(...)' calls in mlx5/core/dev.c and
> >everything seemed to work okay on my test system (or rather the tests
> >ran without problem).
> >
> >If you need any additional information, or would like me to test a
> >patch, please let me know.
>
> Hi Paul,
>
> Our team is looking into this, the current theory is that you have an old
> FW that doesn't have the correct capabilities set.
That's very possible; I installed this card many years ago and haven't
updated the FW once. I'm happy to update the FW (do you have a
pointer/how-to?), but it might be good to identify a fix first as I'm
guessing there will be others like me ...
> Can you please provide the FW version and the ConnectX device you are
> testing ?
>
> $ devlink dev info
% devlink dev info; echo $?
0
No output and no error code. However, I do see the following in dmesg:
[ 255.251124] mlx5_core 0000:00:08.0: mlx5_fw_version_query:823:(pid
959): fw query isn't supported by the FW
... which appears to support your theory about ancient hardware.
> $ lspci -s <pci_dev> -vv
While there is only one physical card, there are two PCI devices (it's
a dual port card). I'm only copying the first device since I'm
guessing that's really all you need:
% lspci -s 00:07.0 -vv
00:07.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
Subsystem: Mellanox Technologies Device 0010
Physical Slot: 7
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR+ FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 11
Region 0: Memory at fa000000 (64-bit, prefetchable) [size=32M]
Expansion ROM at fe900000 [disabled] [size=1M]
Capabilities: [60] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s
unlimited, L1 unlimited
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
SlotPowerLimit 25W
DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
RlxdOrd- ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr-
TransPend-
LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM not supported
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 8GT/s, Width x8
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP-
LTR- 10BitTagComp+ 10BitTagReq- OBFF Not Supported,
ExtFmt- EETLPPrefix- EmergencyPowerReduction
Not Supported, EmergencyPowerReductionInit-
FRS- TPHComp- ExtTPHComp-
AtomicOpsCap: 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR-
10BitTagReq- OBFF Disabled,
AtomicOpsCtl: ReqEn-
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+
EqualizationPhase1+ EqualizationPhase2+
EqualizationPhase3+ LinkEqualizationRequest-
Retimer- 2Retimers- CrosslinkRes: unsupported
Capabilities: [48] Vital Product Data
Product Name: CX454A - ConnectX-4 QSFP28
Read-only fields:
[PN] Part number: MCX454A-FCAT
[EC] Engineering changes: AB
[SN] Serial number: MT1730X05081
[V0] Vendor specific: PCIeGen3 x8
[RV] Reserved: checksum good, 0 byte(s) reserved
End
Capabilities: [9c] MSI-X: Enable+ Count=64 Masked-
Vector table: BAR=0 offset=00002000
PBA: BAR=0 offset=00003000
Capabilities: [c0] Vendor Specific Information: Len=18 <?>
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA
PME(D0-,D1-,D2-,D3hot-,D3cold+)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Kernel driver in use: mlx5_core
Kernel modules: mlx5_core
> since boot:
> $ dmesg
% devlink dev info
% dmesg | grep mlx5
[ 4.739691] mlx5_core 0000:00:07.0: firmware version: 12.18.1000
[ 4.740134] mlx5_core 0000:00:07.0: 63.008 Gb/s available PCIe
bandwidth (8.0GT/s PCIe x8 link)
[ 7.048567] mlx5_core 0000:00:07.0: Port module event: module 0,
Cable plugged
[ 7.211879] mlx5_core 0000:00:08.0: firmware version: 12.18.1000
[ 7.212309] mlx5_core 0000:00:08.0: 63.008 Gb/s available PCIe
bandwidth (8.0GT/s PCIe x8 link)
[ 7.897218] mlx5_core 0000:00:08.0: Port module event: module 1,
Cable plugged
[ 10.875388] mlx5_core 0000:00:07.0 ibs7: renamed from ib0
[ 10.995115] mlx5_core 0000:00:08.0 ibs8: renamed from ib0
[ 181.471663] mlx5_core 0000:00:07.0: mlx5_fw_version_query:823:(pid
918): fw query isn't supported by the FW
[ 181.472286] mlx5_core 0000:00:08.0: mlx5_fw_version_query:823:(pid
918): fw query isn't supported by the FW
--
paul-moore.com
Powered by blists - more mailing lists