linux-kernel - [5.14-rc1] mlx5_core receives no interrupts with maxcpus=8

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <BYAPR21MB12703228F3E7A8B8158EB054BF129@BYAPR21MB1270.namprd21.prod.outlook.com>
Date:   Thu, 15 Jul 2021 00:38:56 +0000
From:   Dexuan Cui <decui@...rosoft.com>
To:     "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        "x86@...nel.org" <x86@...nel.org>
CC:     Haiyang Zhang <haiyangz@...rosoft.com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: [5.14-rc1] mlx5_core receives no interrupts with maxcpus=8

Hi all,
I'm seeing a strange "no MSI-X interrupt" issue with the Mellanox NIC
driver on a physical Linux host [1], if I only enable part of the CPUs.

The physical host has 104 logical processors (2 sockets, and each socket
has 26 cores with HT enabled). By default, the Mellanox driver works fine
when Linux boots up.

If I only use 1, 2, 32, 64, 96 processors by the Linux kernel parameter
"maxcpus=X" or "nr_cpus=X", everthing still works fine.

However, if the Linux host OS only uses 4, 8 or 16 processors, the
mlx5_core driver fails to load as it can not receive interrupt when
creating EQ (maxcpus=8 or 16), or the driver can load but it reports a
timeout error when I try to bring the NIC up (maxcpus=4). This issue is
a 100% repro.

For example, with "maxcpus=8", I get the below timeout error when trying
to load mlx5_core:

# modprobe mlx5_core
[ 1475.716688] mlx5_core 0000:d8:00.0: firmware version: 16.25.8352
[ 1475.722742] mlx5_core 0000:d8:00.0: 126.016 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x16 link)
[ 1475.991398] mlx5_core 0000:d8:00.0: E-Switch: Total vports 2, per vport: max uc(1024) max mc(16384)

[ 1537.020001] mlx5_core 0000:d8:00.0: mlx5_cmd_eq_recover:245:(pid 1416): Recovered 1 EQEs on cmd_eq
[ 1537.028969] mlx5_core 0000:d8:00.0: wait_func_handle_exec_timeout:1062:(pid 1416): cmd[0]: CREATE_EQ(0x301) recovered after timeout
[ 1598.460003] mlx5_core 0000:d8:00.0: mlx5_cmd_eq_recover:245:(pid 1416): Recovered 1 EQEs on cmd_eq
[ 1598.468978] mlx5_core 0000:d8:00.0: wait_func_handle_exec_timeout:1062:(pid 1416): cmd[0]: CREATE_EQ(0x301) recovered after timeout
[ 1659.900010] mlx5_core 0000:d8:00.0: mlx5_cmd_eq_recover:245:(pid 1416): Recovered 1 EQEs on cmd_eq
[ 1659.908987] mlx5_core 0000:d8:00.0: wait_func_handle_exec_timeout:1062:(pid 1416): cmd[0]: CREATE_EQ(0x301) recovered after timeout
[ 1721.340006] mlx5_core 0000:d8:00.0: mlx5_cmd_eq_recover:245:(pid 1416): Recovered 1 EQEs on cmd_eq
[ 1721.348989] mlx5_core 0000:d8:00.0: wait_func_handle_exec_timeout:1062:(pid 1416): cmd[0]: CREATE_EQ(0x301) recovered after timeout

When this happens, the mlx5_core driver is stuck with the below
call-trace, waiting for some interrupt:

# ps aux |grep modprobe
root        1416  0.0  0.0  11024  1472 ttyS0    D+   08:08   0:00 modprobe mlx5_core
root        1480  0.0  0.0   6440   736 pts/0    S+   08:15   0:00 grep --color=auto modprobe

# cat /proc/1416/stack
[<0>] cmd_exec+0x8a7/0x9b0 [mlx5_core]
[<0>] mlx5_cmd_exec+0x24/0x50 [mlx5_core]
[<0>] create_map_eq+0x2a6/0x380 [mlx5_core]
[<0>] mlx5_eq_table_create+0x504/0x710 [mlx5_core]
[<0>] mlx5_load+0x52/0x130 [mlx5_core]
[<0>] mlx5_init_one+0x1cc/0x250 [mlx5_core]
[<0>] probe_one+0x1d3/0x2a0 [mlx5_core]
[<0>] local_pci_probe+0x45/0x80
[<0>] pci_device_probe+0x10f/0x1c0
[<0>] really_probe+0x1c1/0x3b0
[<0>] __driver_probe_device+0x109/0x180
[<0>] driver_probe_device+0x23/0xa0
[<0>] __driver_attach+0xbd/0x160
[<0>] bus_for_each_dev+0x7c/0xc0
[<0>] driver_attach+0x1e/0x20
[<0>] bus_add_driver+0x152/0x1f0
[<0>] driver_register+0x74/0xd0
[<0>] __pci_register_driver+0x68/0x70
[<0>] init+0x6b/0x1000 [mlx5_core]
[<0>] do_one_initcall+0x46/0x1d0
[<0>] do_init_module+0x62/0x250
[<0>] load_module+0x2503/0x2730
[<0>] __do_sys_finit_module+0xbf/0x120
[<0>] __x64_sys_finit_module+0x1a/0x20
[<0>] do_syscall_64+0x38/0xc0
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae

To make the issue even weirder, when the issue happens (e.g. when Linux
only uses 8 processors), if I manually bring CPU #8~#31 online [2] and
then bring them offline [3], the Mellanox driver will work fine!

This is a x86-64 host. Is it possibe that the IOMMU Interrrupt Remapping
is not proprely set up with maxcpus=4, 8 and 16?

The above tests were done with the recent Linux v5.14-rc1 kernel. I also
tried Ubuntu 20.04's kernel "5.4.0-77-generic", and the Mellanox driver
exhibits exactly the same issue.

I have Linux/Windows dual-boot on this physical machine, and Windows
doesn't have the issue when I let it only use 4, 8 and 16 processors.
So this looks like somehow the issue is specific to Linux.

Can someone please shed some light on this strange issue? I'm ready
to provide more logs if needed. Thanks!

PS, the physical machine has 4 NVMe controllers and 4 Broadcom NICs,
which are not affected by maxcpus=4, 8, and 16.

[1] This is the 'lspci' output of the Mellanox NIC:
d8:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
        Subsystem: Mellanox Technologies MT27800 Family [ConnectX-5]
        Flags: bus master, fast devsel, latency 0, IRQ 33, NUMA node 1
        Memory at f8000000 (64-bit, prefetchable) [size=32M]
        Expansion ROM at fbe00000 [disabled] [size=1M]
        Capabilities: [60] Express Endpoint, MSI 00
        Capabilities: [48] Vital Product Data
        Capabilities: [9c] MSI-X: Enable+ Count=64 Masked-
        Capabilities: [c0] Vendor Specific Information: Len=18 <?>
        Capabilities: [40] Power Management version 3
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
        Capabilities: [180] Single Root I/O Virtualization (SR-IOV)
        Capabilities: [1c0] Secondary PCI Express
        Kernel driver in use: mlx5_core
        Kernel modules: mlx5_core
00: b3 15 17 10 46 05 10 00 00 00 00 02 08 00 00 00
10: 0c 00 00 f8 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 b3 15 80 00
30: 00 00 e0 fb 60 00 00 00 00 00 00 00 ff 01 00 00

[2] for i in `seq 8 31`;  do echo 1 >  /sys/devices/system/cpu/cpu$i/online; done
[3] for i in `seq 8 31`;  do echo 0 >  /sys/devices/system/cpu/cpu$i/online; done

Thanks,
-- Dexuan