Message-ID: <20240419124632.60294-1-oxana@cloudflare.com>
Date: Fri, 19 Apr 2024 13:45:47 +0100
From: Oxana Kharitonova <oxana@...udflare.com>
To: netdev@...r.kernel.org
Cc: saeedm@...dia.com,
leon@...nel.org,
davem@...emloft.net,
edumazet@...gle.com,
kuba@...nel.org,
pabeni@...hat.com,
rrameshbabu@...dia.com,
oxana@...udflare.com,
kernel-team@...udflare.com
Subject: mlx5 driver fails to detect NIC in 6.6.28
Hello,
The NIC is no longer detected with Linux 6.6.28. We observed the problem on
two servers; after reverting the kernel to 6.6.25 (our current stable version),
everything returned to normal.
We suspect commit "net/mlx5e: Do not produce metadata freelist entries in
Tx port ts WQE xmit", but we have not bisected yet.
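Below are the ifconfig and lspci output (only the loopback interface is
present, although both ConnectX-4 Lx ports show up on the PCI bus), followed
by the kernel log.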
root@...alhost:~# ifconfig
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 1000 (Local Loopback)
RX packets 80 bytes 6480 (6.3 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 80 bytes 6480 (6.3 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
root@...alhost:~# lspci | grep Eth
c1:00.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
c1:00.1 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
[ 23.519113] RIP: 0010:esw_port_metadata_get (drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c:4095 drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c:2442) mlx5_core
[ 23.524293] usb 2-1: new SuperSpeed USB device number 2 using xhci_hcd
[ 23.528602] Code: eb 8e 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 53 48 89 d3 e8 f2 5d ea c9 48 8b 80 b0 09 00 00 <8b> 80 18 11 00 00 88 03 31 c0 80 23 01 5b e9 38 1f f3 c9 0f 1f 84
All code
========
0: eb 8e jmp 0xffffffffffffff90
2: 0f 1f 00 nopl (%rax)
5: 90 nop
6: 90 nop
7: 90 nop
8: 90 nop
9: 90 nop
a: 90 nop
b: 90 nop
c: 90 nop
d: 90 nop
e: 90 nop
f: 90 nop
10: 90 nop
11: 90 nop
12: 90 nop
13: 90 nop
14: 90 nop
15: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
1a: 53 push %rbx
1b: 48 89 d3 mov %rdx,%rbx
1e: e8 f2 5d ea c9 call 0xffffffffc9ea5e15
23: 48 8b 80 b0 09 00 00 mov 0x9b0(%rax),%rax
2a:* 8b 80 18 11 00 00 mov 0x1118(%rax),%eax <-- trapping instruction
30: 88 03 mov %al,(%rbx)
32: 31 c0 xor %eax,%eax
34: 80 23 01 andb $0x1,(%rbx)
37: 5b pop %rbx
38: e9 38 1f f3 c9 jmp 0xffffffffc9f31f75
3d: 0f .byte 0xf
3e: 1f (bad)
3f: 84 .byte 0x84
Code starting with the faulting instruction
===========================================
0: 8b 80 18 11 00 00 mov 0x1118(%rax),%eax
6: 88 03 mov %al,(%rbx)
8: 31 c0 xor %eax,%eax
a: 80 23 01 andb $0x1,(%rbx)
d: 5b pop %rbx
e: e9 38 1f f3 c9 jmp 0xffffffffc9f31f4b
13: 0f .byte 0xf
14: 1f (bad)
15: 84 .byte 0x84
[ 23.528604] RSP: 0018:ffffc9000dbbfba8 EFLAGS: 00010282
[ 23.530802] hub 1-1:1.0: 4 ports detected
[ 23.537574] RAX: 0000000000000000 RBX: ffffc9000dbbfbfc RCX: 0000000000000028
[ 23.537576] RDX: ffffc9000dbbfbfc RSI: 0000000000000013 RDI: ffff88811ec38000
[ 23.537578] RBP: ffffffffc23fa560 R08: 0000000000000000 R09: 0000000000000000
[ 23.537580] R10: 0000000000036ea0 R11: 0000000000000dc0 R12: ffff889850104f00
[ 23.547568] usb 2-1: New USB device found, idVendor=05e3, idProduct=0620, bcdDevice=93.03
[ 23.564222] R13: ffff8881075ca840 R14: ffff88811ec38000 R15: 0000000000000000
[ 23.564224] FS: 0000000000000000(0000) GS:ffff88843fa00000(0000) knlGS:0000000000000000
[ 23.564226] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 23.570143] usb 2-1: New USB device strings: Mfr=1, Product=2, SerialNumber=0
[ 23.574839] CR2: 0000000000001118 CR3: 0000000c7af5c000 CR4: 0000000000350ef0
[ 23.574841] Call Trace:
[ 23.574844] <TASK>
[ 23.574846] ? __die (arch/x86/kernel/dumpstack.c:421 arch/x86/kernel/dumpstack.c:434)
[ 23.590485] ? page_fault_oops (arch/x86/mm/fault.c:707)
[ 23.590490] ? get_page_from_freelist (mm/page_alloc.c:1553 mm/page_alloc.c:3177)
[ 23.606129] ? exc_page_fault (arch/x86/include/asm/irqflags.h:37 arch/x86/include/asm/irqflags.h:72 arch/x86/mm/fault.c:1504 arch/x86/mm/fault.c:1552)
[ 23.655856] ? asm_exc_page_fault (arch/x86/include/asm/idtentry.h:570)
[ 23.671501] ? esw_port_metadata_get (drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c:4095 drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c:2442) mlx5_core
[ 23.790812] ? esw_port_metadata_get (drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c:4095 drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c:2442) mlx5_core
[ 23.955180] devlink_nl_param_fill.constprop.0 (net/devlink/param.c:268)
[ 23.961276] ? __alloc_skb (net/core/skbuff.c:651 (discriminator 1))
[ 23.974490] ? srso_return_thunk (arch/x86/lib/retpoline.S:217)
[ 23.987013] ? __kmalloc_node_track_caller (mm/slab_common.c:1025 mm/slab_common.c:1046)
[ 23.987018] ? srso_return_thunk (arch/x86/lib/retpoline.S:217)
[ 23.997812] ? kmalloc_reserve (net/core/skbuff.c:584)
[ 23.997816] ? srso_return_thunk (arch/x86/lib/retpoline.S:217)
[ 24.007217] ? __alloc_skb (net/core/skbuff.c:666)
[ 24.007222] devlink_param_notify.constprop.0 (net/devlink/param.c:354 net/devlink/param.c:330)
[ 24.055512] devl_params_register (net/devlink/param.c:686 (discriminator 1))
[ 24.055516] esw_offloads_init (drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c:2482) mlx5_core
[ 24.065009] mlx5_eswitch_init (drivers/net/ethernet/mellanox/mlx5/core/eswitch.c:1872) mlx5_core
[ 24.165229] mlx5_init_one_devl_locked (drivers/net/ethernet/mellanox/mlx5/core/main.c:1022 drivers/net/ethernet/mellanox/mlx5/core/main.c:1447) mlx5_core
[ 24.177007] probe_one (drivers/net/ethernet/mellanox/mlx5/core/main.c:1507 drivers/net/ethernet/mellanox/mlx5/core/main.c:1947) mlx5_core
[ 24.187296] local_pci_probe (drivers/pci/pci-driver.c:325)
[ 24.196698] work_for_cpu_fn (kernel/workqueue.c:5618 (discriminator 1))
[ 24.205988] process_one_work (kernel/workqueue.c:2632)
[ 24.215612] worker_thread (kernel/workqueue.c:2694 (discriminator 2) kernel/workqueue.c:2781 (discriminator 2))
[ 24.224999] ? __pfx_worker_thread (kernel/workqueue.c:2727)
[ 24.234912] kthread (kernel/kthread.c:388)
[ 24.243647] ? __pfx_kthread (kernel/kthread.c:341)
[ 24.252988] ret_from_fork (arch/x86/kernel/process.c:153)
[ 24.262103] ? __pfx_kthread (kernel/kthread.c:341)
[ 24.271347] ret_from_fork_asm (arch/x86/entry/entry_64.S:314)
[ 24.280754] </TASK>
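The register dump is consistent with a NULL pointer dereference: RAX is 0 at the trapping instruction mov 0x1118(%rax),%eax, and CR2 (the faulting address) is 0x1118, i.e. 0 + 0x1118. The instruction just before it, mov 0x9b0(%rax),%rax, had loaded that 0 from offset 0x9b0 of a pointer returned by the preceding call. The minimal Python sketch below recomputes this from the logged bytes and registers. It decodes only the two mov encodings seen above (opcode 0x8B with a disp32 and an optional REX.W prefix) and is not a general x86 decoder; which structure fields the offsets 0x9b0 and 0x1118 correspond to is not established by the log.

```python
import struct

# Register numbers 0-7 as encoded in ModRM (REX.R/REX.B extensions not handled).
REGS64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]
REGS32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def decode_mov_load(code: bytes):
    """Decode 'mov disp32(%base),%reg' (opcode 0x8B, ModRM mod=0b10).

    Only this single form is handled, with an optional REX.W prefix.
    Returns (dest_register, base_register, signed_displacement).
    """
    wide = False
    if (code[0] & 0xF0) == 0x40:            # REX prefix byte
        rex = code[0]
        if rex & 0b0111:
            raise ValueError("REX.R/X/B not handled in this sketch")
        wide = bool(rex & 0b1000)            # REX.W selects a 64-bit operand
        code = code[1:]
    if code[0] != 0x8B:
        raise ValueError("expected opcode 0x8B (mov reg, r/m)")
    modrm = code[1]
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 0b111, modrm & 0b111
    if mod != 0b10 or rm == 0b100:
        raise ValueError("only mod=10 without a SIB byte is handled")
    (disp,) = struct.unpack_from("<i", code, 2)  # signed little-endian disp32
    dest = (REGS64 if wide else REGS32)[reg]
    return dest, REGS64[rm], disp


# Instruction bytes copied from the "Code:" line of the oops.
load_ptr = bytes.fromhex("488b80b0090000")  # 23: mov 0x9b0(%rax),%rax
trapping = bytes.fromhex("8b8018110000")    # 2a: mov 0x1118(%rax),%eax

# Register state copied from the oops.
RAX = 0x0000000000000000
CR2 = 0x0000000000001118

print(decode_mov_load(load_ptr))
dest, base, disp = decode_mov_load(trapping)
fault_addr = (RAX + disp) & (2**64 - 1)     # wrap like a 64-bit address
print(f"mov {disp:#x}(%{base}),%{dest} -> {fault_addr:#x}, matches CR2: {fault_addr == CR2}")
```

Because the effective address equals CR2 exactly, the fault comes from reading a field through a NULL pointer rather than from a corrupted, non-NULL pointer.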