Message-ID: <CAHTA-uaH9w2LqQdxY4b=7q9WQsuA6ntg=QRKrsf=mPfNBmM5pw@mail.gmail.com>
Date: Wed, 5 Feb 2025 17:09:13 -0600
From: Mitchell Augustin <mitchell.augustin@...onical.com>
To: saeedm@...dia.com, leon@...nel.org, tariqt@...dia.com,
andrew+netdev@...n.ch, "David S. Miller" <davem@...emloft.net>,
Eric Dumazet <edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
netdev@...r.kernel.org, linux-rdma@...r.kernel.org,
linux-kernel@...r.kernel.org
Cc: Talat Batheesh <talatb@...dia.com>, Feras Daoud <ferasda@...dia.com>
Subject: modprobe mlx5_core on OCI bare-metal instance causes unrecoverable
hang and I/O error
Hello,
I have identified a bug in the mlx5_core module (or some related component).
Running the following on a freshly provisioned Oracle Cloud bare-metal
node with the configuration described in [0] reliably causes the entire
instance to become unresponsive:
rmmod mlx5_ib; rmmod mlx5_core; modprobe mlx5_core
This also produces the following output:
[ 331.267175] I/O error, dev sda, sector 35602992 op 0x0:(READ) flags 0x80700 phys_seg 33 prio class 0
[ 331.376575] I/O error, dev sda, sector 35600432 op 0x0:(READ) flags 0x84700 phys_seg 320 prio class 0
[ 331.487509] I/O error, dev sda, sector 35595064 op 0x0:(READ) flags 0x80700 phys_seg 159 prio class 0
[ 528.386085] INFO: task kworker/u290:0:453 blocked for more than 122 seconds.
[ 528.470497] Not tainted 6.14.0-rc1 #1
[ 528.520546] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 528.615268] INFO: task kworker/u290:3:820 blocked for more than 123 seconds.
[ 528.699641] Not tainted 6.14.0-rc1 #1
[ 528.749690] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 528.843577] INFO: task jbd2/sda1-8:1128 blocked for more than 123 seconds.
[ 528.925922] Not tainted 6.14.0-rc1 #1
[ 528.975971] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 529.069854] INFO: task systemd-journal:1218 blocked for more than 123 seconds.
[ 529.156382] Not tainted 6.14.0-rc1 #1
[ 529.206441] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 529.300407] INFO: task kworker/u290:4:1828 blocked for more than 123 seconds.
[ 529.385892] Not tainted 6.14.0-rc1 #1
[ 529.435942] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 529.529973] INFO: task rs:main Q:Reg:2184 blocked for more than 124 seconds.
[ 529.614607] Not tainted 6.14.0-rc1 #1
[ 529.664656] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 529.758690] INFO: task gomon:2258 blocked for more than 124 seconds.
[ 529.834832] Not tainted 6.14.0-rc1 #1
[ 529.884887] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 529.978867] INFO: task kworker/u290:5:3255 blocked for more than 124 seconds.
[ 530.064351] Not tainted 6.14.0-rc1 #1
[ 530.114398] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 651.265588] INFO: task kworker/u290:0:453 blocked for more than 245 seconds.
[ 651.349980] Not tainted 6.14.0-rc1 #1
[ 651.400028] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 651.494126] INFO: task kworker/u290:3:820 blocked for more than 245 seconds.
[ 651.578543] Not tainted 6.14.0-rc1 #1
[ 651.628600] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
I tried using the function_graph tracer to determine whether any
functions within mlx5_core were executing for an excessive amount of
time, but did not find anything conclusive.
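For reference, my tracing setup looked roughly like the following
(standard tracefs interface; the module filter, option, and output path
below are illustrative rather than an exact transcript):
    cd /sys/kernel/tracing
    echo function_graph > current_tracer
    echo '*:mod:mlx5_core' > set_ftrace_filter   # limit tracing to mlx5_core's functions
    echo 1 > options/funcgraph-abstime           # absolute timestamps in the graph output
    echo 1 > tracing_on
    cat trace_pipe > /tmp/mlx5_trace.txt &
    # ...then re-run the rmmod/modprobe sequence above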
The stack trace I see when I force the kernel to panic once a hang has
been detected is at [1]. I did this three times, and each trace was
similar in that all of them referred to ext4_* functions, which lines
up with the I/O errors I see each time.
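(For anyone trying to reproduce this: one way to force a panic when a
hang is detected is via the standard hung-task sysctls, e.g.:
    sysctl -w kernel.hung_task_timeout_secs=120
    sysctl -w kernel.hung_task_panic=1
The values above are only an example; the resulting backtrace can be
captured over a serial console or with a kdump/crash setup.)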
I should also note that I was able to trigger a similar I/O error once
on a DGX A100 (running the Ubuntu 6.8.0-52-generic kernel, with modules
installed via a repackaged version of DOCA-OFED), but I have not been
able to reliably reproduce the issue on that machine with the pure
upstream in-box drivers, as I can on the OCI instance. (I was also
still able to interact with the A100, but attempting to run any command
resulted in a "command not found" error, which again lines up with the
idea that this might be interfering with ext4 somehow.)
Has anything like this been observed by other users?
Please let me know if there is anything else I should do or provide to
help debug this issue, or if there is already a known root cause.
[0] System specs:
OCI bare-metal node, BM.Optimized3.36 shape, with RoCE connectivity to another identical node
Kernel: mainline @ 6.14.0-rc1 with this config: https://pastebin.ubuntu.com/p/5Jm2WFZY62/
ibstat output: https://pastebin.ubuntu.com/p/S5dfFSdDxd/
lscpu output: https://pastebin.ubuntu.com/p/dfPyYQWnhX/

[1] Stack trace on panic: https://pastebin.ubuntu.com/p/kxw2dsmwFV/
--
Mitchell Augustin
Software Engineer - Ubuntu Partner Engineering