Message-ID: <ABD64250-0CA0-4F4F-94D3-9AA4497E3518@oracle.com>
Date: Mon, 13 Oct 2025 14:04:20 +0000
From: Haakon Bugge <haakon.bugge@...cle.com>
To: Jacob Moroni <jmoroni@...gle.com>
CC: Jason Gunthorpe <jgg@...pe.ca>, Leon Romanovsky <leon@...nel.org>,
	Sean Hefty <shefty@...dia.com>, Vlad Dumitrescu <vdumitrescu@...dia.com>,
	Or Har-Toov <ohartoov@...dia.com>,
	Manjunath Patil <manjunath.b.patil@...cle.com>,
	OFED mailing list <linux-rdma@...r.kernel.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH for-next] RDMA/cm: Rate limit destroy CM ID timeout error message
Hi Jason and Jake,
> On 25 Sep 2025, at 13:29, Haakon Bugge <haakon.bugge@...cle.com> wrote:
>
> [snip]
Got another incident this weekend. The first cm_destroy_id_wait_timeout() message appeared at:
Oct 10 12:10:31 lab64 kernel: cm_destroy_id_wait_timeout: cm_id=00000000803feac1 timed out. state 2 -> 0, refcnt=2
[]
Health error in close proximity in time:
# devlink health show ens4f14 reporter tx
auxiliary/mlx5_core.eth.8/524288:
reporter tx
state healthy error 256 recover 256 last_dump_date 2025-10-10 last_dump_time 12:08:47 grace_period 500 auto_recover true auto_dump true
And also an ndev TX timeout in close proximity in time:
# fgrep "TX timeout on queue" /var/log/messages-20251012
Oct 10 12:11:16 lab64 kernel: mlx5_core 0000:13:01.0 ens4f14: TX timeout on queue: 5, SQ: 0xfb43, CQ: 0x66b, SQ Cons: 0x0 SQ Prod: 0x4, usecs since last trans: 20992000
Oct 10 13:01:05 lab64 kernel: mlx5_core 0000:13:01.0 ens4f14: TX timeout on queue: 3, SQ: 0x1088a, CQ: 0xcae, SQ Cons: 0x0 SQ Prod: 0x3c5, usecs since last trans: 16390000
[]
4140 cm_ids in this situation:
# dmesg | grep cm_destroy_id_wait_timeout | awk '{ print $3; }' | sort -n | uniq | wc
4140 4140 95220
Running ib_send_bw against the suspected bad device:
[1]- Running ib_send_bw -R -d mlx5_7 -i 2 &
[2]+ Running ib_send_bw -R -d mlx5_7 -i 1 192.168.64.15 &
Said processes hang, and we got one more cm_id in this situation:
# dmesg | grep cm_destroy_id_wait_timeout | awk '{ print $3; }' | sort -n | uniq | wc
4141 4141 95243
Stacks:
# cat /proc/435477/stack
[<0>] ucma_get_event+0xf0/0x240 [rdma_ucm]
[<0>] ucma_write+0xf8/0x170 [rdma_ucm]
[<0>] vfs_write+0xd4/0x294
[<0>] ksys_write+0xbb/0xe8
[<0>] do_syscall_64+0x35/0x87
[<0>] entry_SYSCALL_64_after_hwframe+0x6e/0x0
# cat /proc/436606/stack
[<0>] cm_destroy_id+0x16a/0x61c [ib_cm]
[<0>] _destroy_id+0x5a/0x230 [rdma_cm]
[<0>] ucma_destroy_private_ctx+0x354/0x370 [rdma_ucm]
[<0>] ucma_close+0x7a/0xb0 [rdma_ucm]
[<0>] __fput+0x96/0x24e
[<0>] task_work_run+0x65/0x9b
[<0>] do_exit+0x215/0x489
[<0>] do_group_exit+0x33/0x96
[<0>] get_signal+0x14a/0x967
[<0>] arch_do_signal_or_restart+0x10d/0x12a
[<0>] exit_to_user_mode_loop+0xc4/0x1b1
[<0>] exit_to_user_mode_prepare+0x124/0x12c
[<0>] syscall_exit_to_user_mode+0x18/0x42
[<0>] do_syscall_64+0x42/0x87
[<0>] entry_SYSCALL_64_after_hwframe+0x6e/0x0
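For reference, process 436606 is stuck on the destroy side; as far as I can tell the wait in cm_destroy_id() is roughly the following (a paraphrase of upstream drivers/infiniband/core/cm.c, not a verbatim quote, so please verify the names and the timeout against the kernel in question):

        /* Sketch of the destroy wait in cm_destroy_id();
         * CM_DESTROY_ID_WAIT_TIMEOUT is 10 seconds in my reading of
         * upstream cm.c. */
        do {
                ret = wait_for_completion_timeout(&cm_id_priv->comp,
                                msecs_to_jiffies(CM_DESTROY_ID_WAIT_TIMEOUT));
                if (!ret)
                        /* nobody dropped the last reference; complain and keep waiting */
                        cm_destroy_id_wait_timeout(cm_id, old_state);
        } while (!ret);

So every stuck cm_id re-emits the message above each time the timeout expires, for as long as the missing reference is never put.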
Distribution of old_state:
# dmesg | grep cm_destroy_id_wait_timeout | awk '{ ++state[$7]; } END {for (s in state) printf("%2d: %d\n", s, state[s]); }'
0: 192
2: 10546
6: 102
11: 513
13: 148
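Decoding the old_state column via enum ib_cm_state (the numbering below is assumed from the enum order in include/rdma/ib_cm.h and should be double-checked against this kernel):

        enum ib_cm_state {
                IB_CM_IDLE,          /*  0 */
                IB_CM_LISTEN,        /*  1 */
                IB_CM_REQ_SENT,      /*  2 */
                IB_CM_REQ_RCVD,      /*  3 */
                IB_CM_MRA_REQ_SENT,  /*  4 */
                IB_CM_MRA_REQ_RCVD,  /*  5 */
                IB_CM_REP_SENT,      /*  6 */
                IB_CM_REP_RCVD,      /*  7 */
                IB_CM_MRA_REP_SENT,  /*  8 */
                IB_CM_MRA_REP_RCVD,  /*  9 */
                IB_CM_ESTABLISHED,   /* 10 */
                IB_CM_DREQ_SENT,     /* 11 */
                IB_CM_DREQ_RCVD,     /* 12 */
                IB_CM_TIMEWAIT,      /* 13 */
                IB_CM_SIDR_REQ_SENT, /* 14 */
                IB_CM_SIDR_REQ_RCVD  /* 15 */
        };

With that reading, the bulk of the stuck cm_ids were in REQ_SENT, followed by DREQ_SENT, IDLE, TIMEWAIT and REP_SENT; apart from IDLE, these are all states in which the CM is waiting for the peer to answer a MAD or for timewait to expire.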
My take is that the VF in question here gets whacked, and that the MAD timeout handling does not interact well with how the CMA handles those timeouts.
Thxs, Håkon