Message-ID: <ABD64250-0CA0-4F4F-94D3-9AA4497E3518@oracle.com>
Date: Mon, 13 Oct 2025 14:04:20 +0000
From: Haakon Bugge <haakon.bugge@...cle.com>
To: Jacob Moroni <jmoroni@...gle.com>
CC: Jason Gunthorpe <jgg@...pe.ca>, Leon Romanovsky <leon@...nel.org>,
        Sean Hefty <shefty@...dia.com>,
        Vlad Dumitrescu <vdumitrescu@...dia.com>,
        Or Har-Toov <ohartoov@...dia.com>,
        Manjunath Patil <manjunath.b.patil@...cle.com>,
        OFED mailing list <linux-rdma@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH for-next] RDMA/cm: Rate limit destroy CM ID timeout error message

Hi Jason and Jake,

> On 25 Sep 2025, at 13:29, Haakon Bugge <haakon.bugge@...cle.com> wrote:
> 
> [snip]

Got another incident this weekend. The first cm_destroy_id_wait_timeout() message was logged at:

Oct 10 12:10:31 lab64 kernel: cm_destroy_id_wait_timeout: cm_id=00000000803feac1 timed out. state 2 -> 0, refcnt=2
[]
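
(For reference, decoding the numeric states: they index enum ib_cm_state
from include/rdma/ib_cm.h, reproduced below as I read it in current
upstream, so "state 2 -> 0" is IB_CM_REQ_SENT -> IB_CM_IDLE. Worth
double-checking against the running kernel.)

enum ib_cm_state {
	IB_CM_IDLE,		/*  0 */
	IB_CM_LISTEN,		/*  1 */
	IB_CM_REQ_SENT,		/*  2 */
	IB_CM_REQ_RCVD,		/*  3 */
	IB_CM_MRA_REQ_SENT,	/*  4 */
	IB_CM_MRA_REQ_RCVD,	/*  5 */
	IB_CM_REP_SENT,		/*  6 */
	IB_CM_REP_RCVD,		/*  7 */
	IB_CM_MRA_REP_SENT,	/*  8 */
	IB_CM_MRA_REP_RCVD,	/*  9 */
	IB_CM_ESTABLISHED,	/* 10 */
	IB_CM_DREQ_SENT,	/* 11 */
	IB_CM_DREQ_RCVD,	/* 12 */
	IB_CM_TIMEWAIT,		/* 13 */
	IB_CM_SIDR_REQ_SENT,	/* 14 */
	IB_CM_SIDR_REQ_RCVD	/* 15 */
};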

A devlink health error was reported in close proximity in time:

# devlink health show ens4f14 reporter tx
auxiliary/mlx5_core.eth.8/524288:
  reporter tx
    state healthy error 256 recover 256 last_dump_date 2025-10-10 last_dump_time 12:08:47 grace_period 500 auto_recover true auto_dump true

And also netdev TX timeouts in close proximity in time:

# fgrep "TX timeout on queue" /var/log/messages-20251012 
Oct 10 12:11:16 lab64 kernel: mlx5_core 0000:13:01.0 ens4f14: TX timeout on queue: 5, SQ: 0xfb43, CQ: 0x66b, SQ Cons: 0x0 SQ Prod: 0x4, usecs since last trans: 20992000
Oct 10 13:01:05 lab64 kernel: mlx5_core 0000:13:01.0 ens4f14: TX timeout on queue: 3, SQ: 0x1088a, CQ: 0xcae, SQ Cons: 0x0 SQ Prod: 0x3c5, usecs since last trans: 16390000
[]

4140 cm_ids in this situation:

# dmesg | grep cm_destroy_id_wait_timeout | awk '{ print $3; }' | sort -n | uniq | wc
   4140    4140   95220
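
(The uniq above is needed because each stuck cm_id re-logs on every wait
timeout. This is my paraphrase of the destroy-side wait in
drivers/infiniband/core/cm.c, not verbatim, so check your tree:

#define CM_DESTROY_ID_WAIT_TIMEOUT 10000 /* msecs */

	do {
		/* wait for the last cm_id reference to be dropped */
		ret = wait_for_completion_timeout(&cm_id_priv->comp,
				msecs_to_jiffies(CM_DESTROY_ID_WAIT_TIMEOUT));
		if (!ret) /* timed out; log and keep waiting */
			cm_destroy_id_wait_timeout(cm_id, old_state);
	} while (!ret);

So every stuck cm_id logs roughly every 10 seconds until its refcount
drops, which is what floods the log and motivates rate limiting the
message in the first place.)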

We ran ib_send_bw against the suspected bad device:

[1]-  Running                 ib_send_bw -R -d mlx5_7 -i 2 &
[2]+  Running                 ib_send_bw -R -d mlx5_7 -i 1 192.168.64.15 &
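
(If I read the perftest options right, -R makes ib_send_bw establish its
connection through rdma_cm, which is exactly what exercises the CM path
in question; -d selects the device and -i the port.)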

Said processes are hanging, and we got one more cm_id in this situation:

# dmesg | grep cm_destroy_id_wait_timeout | awk '{ print $3; }' | sort -n | uniq | wc
   4141    4141   95243

Stacks:

# cat /proc/435477/stack 
[<0>] ucma_get_event+0xf0/0x240 [rdma_ucm]
[<0>] ucma_write+0xf8/0x170 [rdma_ucm]
[<0>] vfs_write+0xd4/0x294
[<0>] ksys_write+0xbb/0xe8
[<0>] do_syscall_64+0x35/0x87
[<0>] entry_SYSCALL_64_after_hwframe+0x6e/0x0

# cat /proc/436606/stack 
[<0>] cm_destroy_id+0x16a/0x61c [ib_cm]
[<0>] _destroy_id+0x5a/0x230 [rdma_cm]
[<0>] ucma_destroy_private_ctx+0x354/0x370 [rdma_ucm]
[<0>] ucma_close+0x7a/0xb0 [rdma_ucm]
[<0>] __fput+0x96/0x24e
[<0>] task_work_run+0x65/0x9b
[<0>] do_exit+0x215/0x489
[<0>] do_group_exit+0x33/0x96
[<0>] get_signal+0x14a/0x967
[<0>] arch_do_signal_or_restart+0x10d/0x12a
[<0>] exit_to_user_mode_loop+0xc4/0x1b1
[<0>] exit_to_user_mode_prepare+0x124/0x12c
[<0>] syscall_exit_to_user_mode+0x18/0x42
[<0>] do_syscall_64+0x42/0x87
[<0>] entry_SYSCALL_64_after_hwframe+0x6e/0x0
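
That is, one process is blocked in ucma_get_event() waiting for a CM
event that never arrives, while the other is stuck in cm_destroy_id() on
process exit, waiting for cm_id references that are never dropped.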

Distribution of old_state:

# dmesg | grep cm_destroy_id_wait_timeout | awk '{ ++state[$7]; } END {for (s in state) printf("%2d: %d\n", s, state[s]); }'
 0: 192
 2: 10546
 6: 102
11: 513
13: 148
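
(The counts sum to 11501 messages for the ~4100 unique cm_ids, consistent
with the repeated logging described above. Note also that states 2
(IB_CM_REQ_SENT), 6 (IB_CM_REP_SENT), and 11 (IB_CM_DREQ_SENT) are all
states in which the CM is waiting for a response to a MAD it has sent,
which fits the theory below.)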



My take is that the VF in question here gets whacked, and that the CM's MAD timeout handling does not interact well with how the CMA handles those timeouts.


Thxs, Håkon

