linux-kernel - Re: [PATCH RFC] RDMA/cm: add timeout to cm_destroy

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <afab7f4d-d8cd-4582-b46c-d4fa7673f429@oracle.com>
Date: Thu, 7 Mar 2024 10:00:49 -0800
From: Manjunath Patil <manjunath.b.patil@...cle.com>
To: Leon Romanovsky <leon@...nel.org>
Cc: dledford@...hat.com, jgg@...pe.ca, linux-rdma@...r.kernel.org,
        linux-kernel@...r.kernel.org, rama.nichanamatlu@...cle.com
Subject: Re: [PATCH RFC] RDMA/cm: add timeout to cm_destroy_id wait


On 3/7/24 1:46 AM, Leon Romanovsky wrote:
> On Tue, Mar 05, 2024 at 02:59:11PM -0800, Manjunath Patil wrote:
>>
>>
>> On 3/3/24 1:58 AM, Leon Romanovsky wrote:
>>> On Tue, Feb 27, 2024 at 12:00:17PM -0800, Manjunath Patil wrote:
>>>> Add timeout to cm_destroy_id, so that userspace can trigger any data
>>>> collection that would help in analyzing the cause of delay in destroying
>>>> the cm_id.
>>>
>>> Why doesn't rdmatool resource cm_id dump help to see stalled cm_ids?
>> Wouldn't this require us to know cm_id before hand?
>>
>> I am unfamiliar with rdmatool. Can you explain how I use it to detect a stalled connection?
> 
> Please see it if it can help:
> https://urldefense.com/v3/__https://www.man7.org/linux/man-pages/man8/rdma-resource.8.html__;!!ACWV5N9M2RV99hQ!K0TW8CNb7aCJsJbLV7Uhy9K_rOdMuHsHoUki2GIaxxQt0hd31pUCi9FeGMx2hTNI0j48ju1D6KiIrDNJ254$
> rdma resource show cm_id ...
> 
Thank you for the reference. I will go through it.
>> I wouldn't know cm_id before hand to track it to see if that is stalled.
>>
>> My intention is to have a script monitor for stalled connections[Ex: one of my connections was stuck in destroying it's cm_id] and trigger things like firmware dumps, enable more logging in related modules, crash node if this takes longer than few minutes etc.
>>
>> The current logic is to, have this timeout trigger a function(which is traceable with ebpf/dtrace) in error path, if more than expected time is spent is destroying the cm_id.
> 
> I'm not against the idea to warn about stalled destroy_id, I'm against
> adding new knob to control this timeout.
> 
Thank you for sharing this. Adding sysctl isn't critical for me. I am happy to get rid of it.
I will send a v2 with control knob removed.

-Thanks,
Manjunath
> Thanks
> 
>>
>> -Thank you,
>> Manjunath
>>>
>>> Thanks
>>>
>>>>
>>>> New noinline function helps dtrace/ebpf programs to hook on to it.
>>>> Existing functionality isn't changed except triggering a probe-able new
>>>> function at every timeout interval.
>>>>
>>>> We have seen cases where CM messages stuck with MAD layer (either due to
>>>> software bug or faulty HCA), leading to cm_id getting stuck in the
>>>> following call stack. This patch helps in resolving such issues faster.
>>>>
>>>> kernel: ... INFO: task XXXX:56778 blocked for more than 120 seconds.
>>>> ...
>>>> 	Call Trace:
>>>> 	__schedule+0x2bc/0x895
>>>> 	schedule+0x36/0x7c
>>>> 	schedule_timeout+0x1f6/0x31f
>>>>    	? __slab_free+0x19c/0x2ba
>>>> 	wait_for_completion+0x12b/0x18a
>>>> 	? wake_up_q+0x80/0x73
>>>> 	cm_destroy_id+0x345/0x610 [ib_cm]
>>>> 	ib_destroy_cm_id+0x10/0x20 [ib_cm]
>>>> 	rdma_destroy_id+0xa8/0x300 [rdma_cm]
>>>> 	ucma_destroy_id+0x13e/0x190 [rdma_ucm]
>>>> 	ucma_write+0xe0/0x160 [rdma_ucm]
>>>> 	__vfs_write+0x3a/0x16d
>>>> 	vfs_write+0xb2/0x1a1
>>>> 	? syscall_trace_enter+0x1ce/0x2b8
>>>> 	SyS_write+0x5c/0xd3
>>>> 	do_syscall_64+0x79/0x1b9
>>>> 	entry_SYSCALL_64_after_hwframe+0x16d/0x0
>>>>
>>>> Signed-off-by: Manjunath Patil <manjunath.b.patil@...cle.com>
>>>> ---
>>>>    drivers/infiniband/core/cm.c | 38 +++++++++++++++++++++++++++++++++++-
>>>>    1 file changed, 37 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
>>>> index ff58058aeadc..03f7b80efa77 100644
>>>> --- a/drivers/infiniband/core/cm.c
>>>> +++ b/drivers/infiniband/core/cm.c
>>>> @@ -34,6 +34,20 @@ MODULE_AUTHOR("Sean Hefty");
>>>>    MODULE_DESCRIPTION("InfiniBand CM");
>>>>    MODULE_LICENSE("Dual BSD/GPL");
>>>> +static unsigned long cm_destroy_id_wait_timeout_sec = 10;
>>>> +
>>>> +static struct ctl_table_header *cm_ctl_table_header;
>>>> +static struct ctl_table cm_ctl_table[] = {
>>>> +	{
>>>> +		.procname	= "destroy_id_wait_timeout_sec",
>>>> +		.data		= &cm_destroy_id_wait_timeout_sec,
>>>> +		.maxlen		= sizeof(cm_destroy_id_wait_timeout_sec),
>>>> +		.mode		= 0644,
>>>> +		.proc_handler	= proc_doulongvec_minmax,
>>>> +	},
>>>> +	{ }
>>>> +};
>>>> +
>>>>    static const char * const ibcm_rej_reason_strs[] = {
>>>>    	[IB_CM_REJ_NO_QP]			= "no QP",
>>>>    	[IB_CM_REJ_NO_EEC]			= "no EEC",
>>>> @@ -1025,10 +1039,20 @@ static void cm_reset_to_idle(struct cm_id_private *cm_id_priv)
>>>>    	}
>>>>    }
>>>> +static noinline void cm_destroy_id_wait_timeout(struct ib_cm_id *cm_id)
>>>> +{
>>>> +	struct cm_id_private *cm_id_priv;
>>>> +
>>>> +	cm_id_priv = container_of(cm_id, struct cm_id_private, id);
>>>> +	pr_err("%s: cm_id=%p timed out. state=%d refcnt=%d\n", __func__,
>>>> +	       cm_id, cm_id->state, refcount_read(&cm_id_priv->refcount));
>>>> +}
>>>> +
>>>>    static void cm_destroy_id(struct ib_cm_id *cm_id, int err)
>>>>    {
>>>>    	struct cm_id_private *cm_id_priv;
>>>>    	struct cm_work *work;
>>>> +	int ret;
>>>>    	cm_id_priv = container_of(cm_id, struct cm_id_private, id);
>>>>    	spin_lock_irq(&cm_id_priv->lock);
>>>> @@ -1135,7 +1159,14 @@ static void cm_destroy_id(struct ib_cm_id *cm_id, int err)
>>>>    	xa_erase(&cm.local_id_table, cm_local_id(cm_id->local_id));
>>>>    	cm_deref_id(cm_id_priv);
>>>> -	wait_for_completion(&cm_id_priv->comp);
>>>> +	do {
>>>> +		ret = wait_for_completion_timeout(&cm_id_priv->comp,
>>>> +						  msecs_to_jiffies(
>>>> +				cm_destroy_id_wait_timeout_sec * 1000));
>>>> +		if (!ret) /* timeout happened */
>>>> +			cm_destroy_id_wait_timeout(cm_id);
>>>> +	} while (!ret);
>>>> +
>>>>    	while ((work = cm_dequeue_work(cm_id_priv)) != NULL)
>>>>    		cm_free_work(work);
>>>> @@ -4505,6 +4536,10 @@ static int __init ib_cm_init(void)
>>>>    	ret = ib_register_client(&cm_client);
>>>>    	if (ret)
>>>>    		goto error3;
>>>> +	cm_ctl_table_header = register_net_sysctl(&init_net,
>>>> +						  "net/ib_cm", cm_ctl_table);
>>>> +	if (!cm_ctl_table_header)
>>>> +		pr_warn("ib_cm: couldn't register sysctl path, using default values\n");
>>>>    	return 0;
>>>>    error3:
>>>> @@ -4522,6 +4557,7 @@ static void __exit ib_cm_cleanup(void)
>>>>    		cancel_delayed_work(&timewait_info->work.work);
>>>>    	spin_unlock_irq(&cm.lock);
>>>> +	unregister_net_sysctl_table(cm_ctl_table_header);
>>>>    	ib_unregister_client(&cm_client);
>>>>    	destroy_workqueue(cm.wq);
>>>> -- 
>>>> 2.31.1
>>>>
>>>>