[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZGW1shjKZFCn28LW@TONYMAC-ALIBABA.local>
Date: Thu, 18 May 2023 13:20:50 +0800
From: Tony Lu <tonylu@...ux.alibaba.com>
To: Wen Gu <guwen@...ux.alibaba.com>
Cc: kgraul@...ux.ibm.com, wenjia@...ux.ibm.com, jaka@...ux.ibm.com,
davem@...emloft.net, edumazet@...gle.com, kuba@...nel.org,
pabeni@...hat.com, linux-s390@...r.kernel.org,
netdev@...r.kernel.org, linux-kernel@...r.kernel.org,
liuyacan@...p.netease.com
Subject: Re: [PATCH net] net/smc: Reset connection when trying to use SMCRv2
fails.
On Thu, May 18, 2023 at 01:14:55PM +0800, Wen Gu wrote:
> We found a crash when using SMCRv2 with 2 Mellanox ConnectX-4. It
> can be reproduced by:
>
> - smc_run nginx
> - smc_run wrk -t 32 -c 500 -d 30 http://<ip>:<port>
>
> BUG: kernel NULL pointer dereference, address: 0000000000000014
> #PF: supervisor read access in kernel mode
> #PF: error_code(0x0000) - not-present page
> PGD 8000000108713067 P4D 8000000108713067 PUD 151127067 PMD 0
> Oops: 0000 [#1] PREEMPT SMP PTI
> CPU: 4 PID: 2441 Comm: kworker/4:249 Kdump: loaded Tainted: G W E 6.4.0-rc1+ #42
> Workqueue: smc_hs_wq smc_listen_work [smc]
> RIP: 0010:smc_clc_send_confirm_accept+0x284/0x580 [smc]
> RSP: 0018:ffffb8294b2d7c78 EFLAGS: 00010a06
> RAX: ffff8f1873238880 RBX: ffffb8294b2d7dc8 RCX: 0000000000000000
> RDX: 00000000000000b4 RSI: 0000000000000001 RDI: 0000000000b40c00
> RBP: ffffb8294b2d7db8 R08: ffff8f1815c5860c R09: 0000000000000000
> R10: 0000000000000400 R11: 0000000000000000 R12: ffff8f1846f56180
> R13: ffff8f1815c5860c R14: 0000000000000001 R15: 0000000000000001
> FS: 0000000000000000(0000) GS:ffff8f1aefd00000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000014 CR3: 00000001027a0001 CR4: 00000000003706e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Call Trace:
> <TASK>
> ? mlx5_ib_map_mr_sg+0xa1/0xd0 [mlx5_ib]
> ? smcr_buf_map_link+0x24b/0x290 [smc]
> ? __smc_buf_create+0x4ee/0x9b0 [smc]
> smc_clc_send_accept+0x4c/0xb0 [smc]
> smc_listen_work+0x346/0x650 [smc]
> ? __schedule+0x279/0x820
> process_one_work+0x1e5/0x3f0
> worker_thread+0x4d/0x2f0
> ? __pfx_worker_thread+0x10/0x10
> kthread+0xe5/0x120
> ? __pfx_kthread+0x10/0x10
> ret_from_fork+0x2c/0x50
> </TASK>
>
> During the CLC handshake, server sequentially tries available SMCRv2
> and SMCRv1 devices in smc_listen_work().
>
> If an SMCRv2 device is found. SMCv2 based link group and link will be
> assigned to the connection. Then assumed that some buffer assignment
> errors happen later in the CLC handshake, such as RMB registration
> failure, server will give up SMCRv2 and try SMCRv1 device instead. But
> the resources assigned to the connection won't be reset.
>
> When server tries SMCRv1 device, the connection creation process will
> be executed again. Since conn->lnk has been assigned when trying SMCRv2,
> it will not be set to the correct SMCRv1 link in
> smcr_lgr_conn_assign_link(). So in such situation, conn->lgr points to
> correct SMCRv1 link group but conn->lnk points to the SMCRv2 link
> mistakenly.
>
> Then in smc_clc_send_confirm_accept(), conn->rmb_desc->mr[link->link_idx]
> will be accessed. Since the link->link_idx is not correct, the related
> MR may not have been initialized, so crash happens.
>
> | Try SMCRv2 device first
> | |-> conn->lgr: assign existed SMCRv2 link group;
> | |-> conn->link: assign existed SMCRv2 link (link_idx may be 1 in SMC_LGR_SYMMETRIC);
> | |-> sndbuf & RMB creation fails, quit;
> |
> | Try SMCRv1 device then
> | |-> conn->lgr: create SMCRv1 link group and assign;
> | |-> conn->link: keep SMCRv2 link mistakenly;
> | |-> sndbuf & RMB creation succeed, only RMB->mr[link_idx = 0]
> | initialized.
> |
> | Then smc_clc_send_confirm_accept() accesses
> | conn->rmb_desc->mr[conn->link->link_idx, which is 1], then crash.
> v
>
> This patch tries to fix this by cleaning conn->lnk before assigning
> link. In addition, it is better to reset the connection and clean the
> resources assigned if trying SMCRv2 failed in buffer creation or
> registration.
>
> Fixes: e49300a6bf62 ("net/smc: add listen processing for SMC-Rv2")
> Link: https://lore.kernel.org/r/20220523055056.2078994-1-liuyacan@corp.netease.com/
> Signed-off-by: Wen Gu <guwen@...ux.alibaba.com>
LGTM, thanks for your detailed analysis.
Reviewed-by: Tony Lu <tonylu@...ux.alibaba.com>
> ---
> net/smc/af_smc.c | 9 +++++++--
> net/smc/smc_core.c | 1 +
> 2 files changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
> index 50c38b6..538e9c6 100644
> --- a/net/smc/af_smc.c
> +++ b/net/smc/af_smc.c
> @@ -2000,8 +2000,10 @@ static int smc_listen_rdma_init(struct smc_sock *new_smc,
> return rc;
>
> /* create send buffer and rmb */
> - if (smc_buf_create(new_smc, false))
> + if (smc_buf_create(new_smc, false)) {
> + smc_conn_abort(new_smc, ini->first_contact_local);
> return SMC_CLC_DECL_MEM;
> + }
>
> return 0;
> }
> @@ -2217,8 +2219,11 @@ static void smc_find_rdma_v2_device_serv(struct smc_sock *new_smc,
> smcr_version = ini->smcr_version;
> ini->smcr_version = SMC_V2;
> rc = smc_listen_rdma_init(new_smc, ini);
> - if (!rc)
> + if (!rc) {
> rc = smc_listen_rdma_reg(new_smc, ini->first_contact_local);
> + if (rc)
> + smc_conn_abort(new_smc, ini->first_contact_local);
> + }
> if (!rc)
> return;
> ini->smcr_version = smcr_version;
> diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
> index 4543567..3f465fa 100644
> --- a/net/smc/smc_core.c
> +++ b/net/smc/smc_core.c
> @@ -127,6 +127,7 @@ static int smcr_lgr_conn_assign_link(struct smc_connection *conn, bool first)
> int i, j;
>
> /* do link balancing */
> + conn->lnk = NULL; /* reset conn->lnk first */
> for (i = 0; i < SMC_LINKS_PER_LGR_MAX; i++) {
> struct smc_link *lnk = &conn->lgr->lnk[i];
>
> --
> 1.8.3.1
Powered by blists - more mailing lists