Message-ID: <cab2575c-037d-4d9d-896c-3bd2c64c9a0b@suse.de>
Date: Mon, 20 Jan 2025 14:45:46 +0100
From: Hannes Reinecke <hare@...e.de>
To: Daniel Wagner <wagi@...nel.org>, James Smart <james.smart@...adcom.com>,
 Keith Busch <kbusch@...nel.org>, Christoph Hellwig <hch@....de>,
 Sagi Grimberg <sagi@...mberg.me>, Paul Ely <paul.ely@...adcom.com>
Cc: linux-nvme@...ts.infradead.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v4 3/3] nvme-fc: do not ignore connectivity loss during
 connecting

On 1/9/25 14:30, Daniel Wagner wrote:
> When a connectivity loss occurs while nvme_fc_create_association is
> being executed, it's possible that the ctrl ends up stuck in the LIVE
> state:
> 
>    1) nvme nvme10: NVME-FC{10}: create association : ...
>    2) nvme nvme10: NVME-FC{10}: controller connectivity lost.
>                    Awaiting Reconnect
>       nvme nvme10: queue_size 128 > ctrl maxcmd 32, reducing to maxcmd
>    3) nvme nvme10: Could not set queue count (880)
>       nvme nvme10: Failed to configure AEN (cfg 900)
>    4) nvme nvme10: NVME-FC{10}: controller connect complete
>    5) nvme nvme10: failed nvme_keep_alive_end_io error=4
> 
> A connection attempt starts at 1) and the ctrl is in the CONNECTING
> state. Shortly afterwards, the LLDD detects a connectivity loss event
> and calls nvme_fc_ctrl_connectivity_loss at 2). Because the ctrl is
> still in the CONNECTING state, this event is ignored.
> 
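To illustrate the ignored event, here is a stand-alone, simplified model
of the pre-patch handler behaviour (a sketch based on the description
above, not the actual driver code; the CONNECTING case in particular is
inferred, and the names are illustrative):

/* Model of the pre-patch connectivity loss handler: an event that
 * arrives while the ctrl is CONNECTING is silently dropped. */
#include <stdio.h>

enum ctrl_state { CTRL_NEW, CTRL_LIVE, CTRL_CONNECTING };

static void connectivity_loss_before_patch(enum ctrl_state state)
{
	switch (state) {
	case CTRL_NEW:
	case CTRL_LIVE:
		printf("schedule reset/reconnect\n");
		break;
	case CTRL_CONNECTING:
	default:
		/* Dropped: the in-flight connect attempt is trusted to
		 * notice the failure on its own, which it cannot once
		 * the I/O queue setup has exited early (see below). */
		break;
	}
}

int main(void)
{
	connectivity_loss_before_patch(CTRL_CONNECTING); /* no output */
	return 0;
}
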
> nvme_fc_create_association continues to run in parallel and tries to
> communicate with the controller; these commands fail, but the errors
> are filtered out. For example, at 3) setting the number of I/O queues
> fails, which leads to an early exit from nvme_fc_create_io_queues.
> Because the number of I/O queues is 0 at this point, nothing is left
> in nvme_fc_create_association that could detect the connection drop.
> Thus the ctrl enters the LIVE state at 4).
> 
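The filtering at 3) comes from the core's queue count helper, which
reports an NVMe-level command failure as success with a count of 0 so
that a controller without I/O queues can still come up. A user-space
sketch of that behaviour (the function name and structure are a
simplified model, not the kernel source):

/* Model of the error filtering at 3): a failed Set Features (number
 * of queues) command is logged and turned into "success" with zero
 * I/O queues, so the caller sees no error to act on. */
#include <stdio.h>

static int set_queue_count_model(int nvme_status, int *count)
{
	if (nvme_status < 0)
		return nvme_status;	/* transport error: propagated */
	if (nvme_status > 0) {
		printf("Could not set queue count (%d)\n", nvme_status);
		*count = 0;		/* NVMe status: filtered out */
	}
	return 0;			/* caller sees success */
}

int main(void)
{
	int count = 128;

	/* 880 is the status from the log excerpt above */
	if (set_queue_count_model(880, &count) == 0 && count == 0)
		printf("0 I/O queues: queue creation exits early\n");
	return 0;
}
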
> Eventually, the keep-alive handler times out at 5), but because no
> recovery is triggered, the ctrl stays in the LIVE state.
> 
> The ASSOC_FAILED flag already exists to track connectivity loss
> events, but this bit is set too late in the recovery code path. Move
> setting it into the connectivity loss event handler and synchronize it
> with the state change. This ensures that the ASSOC_FAILED flag is seen
> by nvme_fc_create_association and the ctrl does not enter the LIVE
> state after a connectivity loss event. If the connectivity loss event
> happens after the ctrl has entered the LIVE state, the normal error
> recovery path is executed.
> 
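The resulting locking scheme can be modelled in user space as follows
(a sketch of the patched logic only; a pthread mutex stands in for
ctrl->lock, and the names are illustrative):

/* Model of the synchronization the patch introduces: "set flag + read
 * state" and "test flag + change state" take the same lock, so they
 * are atomic with respect to each other. */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

enum ctrl_state { CTRL_CONNECTING, CTRL_LIVE };

static pthread_mutex_t ctrl_lock = PTHREAD_MUTEX_INITIALIZER;
static enum ctrl_state state = CTRL_CONNECTING;
static bool assoc_failed;

/* Models nvme_fc_ctrl_connectivity_loss() after the patch. */
static void connectivity_loss(void)
{
	enum ctrl_state st;

	pthread_mutex_lock(&ctrl_lock);
	assoc_failed = true;
	st = state;
	pthread_mutex_unlock(&ctrl_lock);

	if (st == CTRL_LIVE)
		printf("LIVE: run the normal error recovery path\n");
	/* CONNECTING: nothing to do here; the connect path will
	 * observe assoc_failed under the same lock and fail. */
}

/* Models the end of nvme_fc_create_association() after the patch. */
static int create_association_tail(void)
{
	int ret = 0;

	pthread_mutex_lock(&ctrl_lock);
	if (!assoc_failed)
		state = CTRL_LIVE;
	else
		ret = -1;	/* stands in for -EIO */
	pthread_mutex_unlock(&ctrl_lock);

	return ret;
}

int main(void)
{
	connectivity_loss();	/* loss arrives while still CONNECTING */
	if (create_association_tail() != 0)
		printf("connect fails, attempt will be retried\n");
	return 0;
}

Without the shared lock, the loss event could set the flag between the
flag test and the state change, and the ctrl would still reach LIVE
with nothing left to recheck the flag.
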
> Signed-off-by: Daniel Wagner <wagi@...nel.org>
> ---
>   drivers/nvme/host/fc.c | 23 ++++++++++++++++++-----
>   1 file changed, 18 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
> index 7409da42b9ee580cdd6fe78c0f93e78c4ad08675..55884d3df6f291cfddb4742e135b54a72f1cfa05 100644
> --- a/drivers/nvme/host/fc.c
> +++ b/drivers/nvme/host/fc.c
> @@ -781,11 +781,19 @@ nvme_fc_abort_lsops(struct nvme_fc_rport *rport)
>   static void
>   nvme_fc_ctrl_connectivity_loss(struct nvme_fc_ctrl *ctrl)
>   {
> +	enum nvme_ctrl_state state;
> +	unsigned long flags;
> +
>   	dev_info(ctrl->ctrl.device,
>   		"NVME-FC{%d}: controller connectivity lost. Awaiting "
>   		"Reconnect", ctrl->cnum);
>   
> -	switch (nvme_ctrl_state(&ctrl->ctrl)) {
> +	spin_lock_irqsave(&ctrl->lock, flags);
> +	set_bit(ASSOC_FAILED, &ctrl->flags);
> +	state = nvme_ctrl_state(&ctrl->ctrl);
> +	spin_unlock_irqrestore(&ctrl->lock, flags);
> +
> +	switch (state) {
>   	case NVME_CTRL_NEW:
>   	case NVME_CTRL_LIVE:
>   		/*
> @@ -2542,7 +2550,6 @@ nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl, char *errmsg)
>   	 */
>   	if (ctrl->ctrl.state == NVME_CTRL_CONNECTING) {
>   		__nvme_fc_abort_outstanding_ios(ctrl, true);
> -		set_bit(ASSOC_FAILED, &ctrl->flags);
>   		dev_warn(ctrl->ctrl.device,
>   			"NVME-FC{%d}: transport error during (re)connect\n",
>   			ctrl->cnum);
> @@ -3167,12 +3174,18 @@ nvme_fc_create_association(struct nvme_fc_ctrl *ctrl)
>   		else
>   			ret = nvme_fc_recreate_io_queues(ctrl);
>   	}
> -	if (!ret && test_bit(ASSOC_FAILED, &ctrl->flags))
> -		ret = -EIO;
>   	if (ret)
>   		goto out_term_aen_ops;
>   
> -	changed = nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_LIVE);
> +	spin_lock_irqsave(&ctrl->lock, flags);
> +	if (!test_bit(ASSOC_FAILED, &ctrl->flags))
> +		changed = nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_LIVE);
> +	else
> +		ret = -EIO;
> +	spin_unlock_irqrestore(&ctrl->lock, flags);
> +
> +	if (ret)
> +		goto out_term_aen_ops;
>   
>   	ctrl->ctrl.nr_reconnects = 0;
>   
> 
Reviewed-by: Hannes Reinecke <hare@...e.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@...e.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
