Message-ID: <5befc95c-b66a-4dd8-bb72-7cc6839c7c4b@grimberg.me>
Date: Sat, 27 Dec 2025 12:35:23 +0200
From: Sagi Grimberg <sagi@...mberg.me>
To: Mohamed Khalfella <mkhalfella@...estorage.com>,
Chaitanya Kulkarni <kch@...dia.com>, Christoph Hellwig <hch@....de>,
Jens Axboe <axboe@...nel.dk>, Keith Busch <kbusch@...nel.org>
Cc: Aaron Dailey <adailey@...estorage.com>,
Randy Jennings <randyj@...estorage.com>, John Meneghini
<jmeneghi@...hat.com>, Hannes Reinecke <hare@...e.de>,
linux-nvme@...ts.infradead.org, linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 10/14] nvme-tcp: Use CCR to recover controller that
hits an error
On 26/11/2025 4:11, Mohamed Khalfella wrote:
> A live nvme controller that hits an error will now move to RECOVERING
> state instead of RESETTING state. In RECOVERING state ctrl->err_work
> will attempt to use cross-controller recovery to terminate inflight IOs
> on the controller. If CCR succeeds, then switch to RESETTING state and
> continue error recovery as usual by tearing down the controller and
> attempting to reconnect to the target. If CCR fails, the behavior of recovery
> depends on whether CQT is supported or not. If CQT is supported, switch
> to time-based recovery by holding inflight IOs until it is safe for them
> to be retried. If CQT is not supported proceed to retry requests
> immediately, as the code currently does.
>
> To support implementing time-based recovery turn ctrl->err_work into
> delayed work. Update nvme_tcp_timeout() to not complete inflight IOs
> while the controller is in RECOVERING state.
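So the intended flow, as I read it (correct me if I got it wrong), is:

    error -> RECOVERING, kick err_work
      CCR succeeds       -> RESETTING, teardown + reconnect as before
      CCR fails, CQT     -> hold inflight I/Os until the CQT window
                            expires, then RESETTING
      CCR fails, no CQT  -> retry inflight I/Os immediately (current
                            behavior)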
>
> Signed-off-by: Mohamed Khalfella <mkhalfella@...estorage.com>
> ---
> drivers/nvme/host/tcp.c | 52 +++++++++++++++++++++++++++++++++++------
> 1 file changed, 45 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 9a96df1a511c..ec9a713490a9 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -193,7 +193,7 @@ struct nvme_tcp_ctrl {
> struct sockaddr_storage src_addr;
> struct nvme_ctrl ctrl;
>
> - struct work_struct err_work;
> + struct delayed_work err_work;
> struct delayed_work connect_work;
> struct nvme_tcp_request async_req;
> u32 io_queues[HCTX_MAX_TYPES];
> @@ -611,11 +611,12 @@ static void nvme_tcp_init_recv_ctx(struct nvme_tcp_queue *queue)
>
> static void nvme_tcp_error_recovery(struct nvme_ctrl *ctrl)
> {
> - if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
> + if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RECOVERING) &&
> + !nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
This warrants an explanation. It is not at all clear why we should allow
two different state transitions to start error recovery...
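At a minimum a comment would help. Something like the below is my guess
at the intent (please correct me if I'm reading it wrong):

	/*
	 * Fence the controller via CCR first (RECOVERING); if that
	 * transition is not allowed from the current state, fall back
	 * to the existing RESETTING path.  (reviewer's guess at the
	 * intent, needs confirmation from the author)
	 */
	if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RECOVERING) &&
	    !nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
		return;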
> return;
>
> dev_warn(ctrl->device, "starting error recovery\n");
> - queue_work(nvme_reset_wq, &to_tcp_ctrl(ctrl)->err_work);
> + queue_delayed_work(nvme_reset_wq, &to_tcp_ctrl(ctrl)->err_work, 0);
> }
>
> static int nvme_tcp_process_nvme_cqe(struct nvme_tcp_queue *queue,
> @@ -2470,12 +2471,48 @@ static void nvme_tcp_reconnect_ctrl_work(struct work_struct *work)
> nvme_tcp_reconnect_or_remove(ctrl, ret);
> }
>
> +static int nvme_tcp_recover_ctrl(struct nvme_ctrl *ctrl)
> +{
> + unsigned long rem;
> +
> + if (test_and_clear_bit(NVME_CTRL_RECOVERED, &ctrl->flags)) {
> + dev_info(ctrl->device, "completed time-based recovery\n");
> + goto done;
> + }
This is also not clear: why should we get here when NVME_CTRL_RECOVERED
is set?
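If the answer is that err_work re-arms itself after setting the bit
(that is how I read the queue_delayed_work() call further down), then a
comment making that explicit would help, e.g.:

	/*
	 * Second invocation of the re-armed err_work: the CQT-based
	 * recovery window has expired and inflight I/Os are now safe
	 * to complete.  (my reading of the flow, may be wrong)
	 */
	if (test_and_clear_bit(NVME_CTRL_RECOVERED, &ctrl->flags)) {
		dev_info(ctrl->device, "completed time-based recovery\n");
		goto done;
	}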
> +
> + rem = nvme_recover_ctrl(ctrl);
> + if (!rem)
> + goto done;
> +
> + if (!ctrl->cqt) {
> + dev_info(ctrl->device,
> + "CCR failed, CQT not supported, skip time-based recovery\n");
> + goto done;
> + }
> +
> + dev_info(ctrl->device,
> + "CCR failed, switch to time-based recovery, timeout = %ums\n",
> + jiffies_to_msecs(rem));
> + set_bit(NVME_CTRL_RECOVERED, &ctrl->flags);
> + queue_delayed_work(nvme_reset_wq, &to_tcp_ctrl(ctrl)->err_work, rem);
> + return -EAGAIN;
I don't think that reusing the same work item to handle two completely
different things is the right approach here.
How about splitting it into fence_work and err_work? That should
eliminate some of the ctrl state inspections and simplify error
recovery.
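Roughly something like the below (completely untested sketch; the
fence_work name and the split itself are just my suggestion, with
nvme_tcp_recover_ctrl() re-arming fence_work instead of err_work):

static void nvme_tcp_fence_ctrl_work(struct work_struct *work)
{
	struct nvme_tcp_ctrl *tcp_ctrl = container_of(to_delayed_work(work),
			struct nvme_tcp_ctrl, fence_work);
	struct nvme_ctrl *ctrl = &tcp_ctrl->ctrl;

	/* CCR/CQT fencing only; re-arms itself for the CQT window. */
	if (nvme_tcp_recover_ctrl(ctrl))
		return;

	/* Fencing done, hand off to the unmodified err_work. */
	queue_work(nvme_reset_wq, &tcp_ctrl->err_work);
}

Then err_work can stay a plain work_struct and
nvme_tcp_error_recovery_work() does not need to inspect the ctrl state
at all.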
> +
> +done:
> + nvme_end_ctrl_recovery(ctrl);
> + return 0;
> +}
> +
> static void nvme_tcp_error_recovery_work(struct work_struct *work)
> {
> - struct nvme_tcp_ctrl *tcp_ctrl = container_of(work,
> + struct nvme_tcp_ctrl *tcp_ctrl = container_of(to_delayed_work(work),
> struct nvme_tcp_ctrl, err_work);
> struct nvme_ctrl *ctrl = &tcp_ctrl->ctrl;
>
> + if (nvme_ctrl_state(ctrl) == NVME_CTRL_RECOVERING) {
> + if (nvme_tcp_recover_ctrl(ctrl))
> + return;
> + }
> +
Yea, I think we want to rework the current design.