Message-ID: <5befc95c-b66a-4dd8-bb72-7cc6839c7c4b@grimberg.me>
Date: Sat, 27 Dec 2025 12:35:23 +0200
From: Sagi Grimberg <sagi@...mberg.me>
To: Mohamed Khalfella <mkhalfella@...estorage.com>,
Chaitanya Kulkarni <kch@...dia.com>, Christoph Hellwig <hch@....de>,
Jens Axboe <axboe@...nel.dk>, Keith Busch <kbusch@...nel.org>
Cc: Aaron Dailey <adailey@...estorage.com>,
Randy Jennings <randyj@...estorage.com>, John Meneghini
<jmeneghi@...hat.com>, Hannes Reinecke <hare@...e.de>,
linux-nvme@...ts.infradead.org, linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 10/14] nvme-tcp: Use CCR to recover controller that
hits an error
On 26/11/2025 4:11, Mohamed Khalfella wrote:
> A live nvme controller that hits an error will now move to RECOVERING
> state instead of RESETTING state. In RECOVERING state ctrl->err_work
> will attempt to use cross-controller recovery to terminate inflight IOs
> on the controller. If CCR succeeds, then switch to RESETTING state and
> continue error recovery as usual by tearing down the controller and
> attempting to reconnect to the target. If CCR fails, the behavior of recovery
> depends on whether CQT is supported or not. If CQT is supported, switch
> to time-based recovery by holding inflight IOs until it is safe for them
> to be retried. If CQT is not supported proceed to retry requests
> immediately, as the code currently does.
>
> To support implementing time-based recovery turn ctrl->err_work into
> delayed work. Update nvme_tcp_timeout() to not complete inflight IOs
> while the controller is in RECOVERING state.
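So the intended flow, as I read it (correct me if I got it wrong), is:

    error -> RECOVERING, kick err_work
      CCR succeeds       -> RESETTING, teardown + reconnect as before
      CCR fails, CQT     -> hold inflight I/Os until the CQT window
                            expires, then RESETTING
      CCR fails, no CQT  -> retry inflight I/Os immediately (current
                            behavior)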
>
> Signed-off-by: Mohamed Khalfella <mkhalfella@...estorage.com>
> ---
> drivers/nvme/host/tcp.c | 52 +++++++++++++++++++++++++++++++++++------
> 1 file changed, 45 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 9a96df1a511c..ec9a713490a9 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -193,7 +193,7 @@ struct nvme_tcp_ctrl {
> struct sockaddr_storage src_addr;
> struct nvme_ctrl ctrl;
>
> - struct work_struct err_work;
> + struct delayed_work err_work;
> struct delayed_work connect_work;
> struct nvme_tcp_request async_req;
> u32 io_queues[HCTX_MAX_TYPES];
> @@ -611,11 +611,12 @@ static void nvme_tcp_init_recv_ctx(struct nvme_tcp_queue *queue)
>
> static void nvme_tcp_error_recovery(struct nvme_ctrl *ctrl)
> {
> - if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
> + if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RECOVERING) &&
> + !nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
This warrants an explanation. It is not at all clear why we should allow
two different state transitions to start error recovery...
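At a minimum a comment would help. Something like the below is my guess
at the intent (please correct me if I'm reading it wrong):

	/*
	 * Fence the controller via CCR first (RECOVERING); if that
	 * transition is not allowed from the current state, fall back
	 * to the existing RESETTING path.  (reviewer's guess at the
	 * intent, needs confirmation from the author)
	 */
	if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RECOVERING) &&
	    !nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
		return;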
> return;
>
> dev_warn(ctrl->device, "starting error recovery\n");
> - queue_work(nvme_reset_wq, &to_tcp_ctrl(ctrl)->err_work);
> + queue_delayed_work(nvme_reset_wq, &to_tcp_ctrl(ctrl)->err_work, 0);
> }
>
> static int nvme_tcp_process_nvme_cqe(struct nvme_tcp_queue *queue,
> @@ -2470,12 +2471,48 @@ static void nvme_tcp_reconnect_ctrl_work(struct work_struct *work)
> nvme_tcp_reconnect_or_remove(ctrl, ret);
> }
>
> +static int nvme_tcp_recover_ctrl(struct nvme_ctrl *ctrl)
> +{
> + unsigned long rem;
> +
> + if (test_and_clear_bit(NVME_CTRL_RECOVERED, &ctrl->flags)) {
> + dev_info(ctrl->device, "completed time-based recovery\n");
> + goto done;
> + }
This is also not clear: why should we get here when NVME_CTRL_RECOVERED
is set?
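If the answer is that err_work re-arms itself after setting the bit
(that is how I read the queue_delayed_work() call further down), then a
comment making that explicit would help, e.g.:

	/*
	 * Second invocation of the re-armed err_work: the CQT-based
	 * recovery window has expired and inflight I/Os are now safe
	 * to complete.  (my reading of the flow, may be wrong)
	 */
	if (test_and_clear_bit(NVME_CTRL_RECOVERED, &ctrl->flags)) {
		dev_info(ctrl->device, "completed time-based recovery\n");
		goto done;
	}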
> +
> + rem = nvme_recover_ctrl(ctrl);
> + if (!rem)
> + goto done;
> +
> + if (!ctrl->cqt) {
> + dev_info(ctrl->device,
> + "CCR failed, CQT not supported, skip time-based recovery\n");
> + goto done;
> + }
> +
> + dev_info(ctrl->device,
> + "CCR failed, switch to time-based recovery, timeout = %ums\n",
> + jiffies_to_msecs(rem));
> + set_bit(NVME_CTRL_RECOVERED, &ctrl->flags);
> + queue_delayed_work(nvme_reset_wq, &to_tcp_ctrl(ctrl)->err_work, rem);
> + return -EAGAIN;
I don't think that reusing the same work item to handle two completely
different things is the right approach here.
How about splitting it into fence_work and err_work? That should
eliminate some of the ctrl state inspections and simplify error
recovery.
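Roughly something like the below (completely untested sketch; the
fence_work name and the split itself are just my suggestion, with
nvme_tcp_recover_ctrl() re-arming fence_work instead of err_work):

static void nvme_tcp_fence_ctrl_work(struct work_struct *work)
{
	struct nvme_tcp_ctrl *tcp_ctrl = container_of(to_delayed_work(work),
			struct nvme_tcp_ctrl, fence_work);
	struct nvme_ctrl *ctrl = &tcp_ctrl->ctrl;

	/* CCR/CQT fencing only; re-arms itself for the CQT window. */
	if (nvme_tcp_recover_ctrl(ctrl))
		return;

	/* Fencing done, hand off to the unmodified err_work. */
	queue_work(nvme_reset_wq, &tcp_ctrl->err_work);
}

Then err_work can stay a plain work_struct and
nvme_tcp_error_recovery_work() does not need to inspect the ctrl state
at all.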
> +
> +done:
> + nvme_end_ctrl_recovery(ctrl);
> + return 0;
> +}
> +
> static void nvme_tcp_error_recovery_work(struct work_struct *work)
> {
> - struct nvme_tcp_ctrl *tcp_ctrl = container_of(work,
> + struct nvme_tcp_ctrl *tcp_ctrl = container_of(to_delayed_work(work),
> struct nvme_tcp_ctrl, err_work);
> struct nvme_ctrl *ctrl = &tcp_ctrl->ctrl;
>
> + if (nvme_ctrl_state(ctrl) == NVME_CTRL_RECOVERING) {
> + if (nvme_tcp_recover_ctrl(ctrl))
> + return;
> + }
> +
Yea, I think we want to rework the current design.