Message-ID: <20260101002723.GS3864520-mkhalfella@purestorage.com>
Date: Wed, 31 Dec 2025 16:27:23 -0800
From: Mohamed Khalfella <mkhalfella@...estorage.com>
To: Sagi Grimberg <sagi@...mberg.me>
Cc: Chaitanya Kulkarni <kch@...dia.com>, Christoph Hellwig <hch@....de>,
Jens Axboe <axboe@...nel.dk>, Keith Busch <kbusch@...nel.org>,
Aaron Dailey <adailey@...estorage.com>,
Randy Jennings <randyj@...estorage.com>,
John Meneghini <jmeneghi@...hat.com>,
Hannes Reinecke <hare@...e.de>, linux-nvme@...ts.infradead.org,
linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 10/14] nvme-tcp: Use CCR to recover controller that
hits an error
On Sat 2025-12-27 12:35:23 +0200, Sagi Grimberg wrote:
>
>
> On 26/11/2025 4:11, Mohamed Khalfella wrote:
> > An alive nvme controller that hits an error now will move to RECOVERING
> > state instead of RESETTING state. In RECOVERING state ctrl->err_work
> > will attempt to use cross-controller recovery to terminate inflight IOs
> > on the controller. If CCR succeeds, then switch to RESETTING state and
> > continue error recovery as usual by tearing down the controller and
> > attempting to reconnect to the target. If CCR fails, then the behavior
> > of recovery depends on whether CQT is supported or not. If CQT is
> > supported, switch to time-based recovery by holding inflight IOs until
> > it is safe for them to be retried. If CQT is not supported, proceed to
> > retry requests immediately, as the code currently does.
> >
> > To support implementing time-based recovery, turn ctrl->err_work into
> > a delayed work. Update nvme_tcp_timeout() to not complete inflight IOs
> > while the controller is in the RECOVERING state.
> >
> > Signed-off-by: Mohamed Khalfella <mkhalfella@...estorage.com>
> > ---
> > drivers/nvme/host/tcp.c | 52 +++++++++++++++++++++++++++++++++++------
> > 1 file changed, 45 insertions(+), 7 deletions(-)
> >
> > diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> > index 9a96df1a511c..ec9a713490a9 100644
> > --- a/drivers/nvme/host/tcp.c
> > +++ b/drivers/nvme/host/tcp.c
> > @@ -193,7 +193,7 @@ struct nvme_tcp_ctrl {
> > struct sockaddr_storage src_addr;
> > struct nvme_ctrl ctrl;
> >
> > - struct work_struct err_work;
> > + struct delayed_work err_work;
> > struct delayed_work connect_work;
> > struct nvme_tcp_request async_req;
> > u32 io_queues[HCTX_MAX_TYPES];
> > @@ -611,11 +611,12 @@ static void nvme_tcp_init_recv_ctx(struct nvme_tcp_queue *queue)
> >
> > static void nvme_tcp_error_recovery(struct nvme_ctrl *ctrl)
> > {
> > - if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
> > + if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RECOVERING) &&
> > + !nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
>
> This warrants an explanation. It is not clear at all why we should allow
> two different transitions to allow error recovery to start...
The behavior of ctrl->err_work depends on the controller state. We go to
RECOVERING only if the controller is LIVE. Otherwise, we attempt to go to
RESETTING.
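To make the intent more concrete, and assuming the state machine only
permits the LIVE -> RECOVERING transition (so the first call can only
succeed for an alive controller), the chained transitions above are
roughly equivalent to the following sketch (ignoring races where the
state changes between the two calls):

static void nvme_tcp_error_recovery(struct nvme_ctrl *ctrl)
{
	if (nvme_ctrl_state(ctrl) == NVME_CTRL_LIVE) {
		/* Alive controller: fence inflight IOs via CCR/CQT first */
		if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RECOVERING))
			return;
	} else {
		/* Not LIVE (e.g. still connecting): take the plain reset path */
		if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
			return;
	}

	dev_warn(ctrl->device, "starting error recovery\n");
	queue_delayed_work(nvme_reset_wq, &to_tcp_ctrl(ctrl)->err_work, 0);
}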
>
> > return;
> >
> > dev_warn(ctrl->device, "starting error recovery\n");
> > - queue_work(nvme_reset_wq, &to_tcp_ctrl(ctrl)->err_work);
> > + queue_delayed_work(nvme_reset_wq, &to_tcp_ctrl(ctrl)->err_work, 0);
> > }
> >
> > static int nvme_tcp_process_nvme_cqe(struct nvme_tcp_queue *queue,
> > @@ -2470,12 +2471,48 @@ static void nvme_tcp_reconnect_ctrl_work(struct work_struct *work)
> > nvme_tcp_reconnect_or_remove(ctrl, ret);
> > }
> >
> > +static int nvme_tcp_recover_ctrl(struct nvme_ctrl *ctrl)
> > +{
> > + unsigned long rem;
> > +
> > + if (test_and_clear_bit(NVME_CTRL_RECOVERED, &ctrl->flags)) {
> > + dev_info(ctrl->device, "completed time-based recovery\n");
> > + goto done;
> > + }
>
> This is also not clear, why should we get here when NVME_CTRL_RECOVERED
> is set?
The NVME_CTRL_RECOVERED flag is set before scheduling ctrl->err_work as
delayed work. This is how time-based recovery is implemented. We get here
when ctrl->err_work runs for the second time, and at that point we know
it is safe to reset the controller and cancel inflight requests.
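To spell out the intended flow (a timeline sketch, not verbatim from the
patch):

/*
 * err_work run #1 (controller in RECOVERING state):
 *   nvme_tcp_recover_ctrl()
 *     nvme_recover_ctrl() -> CCR fails, rem > 0, ctrl->cqt is set
 *     set_bit(NVME_CTRL_RECOVERED, &ctrl->flags)
 *     queue_delayed_work(nvme_reset_wq, &err_work, rem)
 *     return -EAGAIN       -> err_work bails out, inflight IOs stay held
 *
 * err_work run #2, rem jiffies later:
 *   nvme_tcp_recover_ctrl()
 *     test_and_clear_bit(NVME_CTRL_RECOVERED, &ctrl->flags) -> true
 *     nvme_end_ctrl_recovery(), return 0
 *   err_work then proceeds with the usual teardown, and the requests it
 *   cancels are now safe to be retried on another path.
 */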
> > +
> > + rem = nvme_recover_ctrl(ctrl);
> > + if (!rem)
> > + goto done;
> > +
> > + if (!ctrl->cqt) {
> > + dev_info(ctrl->device,
> > + "CCR failed, CQT not supported, skip time-based recovery\n");
> > + goto done;
> > + }
> > +
> > + dev_info(ctrl->device,
> > + "CCR failed, switch to time-based recovery, timeout = %ums\n",
> > + jiffies_to_msecs(rem));
> > + set_bit(NVME_CTRL_RECOVERED, &ctrl->flags);
> > + queue_delayed_work(nvme_reset_wq, &to_tcp_ctrl(ctrl)->err_work, rem);
> > + return -EAGAIN;
>
> I don't think that reusing the same work to handle two completely
> different things is the right approach here.
>
> How about splitting to fence_work and err_work? That should eliminate
> some of the ctrl state inspections and simplify error recovery.
>
> > +
> > +done:
> > + nvme_end_ctrl_recovery(ctrl);
> > + return 0;
> > +}
> > +
> > static void nvme_tcp_error_recovery_work(struct work_struct *work)
> > {
> > - struct nvme_tcp_ctrl *tcp_ctrl = container_of(work,
> > + struct nvme_tcp_ctrl *tcp_ctrl = container_of(to_delayed_work(work),
> > struct nvme_tcp_ctrl, err_work);
> > struct nvme_ctrl *ctrl = &tcp_ctrl->ctrl;
> >
> > + if (nvme_ctrl_state(ctrl) == NVME_CTRL_RECOVERING) {
> > + if (nvme_tcp_recover_ctrl(ctrl))
> > + return;
> > + }
> > +
>
> Yea, I think we want to rework the current design.
Good point. Splitting out ctrl->fence_work simplifies things. The if
condition above will be moved to fence_work. However, we will still need
to reschedule ctrl->fence_work from within itself to implement time-based
recovery. Is this a good option? A rough sketch of this option is
included below.
If not, and we prefer to drop the NVME_CTRL_RECOVERED flag above and not
reschedule ctrl->fence_work from within itself, then we can add another
ctrl->fenced_work. How about that?
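Something along these lines for the first option (a rough sketch only;
nvme_tcp_fence_ctrl_work() and the tcp_ctrl->fence_work member are made
up for illustration, and err_work goes back to being a plain
work_struct):

static void nvme_tcp_fence_ctrl_work(struct work_struct *work)
{
	struct nvme_tcp_ctrl *tcp_ctrl = container_of(to_delayed_work(work),
				struct nvme_tcp_ctrl, fence_work);
	struct nvme_ctrl *ctrl = &tcp_ctrl->ctrl;
	unsigned long rem;

	if (!test_and_clear_bit(NVME_CTRL_RECOVERED, &ctrl->flags)) {
		rem = nvme_recover_ctrl(ctrl);
		if (rem && ctrl->cqt) {
			/* CCR failed, wait out CQT before retrying IOs */
			set_bit(NVME_CTRL_RECOVERED, &ctrl->flags);
			queue_delayed_work(nvme_reset_wq,
					   &tcp_ctrl->fence_work, rem);
			return;
		}
	}

	nvme_end_ctrl_recovery(ctrl);
	/* Fencing done (or skipped), hand over to the normal error path */
	queue_work(nvme_reset_wq, &tcp_ctrl->err_work);
}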