Message-ID: <20260101002723.GS3864520-mkhalfella@purestorage.com>
Date: Wed, 31 Dec 2025 16:27:23 -0800
From: Mohamed Khalfella <mkhalfella@...estorage.com>
To: Sagi Grimberg <sagi@...mberg.me>
Cc: Chaitanya Kulkarni <kch@...dia.com>, Christoph Hellwig <hch@....de>,
Jens Axboe <axboe@...nel.dk>, Keith Busch <kbusch@...nel.org>,
Aaron Dailey <adailey@...estorage.com>,
Randy Jennings <randyj@...estorage.com>,
John Meneghini <jmeneghi@...hat.com>,
Hannes Reinecke <hare@...e.de>, linux-nvme@...ts.infradead.org,
linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 10/14] nvme-tcp: Use CCR to recover controller that
hits an error
On Sat 2025-12-27 12:35:23 +0200, Sagi Grimberg wrote:
>
>
> On 26/11/2025 4:11, Mohamed Khalfella wrote:
> > An alive nvme controller that hits an error now will move to RECOVERING
> > state instead of RESETTING state. In RECOVERING state ctrl->err_work
> > will attempt to use cross-controller recovery to terminate inflight IOs
> > on the controller. If CCR succeeds, then switch to RESETTING state and
> > continue error recovery as usual by tearing down the controller and
> > attempting to reconnect to the target. If CCR fails, then the behavior
> > of recovery depends on whether CQT is supported or not. If CQT is
> > supported, switch to time-based recovery by holding inflight IOs until
> > it is safe for them to be retried. If CQT is not supported, proceed to
> > retry requests immediately, as the code currently does.
> >
> > To support implementing time-based recovery, turn ctrl->err_work into
> > a delayed work. Update nvme_tcp_timeout() to not complete inflight IOs
> > while the controller is in the RECOVERING state.
> >
> > Signed-off-by: Mohamed Khalfella <mkhalfella@...estorage.com>
> > ---
> > drivers/nvme/host/tcp.c | 52 +++++++++++++++++++++++++++++++++++------
> > 1 file changed, 45 insertions(+), 7 deletions(-)
> >
> > diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> > index 9a96df1a511c..ec9a713490a9 100644
> > --- a/drivers/nvme/host/tcp.c
> > +++ b/drivers/nvme/host/tcp.c
> > @@ -193,7 +193,7 @@ struct nvme_tcp_ctrl {
> > struct sockaddr_storage src_addr;
> > struct nvme_ctrl ctrl;
> >
> > - struct work_struct err_work;
> > + struct delayed_work err_work;
> > struct delayed_work connect_work;
> > struct nvme_tcp_request async_req;
> > u32 io_queues[HCTX_MAX_TYPES];
> > @@ -611,11 +611,12 @@ static void nvme_tcp_init_recv_ctx(struct nvme_tcp_queue *queue)
> >
> > static void nvme_tcp_error_recovery(struct nvme_ctrl *ctrl)
> > {
> > - if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
> > + if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RECOVERING) &&
> > + !nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
>
> This warrants an explanation. It is not clear at all why we should allow
> two different transitions to allow error recovery to start...
The behavior of ctrl->err_work depends on the controller state. We go to
RECOVERING only if the controller is LIVE. Otherwise, we attempt to go to
RESETTING.
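To make the intent more concrete, and assuming the state machine only
permits the LIVE -> RECOVERING transition (so the first call can only
succeed for an alive controller), the chained transitions above are
roughly equivalent to the following sketch (ignoring races where the
state changes between the two calls):

static void nvme_tcp_error_recovery(struct nvme_ctrl *ctrl)
{
	if (nvme_ctrl_state(ctrl) == NVME_CTRL_LIVE) {
		/* Alive controller: fence inflight IOs via CCR/CQT first */
		if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RECOVERING))
			return;
	} else {
		/* Not LIVE (e.g. still connecting): take the plain reset path */
		if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
			return;
	}

	dev_warn(ctrl->device, "starting error recovery\n");
	queue_delayed_work(nvme_reset_wq, &to_tcp_ctrl(ctrl)->err_work, 0);
}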
>
> > return;
> >
> > dev_warn(ctrl->device, "starting error recovery\n");
> > - queue_work(nvme_reset_wq, &to_tcp_ctrl(ctrl)->err_work);
> > + queue_delayed_work(nvme_reset_wq, &to_tcp_ctrl(ctrl)->err_work, 0);
> > }
> >
> > static int nvme_tcp_process_nvme_cqe(struct nvme_tcp_queue *queue,
> > @@ -2470,12 +2471,48 @@ static void nvme_tcp_reconnect_ctrl_work(struct work_struct *work)
> > nvme_tcp_reconnect_or_remove(ctrl, ret);
> > }
> >
> > +static int nvme_tcp_recover_ctrl(struct nvme_ctrl *ctrl)
> > +{
> > + unsigned long rem;
> > +
> > + if (test_and_clear_bit(NVME_CTRL_RECOVERED, &ctrl->flags)) {
> > + dev_info(ctrl->device, "completed time-based recovery\n");
> > + goto done;
> > + }
>
> This is also not clear, why should we get here when NVME_CTRL_RECOVERED
> is set?
The NVME_CTRL_RECOVERED flag is set before scheduling ctrl->err_work as
delayed work. This is how time-based recovery is implemented. We get here
when ctrl->err_work runs for the second time, and at that point we know
it is safe to reset the controller and cancel inflight requests.
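To spell out the intended flow (a timeline sketch, not verbatim from the
patch):

/*
 * err_work run #1 (controller in RECOVERING state):
 *   nvme_tcp_recover_ctrl()
 *     nvme_recover_ctrl() -> CCR fails, rem > 0, ctrl->cqt is set
 *     set_bit(NVME_CTRL_RECOVERED, &ctrl->flags)
 *     queue_delayed_work(nvme_reset_wq, &err_work, rem)
 *     return -EAGAIN       -> err_work bails out, inflight IOs stay held
 *
 * err_work run #2, rem jiffies later:
 *   nvme_tcp_recover_ctrl()
 *     test_and_clear_bit(NVME_CTRL_RECOVERED, &ctrl->flags) -> true
 *     nvme_end_ctrl_recovery(), return 0
 *   err_work then proceeds with the usual teardown, and the requests it
 *   cancels are now safe to be retried on another path.
 */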
> > +
> > + rem = nvme_recover_ctrl(ctrl);
> > + if (!rem)
> > + goto done;
> > +
> > + if (!ctrl->cqt) {
> > + dev_info(ctrl->device,
> > + "CCR failed, CQT not supported, skip time-based recovery\n");
> > + goto done;
> > + }
> > +
> > + dev_info(ctrl->device,
> > + "CCR failed, switch to time-based recovery, timeout = %ums\n",
> > + jiffies_to_msecs(rem));
> > + set_bit(NVME_CTRL_RECOVERED, &ctrl->flags);
> > + queue_delayed_work(nvme_reset_wq, &to_tcp_ctrl(ctrl)->err_work, rem);
> > + return -EAGAIN;
>
> I don't think that reusing the same work to handle two completely
> different things is the right approach here.
>
> How about splitting to fence_work and err_work? That should eliminate
> some of the ctrl state inspections and simplify error recovery.
>
> > +
> > +done:
> > + nvme_end_ctrl_recovery(ctrl);
> > + return 0;
> > +}
> > +
> > static void nvme_tcp_error_recovery_work(struct work_struct *work)
> > {
> > - struct nvme_tcp_ctrl *tcp_ctrl = container_of(work,
> > + struct nvme_tcp_ctrl *tcp_ctrl = container_of(to_delayed_work(work),
> > struct nvme_tcp_ctrl, err_work);
> > struct nvme_ctrl *ctrl = &tcp_ctrl->ctrl;
> >
> > + if (nvme_ctrl_state(ctrl) == NVME_CTRL_RECOVERING) {
> > + if (nvme_tcp_recover_ctrl(ctrl))
> > + return;
> > + }
> > +
>
> Yea, I think we want to rework the current design.
Good point. Splitting out ctrl->fence_work simplifies things. The if
condition above will be moved to fence_work. However, we will still need
to reschedule ctrl->fence_work from within itself to implement time-based
recovery. Is this a good option? A rough sketch of this option is
included below.
If not, and we prefer to drop the NVME_CTRL_RECOVERED flag above and not
reschedule ctrl->fence_work from within itself, then we can add another
ctrl->fenced_work. How about that?
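Something along these lines for the first option (a rough sketch only;
nvme_tcp_fence_ctrl_work() and the tcp_ctrl->fence_work member are made
up for illustration, and err_work goes back to being a plain
work_struct):

static void nvme_tcp_fence_ctrl_work(struct work_struct *work)
{
	struct nvme_tcp_ctrl *tcp_ctrl = container_of(to_delayed_work(work),
				struct nvme_tcp_ctrl, fence_work);
	struct nvme_ctrl *ctrl = &tcp_ctrl->ctrl;
	unsigned long rem;

	if (!test_and_clear_bit(NVME_CTRL_RECOVERED, &ctrl->flags)) {
		rem = nvme_recover_ctrl(ctrl);
		if (rem && ctrl->cqt) {
			/* CCR failed, wait out CQT before retrying IOs */
			set_bit(NVME_CTRL_RECOVERED, &ctrl->flags);
			queue_delayed_work(nvme_reset_wq,
					   &tcp_ctrl->fence_work, rem);
			return;
		}
	}

	nvme_end_ctrl_recovery(ctrl);
	/* Fencing done (or skipped), hand over to the normal error path */
	queue_work(nvme_reset_wq, &tcp_ctrl->err_work);
}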