Message-ID: <20260210013901.GE2392949-mkhalfella@purestorage.com>
Date: Mon, 9 Feb 2026 17:39:01 -0800
From: Mohamed Khalfella <mkhalfella@...estorage.com>
To: Hannes Reinecke <hare@...e.de>
Cc: Justin Tee <justin.tee@...adcom.com>,
	Naresh Gottumukkala <nareshgottumukkala83@...il.com>,
	Paul Ely <paul.ely@...adcom.com>,
	Chaitanya Kulkarni <kch@...dia.com>, Christoph Hellwig <hch@....de>,
	Jens Axboe <axboe@...nel.dk>, Keith Busch <kbusch@...nel.org>,
	Sagi Grimberg <sagi@...mberg.me>,
	Aaron Dailey <adailey@...estorage.com>,
	Randy Jennings <randyj@...estorage.com>,
	Dhaval Giani <dgiani@...estorage.com>,
	linux-nvme@...ts.infradead.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2 10/14] nvme-tcp: Use CCR to recover controller that
 hits an error

On Wed 2026-02-04 03:57:20 +0100, Hannes Reinecke wrote:
> On 2/3/26 22:24, Mohamed Khalfella wrote:
> > On Tue 2026-02-03 06:34:51 +0100, Hannes Reinecke wrote:
> >> On 1/30/26 23:34, Mohamed Khalfella wrote:
> >>> A live nvme controller that hits an error now moves to FENCING
> >>> state instead of RESETTING state. ctrl->fencing_work attempts CCR to
> >>> terminate inflight IOs. If CCR succeeds, switch to FENCED -> RESETTING
> >>> and continue error recovery as usual. If CCR fails, the behavior depends
> >>> on whether the subsystem supports CQT. If CQT is not supported,
> >>> reset the controller immediately as if CCR succeeded, in order to
> >>> maintain the current behavior. If CQT is supported, switch to time-based
> >>> recovery and schedule ctrl->fenced_work to reset the controller when
> >>> time-based recovery finishes.
> >>>
> >>> Either ctrl->err_work or ctrl->reset_work can run after a controller is
> >>> fenced. Flush fencing work when either work runs.
> >>>
> >>> Signed-off-by: Mohamed Khalfella <mkhalfella@...estorage.com>
> >>> ---
> >>>    drivers/nvme/host/tcp.c | 62 ++++++++++++++++++++++++++++++++++++++++-
> >>>    1 file changed, 61 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> >>> index 69cb04406b47..af8d3b36a4bb 100644
> >>> --- a/drivers/nvme/host/tcp.c
> >>> +++ b/drivers/nvme/host/tcp.c
> >>> @@ -193,6 +193,8 @@ struct nvme_tcp_ctrl {
> >>>    	struct sockaddr_storage src_addr;
> >>>    	struct nvme_ctrl	ctrl;
> >>>    
> >>> +	struct work_struct	fencing_work;
> >>> +	struct delayed_work	fenced_work;
> >>>    	struct work_struct	err_work;
> >>>    	struct delayed_work	connect_work;
> >>>    	struct nvme_tcp_request async_req;
> >>> @@ -611,6 +613,12 @@ static void nvme_tcp_init_recv_ctx(struct nvme_tcp_queue *queue)
> >>>    
> >>>    static void nvme_tcp_error_recovery(struct nvme_ctrl *ctrl)
> >>>    {
> >>> +	if (nvme_change_ctrl_state(ctrl, NVME_CTRL_FENCING)) {
> >>> +		dev_warn(ctrl->device, "starting controller fencing\n");
> >>> +		queue_work(nvme_wq, &to_tcp_ctrl(ctrl)->fencing_work);
> >>> +		return;
> >>> +	}
> >>> +
> >>
> >> Don't you need to flush any outstanding 'fenced_work' queue items here
> >> before calling 'queue_work()'?
> > 
> > I do not think we need to flush ctrl->fencing_work. It cannot be running
> > at this time.
> > 
> Hmm. If you say so ... I'd rather make sure here.
> These things have a habit of popping up unexpectedly.

A LIVE controller that hits an error queues ctrl->fencing_work.
fencing_work does CCR. If CCR succeeds, it queues ctrl->err_work; if CCR
fails, it queues ctrl->fenced_work. ctrl->err_work flushes both
fencing_work and fenced_work before doing anything. If the controller is
reset just after the FENCING->FENCED transition but before
FENCED->RESETTING, then reset_work runs instead; ctrl->reset_work also
flushes both works.

In all cases these works are flushed before the controller goes to
CONNECTING state. That means that by the time they are queued again they
cannot still be running.
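
To summarize the flow implemented in this patch:

  LIVE --error--> FENCING (queue fencing_work)
  fencing_work, CCR succeeds:      FENCED -> RESETTING, queue err_work
  fencing_work, CCR fails, no CQT: FENCED -> RESETTING, queue err_work
  fencing_work, CCR fails, CQT:    queue fenced_work after the time budget
  fenced_work:                     FENCED -> RESETTING, queue err_work
  err_work / reset_work:           flush fencing_work and fenced_work first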

> 
> >>
> >>>    	if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
> >>>    		return;
> >>>    
> >>> @@ -2470,12 +2478,59 @@ static void nvme_tcp_reconnect_ctrl_work(struct work_struct *work)
> >>>    	nvme_tcp_reconnect_or_remove(ctrl, ret);
> >>>    }
> >>>    
> >>> +static void nvme_tcp_fenced_work(struct work_struct *work)
> >>> +{
> >>> +	struct nvme_tcp_ctrl *tcp_ctrl = container_of(to_delayed_work(work),
> >>> +					struct nvme_tcp_ctrl, fenced_work);
> >>> +	struct nvme_ctrl *ctrl = &tcp_ctrl->ctrl;
> >>> +
> >>> +	nvme_change_ctrl_state(ctrl, NVME_CTRL_FENCED);
> >>> +	if (nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
> >>> +		queue_work(nvme_reset_wq, &tcp_ctrl->err_work);
> >>> +}
> >>> +
> >>> +static void nvme_tcp_fencing_work(struct work_struct *work)
> >>> +{
> >>> +	struct nvme_tcp_ctrl *tcp_ctrl = container_of(work,
> >>> +			struct nvme_tcp_ctrl, fencing_work);
> >>> +	struct nvme_ctrl *ctrl = &tcp_ctrl->ctrl;
> >>> +	unsigned long rem;
> >>> +
> >>> +	rem = nvme_fence_ctrl(ctrl);
> >>> +	if (!rem)
> >>> +		goto done;
> >>> +
> >>> +	if (!ctrl->cqt) {
> >>> +		dev_info(ctrl->device,
> >>> +			 "CCR failed, CQT not supported, skip time-based recovery\n");
> >>> +		goto done;
> >>> +	}
> >>> +
> >>
> >> As mentioned, cqt handling should be part of another patchset.
> > 
> > Let us suppose we drop cqt from this patchset
> > 
> > - How will we be able to calculate CCR time budget?
> >    Currently it is calculated by nvme_fence_timeout_ms()
> > 
> The CCR time budget is calculated from the current KATO value.
> CQT is the time a controller requires _after_ KATO expires
> to clear out commands.

So it is the same formula as in nvme_fence_timeout_ms(), with cqt=0?
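
For reference, my mental model of the budget is the sketch below. This is
not the exact code in this series, just an illustration; it assumes
ctrl->kato is in seconds and that ctrl->cqt has already been converted to
milliseconds:

	/* Hypothetical sketch: allow KATO for CCR itself, plus the
	 * controller's CQT grace period after KATO expires. With
	 * cqt == 0 this collapses to plain KATO. */
	static unsigned long nvme_fence_timeout_ms(struct nvme_ctrl *ctrl)
	{
		return ctrl->kato * 1000UL + ctrl->cqt;
	}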

> 
> > - What should we do if CCR fails? Retry requests immediately?
> > 
> No. If CCR fails we should fall back to error handling.
> In our case it would mean we let the command timeout handler
> initiate a controller reset.

Assuming we are not implementing the CQT delay, then we should just
retry requests after CCR finishes, regardless of CCR success or
failure. What else can we do if we do not want to hold requests?

What is being suggested is to transition the controller to FENCED state
and leave it there. The timeout callback handler should then trigger
error recovery, which transitions to RESETTING as usual. Did I
understand this part correctly?

If this is correct, then why wait for another timeout, which can be 60
seconds for admin requests? We know the controller should be reset, so I
think we should do what the code does now and proceed with the reset.

> _Eventually_ (ie after we implemented CCR _and_ CQT) we would
> wait for KATO (+ CQT) to expire and _then_ reset the controller.

This is what has been implemented in this patchset, right?

> 
> >>> +	dev_info(ctrl->device,
> >>> +		 "CCR failed, switch to time-based recovery, timeout = %ums\n",
> >>> +		 jiffies_to_msecs(rem));
> >>> +	queue_delayed_work(nvme_wq, &tcp_ctrl->fenced_work, rem);
> >>> +	return;
> >>> +
> >>
> >> Why do you need the 'fenced' workqueue at all? All it does is queue yet
> >> another workqueue item, which certainly can be done from the 'fencing'
> >> workqueue directly, no?
> > 
> > It is possible to drop ctrl->fenced_work and requeue ctrl->fencing_work
> > as delayed work to implement request hold time. If we do that then we
> > need to modify nvme_tcp_fencing_work() to tell if it is being called for
> > 'fencing' or 'fenced'. The first version of this patch used a controller
> > flag RECOVERED for that and it has been suggested to use a separate work
> > to simplify the logic and drop the controller flag.
> > 
> But that's just because you integrated CCR within the current error 
> recovery.
> If you were to implement CCR so that it is invoked from
> nvme_keep_alive_end_io() we would not need to touch that,
> and we would need just one workqueue.
> 
> Let me see if I can draft up a patch.

I look forward to seeing this draft patch.
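
For reference, the single-work variant described above would look roughly
like the sketch below. This is not the v1 code, just an illustration; it
assumes fencing_work is declared as a delayed_work and reuses the
hypothetical RECOVERED controller flag from v1:

	static void nvme_tcp_fencing_work(struct work_struct *work)
	{
		struct nvme_tcp_ctrl *tcp_ctrl = container_of(to_delayed_work(work),
				struct nvme_tcp_ctrl, fencing_work);
		struct nvme_ctrl *ctrl = &tcp_ctrl->ctrl;
		unsigned long rem;

		/* RECOVERED set means this is the re-queued invocation:
		 * time-based recovery has finished, go straight to FENCED. */
		if (!test_and_clear_bit(NVME_CTRL_RECOVERED, &ctrl->flags)) {
			rem = nvme_fence_ctrl(ctrl);
			if (rem && ctrl->cqt) {
				/* CCR failed; hold requests and run again
				 * once the time budget expires. */
				set_bit(NVME_CTRL_RECOVERED, &ctrl->flags);
				queue_delayed_work(nvme_wq,
						&tcp_ctrl->fencing_work, rem);
				return;
			}
		}

		nvme_change_ctrl_state(ctrl, NVME_CTRL_FENCED);
		if (nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
			queue_work(nvme_reset_wq, &tcp_ctrl->err_work);
	}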

> 
> >>
> >>> +done:
> >>> +	nvme_change_ctrl_state(ctrl, NVME_CTRL_FENCED);
> >>> +	if (nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
> >>> +		queue_work(nvme_reset_wq, &tcp_ctrl->err_work);
> >>> +}
> >>> +
> >>> +static void nvme_tcp_flush_fencing_work(struct nvme_ctrl *ctrl)
> >>> +{
> >>> +	flush_work(&to_tcp_ctrl(ctrl)->fencing_work);
> >>> +	flush_delayed_work(&to_tcp_ctrl(ctrl)->fenced_work);
> >>> +}
> >>> +
> >>>    static void nvme_tcp_error_recovery_work(struct work_struct *work)
> >>>    {
> >>>    	struct nvme_tcp_ctrl *tcp_ctrl = container_of(work,
> >>>    				struct nvme_tcp_ctrl, err_work);
> >>>    	struct nvme_ctrl *ctrl = &tcp_ctrl->ctrl;
> >>>    
> >>> +	nvme_tcp_flush_fencing_work(ctrl);
> >>
> >> Why not 'fenced_work' ?
> > 
> > You mean rename nvme_tcp_flush_fencing_work() to
> > nvme_tcp_flush_fenced_work()?
> > 
> > If yes, then I can do that if you think it makes more sense.
> > 
> Thing is, you have two workqueues dealing with CCR.
> You really need to make sure that _both_ are empty when this
> function is called.

Yes, that is why we call nvme_tcp_flush_fencing_work().

> 
> >>
> >>>    	if (nvme_tcp_key_revoke_needed(ctrl))
> >>>    		nvme_auth_revoke_tls_key(ctrl);
> >>>    	nvme_stop_keep_alive(ctrl);
> >>> @@ -2518,6 +2573,7 @@ static void nvme_reset_ctrl_work(struct work_struct *work)
> >>>    		container_of(work, struct nvme_ctrl, reset_work);
> >>>    	int ret;
> >>>    
> >>> +	nvme_tcp_flush_fencing_work(ctrl);
> >>
> >> Same.
> >>
> >>>    	if (nvme_tcp_key_revoke_needed(ctrl))
> >>>    		nvme_auth_revoke_tls_key(ctrl);
> >>>    	nvme_stop_ctrl(ctrl);
> >>> @@ -2643,13 +2699,15 @@ static enum blk_eh_timer_return nvme_tcp_timeout(struct request *rq)
> >>>    	struct nvme_tcp_cmd_pdu *pdu = nvme_tcp_req_cmd_pdu(req);
> >>>    	struct nvme_command *cmd = &pdu->cmd;
> >>>    	int qid = nvme_tcp_queue_id(req->queue);
> >>> +	enum nvme_ctrl_state state;
> >>>    
> >>>    	dev_warn(ctrl->device,
> >>>    		 "I/O tag %d (%04x) type %d opcode %#x (%s) QID %d timeout\n",
> >>>    		 rq->tag, nvme_cid(rq), pdu->hdr.type, cmd->common.opcode,
> >>>    		 nvme_fabrics_opcode_str(qid, cmd), qid);
> >>>    
> >>> -	if (nvme_ctrl_state(ctrl) != NVME_CTRL_LIVE) {
> >>> +	state = nvme_ctrl_state(ctrl);
> >>> +	if (state != NVME_CTRL_LIVE && state != NVME_CTRL_FENCING) {
> >>
> >> 'FENCED' too, presumably?
> > 
> > I do not think it makes a difference here. FENCED and RESETTING are
> > almost the same states.
> > 
> Yeah, but in FENCED all commands will be aborted, and the same action 
> will be invoked when this if() clause is entered. So you need to avoid
> entering the if() clause to avoid races.

That is okay. A FENCED controller means either CCR succeeded or requests
have been held for time-based recovery. In both cases it should be okay
for these requests to be canceled. The only reason the FENCED state
exists is that we do not want a direct FENCING -> RESETTING transition,
because we do not want anyone to reset the controller while it is being
fenced. Otherwise FENCED = RESETTING.
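
For reference, the variant you are suggesting would read:

	state = nvme_ctrl_state(ctrl);
	if (state != NVME_CTRL_LIVE && state != NVME_CTRL_FENCING &&
	    state != NVME_CTRL_FENCED) {
		/* Resetting, connecting or deleting: complete immediately
		 * so we do not block controller teardown. */
		nvme_tcp_complete_timed_out(rq);
		return BLK_EH_DONE;
	}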

> 
> >>
> >>>    		/*
> >>>    		 * If we are resetting, connecting or deleting we should
> >>>    		 * complete immediately because we may block controller
> >>> @@ -2904,6 +2962,8 @@ static struct nvme_tcp_ctrl *nvme_tcp_alloc_ctrl(struct device *dev,
> >>>    
> >>>    	INIT_DELAYED_WORK(&ctrl->connect_work,
> >>>    			nvme_tcp_reconnect_ctrl_work);
> >>> +	INIT_DELAYED_WORK(&ctrl->fenced_work, nvme_tcp_fenced_work);
> >>> +	INIT_WORK(&ctrl->fencing_work, nvme_tcp_fencing_work);
> >>>    	INIT_WORK(&ctrl->err_work, nvme_tcp_error_recovery_work);
> >>>    	INIT_WORK(&ctrl->ctrl.reset_work, nvme_reset_ctrl_work);
> >>>    
> >>
> >> Here you are calling CCR whenever error recovery is triggered.
> >> This will cause CCR to be sent from a command timeout, which is
> >> technically wrong (CCR should be sent when the KATO timeout expires,
> >> not when a command timeout expires). Both could be vastly different.
> > 
> > KATO is driven by the host. What does "KATO expires" mean here?
> > I think KATO expiry is more applicable to the target, no?
> > 
> KATO expiry (for this implementation) means that nvme_keep_alive_end_io()
> is called with rtt > deadline.
> 
> > A KATO timeout is a signal of an error, i.e. that the target is not
> > reachable or something is wrong with the target?
> > 
> Yes. It explicitly means that the Keep-Alive command was not completed
> within a time period specified by the Keep-Alive timeout value.
> 
> >>
> >> So I'd prefer to have CCR sent whenever the KATO timeout triggers, and
> >> leave the current command timeout mechanism in place.
> > 
> > Assuming we use CCR only when the keep-alive request times out,
> > what should we do when we hit other errors?
> > 
> Leave it. This patchset is _NOT_ about fixing the error handler.
> This patchset is about implementing CCR.
> And CCR is just a step toward fixing the error handler.

If we leave it, then an IO request timeout will cause a controller reset
and inflight IOs will be retried immediately. Is that what we want?

What if we have a target that responds to keepalive commands but not to
IO commands? Should we reset the controller if we hit an IO timeout?

If we should not reset the controller, then what else should we do?

If we should reset the controller, then what should happen to inflight
IOs?

I do not agree that CCR should be triggered only if the keepalive
request times out. I do not see the link between the two.
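
To make sure I understand the alternative being proposed, here is a rough
sketch (not actual code; nvme_start_fencing() is a hypothetical hook into
the FENCING machinery of this series, and the rtt computation mirrors the
existing keep-alive completion in the core):

	static enum rq_end_io_ret nvme_keep_alive_end_io(struct request *rq,
							 blk_status_t status)
	{
		struct nvme_ctrl *ctrl = rq->end_io_data;
		unsigned long rtt = jiffies - (rq->deadline - ctrl->kato * HZ);

		blk_mq_free_request(rq);

		/* KATO expired: fence (CCR) before any reset is attempted. */
		if (status || rtt > ctrl->kato * HZ)
			nvme_start_fencing(ctrl);	/* hypothetical */

		return RQ_END_IO_NONE;
	}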

> 
> > nvme_tcp_error_recovery() is called from many places to handle
> > errors, and it effectively resets the controller. What should this
> > function do, if not trigger CCR?
> > 
> The same things as before. CCR needs to be invoked when the Keep-Alive
> timeout period expires. And that is what we should be implementing here.
> 
> It _might_ (and arguably should) be invoked when error handling needs
> to be done. But with a later patchset.
> 
> Yes, this will result in a patchset where the error handler still has
> gaps. But it will result in a patchset which focusses on a single thing,
> with the added benefit that the workings are clearly outlined in the 
> spec. So the implementation is pretty easy to review.
> 
> Hooking CCR into the error handler itself is _not_ mandated by the spec,
> and inevitably leads to larger discussions (as this thread nicely
> demonstrates). So I would prefer to keep it simple first, and focus
> on the technical implementation of CCR.
> And concentrate on fixing up the error handler once CCR is in.

Let us say CCR is only triggered by the keepalive timeout, we landed CCR
this way, and now it is time to fix the error handler.

What should the error handler do to handle, say, an IO timeout?

Let us say we attempted to cancel the timed-out command and the cancel
command failed. What is the next step in this case?

> 
> Cheers,
> 
> Hannes
> >>
> >> Cheers,
> >>
> >> Hannes
> >> -- 
> >> Dr. Hannes Reinecke                  Kernel Storage Architect
> >> hare@...e.de                                +49 911 74053 688
> >> SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
> >> HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
> 
> 
> -- 
> Dr. Hannes Reinecke                  Kernel Storage Architect
> hare@...e.de                                +49 911 74053 688
> SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
> HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
