Message-ID: <20260204001128.GI3729-mkhalfella@purestorage.com>
Date: Tue, 3 Feb 2026 16:11:28 -0800
From: Mohamed Khalfella <mkhalfella@...estorage.com>
To: James Smart <jsmart833426@...il.com>
Cc: Justin Tee <justin.tee@...adcom.com>,
	Naresh Gottumukkala <nareshgottumukkala83@...il.com>,
	Paul Ely <paul.ely@...adcom.com>,
	Chaitanya Kulkarni <kch@...dia.com>, Christoph Hellwig <hch@....de>,
	Jens Axboe <axboe@...nel.dk>, Keith Busch <kbusch@...nel.org>,
	Sagi Grimberg <sagi@...mberg.me>,
	Aaron Dailey <adailey@...estorage.com>,
	Randy Jennings <randyj@...estorage.com>,
	Dhaval Giani <dgiani@...estorage.com>,
	Hannes Reinecke <hare@...e.de>, linux-nvme@...ts.infradead.org,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2 12/14] nvme-fc: Decouple error recovery from
 controller reset

On Tue 2026-02-03 11:19:28 -0800, James Smart wrote:
> On 1/30/2026 2:34 PM, Mohamed Khalfella wrote:
> > nvme_fc_error_recovery() called from nvme_fc_timeout() while controller
> > in CONNECTING state results in deadlock reported in link below. Update
> > nvme_fc_timeout() to schedule error recovery to avoid the deadlock.
> > 
> > Previous to this change if controller was LIVE error recovery resets
> > the controller and this does not match nvme-tcp and nvme-rdma.
> 
> It is not intended to match tcp/rdma. Using the reset path was done to 
> avoid code duplication of paths to teardown the association.  FC, given 
> we interact with an HBA for device and io state and have a lot of async 
> io completions, requires a lot more work than straight data structure 
> teardown in rdma/tcp.
> 
> I agree with wanting to changeup the execution thread for the deadlock.
> 
> 
> > Decouple error recovery from controller reset to match other fabric
> > transports.
> > 
> > Link: https://lore.kernel.org/all/20250529214928.2112990-1-mkhalfella@purestorage.com/
> > Signed-off-by: Mohamed Khalfella <mkhalfella@...estorage.com>
> > ---
> >   drivers/nvme/host/fc.c | 94 ++++++++++++++++++------------------------
> >   1 file changed, 41 insertions(+), 53 deletions(-)
> > 
> > diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
> > index 6948de3f438a..f8f6071b78ed 100644
> > --- a/drivers/nvme/host/fc.c
> > +++ b/drivers/nvme/host/fc.c
> > @@ -227,6 +227,8 @@ static DEFINE_IDA(nvme_fc_ctrl_cnt);
> >   static struct device *fc_udev_device;
> >   
> >   static void nvme_fc_complete_rq(struct request *rq);
> > +static void nvme_fc_start_ioerr_recovery(struct nvme_fc_ctrl *ctrl,
> > +					 char *errmsg);
> >   
> >   /* *********************** FC-NVME Port Management ************************ */
> >   
> > @@ -788,7 +790,7 @@ nvme_fc_ctrl_connectivity_loss(struct nvme_fc_ctrl *ctrl)
> >   		"Reconnect", ctrl->cnum);
> >   
> >   	set_bit(ASSOC_FAILED, &ctrl->flags);
> > -	nvme_reset_ctrl(&ctrl->ctrl);
> > +	nvme_fc_start_ioerr_recovery(ctrl, "Connectivity Loss");
> >   }
> >   
> >   /**
> > @@ -985,7 +987,7 @@ fc_dma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents,
> >   static void nvme_fc_ctrl_put(struct nvme_fc_ctrl *);
> >   static int nvme_fc_ctrl_get(struct nvme_fc_ctrl *);
> >   
> > -static void nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl, char *errmsg);
> > +static void nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl);
> >   
> >   static void
> >   __nvme_fc_finish_ls_req(struct nvmefc_ls_req_op *lsop)
> > @@ -1567,9 +1569,8 @@ nvme_fc_ls_disconnect_assoc(struct nvmefc_ls_rcv_op *lsop)
> >   	 * for the association have been ABTS'd by
> >   	 * nvme_fc_delete_association().
> >   	 */
> > -
> > -	/* fail the association */
> > -	nvme_fc_error_recovery(ctrl, "Disconnect Association LS received");
> > +	nvme_fc_start_ioerr_recovery(ctrl,
> > +				     "Disconnect Association LS received");
> >   
> >   	/* release the reference taken by nvme_fc_match_disconn_ls() */
> >   	nvme_fc_ctrl_put(ctrl);
> > @@ -1871,7 +1872,7 @@ nvme_fc_ctrl_ioerr_work(struct work_struct *work)
> >   	struct nvme_fc_ctrl *ctrl =
> >   			container_of(work, struct nvme_fc_ctrl, ioerr_work);
> >   
> > -	nvme_fc_error_recovery(ctrl, "transport detected io error");
> > +	nvme_fc_error_recovery(ctrl);
> 
> hmm.. not sure how I feel about this. There is at least a break in reset 
> processing that is no longer present - e.g. prior queued ioerr_work, 
> which would then queue reset_work. This effectively calls the reset_work 
> handler directly. I assume it should be ok.
> 
> >   }
> >   
> >   /*
> > @@ -1892,6 +1893,17 @@ char *nvme_fc_io_getuuid(struct nvmefc_fcp_req *req)
> >   }
> >   EXPORT_SYMBOL_GPL(nvme_fc_io_getuuid);
> >   
> > +static void nvme_fc_start_ioerr_recovery(struct nvme_fc_ctrl *ctrl,
> > +					 char *errmsg)
> > +{
> > +	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING))
> > +		return;
> > +
> > +	dev_warn(ctrl->ctrl.device, "NVME-FC{%d}: starting error recovery %s\n",
> > +		 ctrl->cnum, errmsg);
> > +	queue_work(nvme_reset_wq, &ctrl->ioerr_work);
> > +}
> > +
> 
> Disagree with this.
> 
> The clause in error_recovery around the CONNECTING state is pretty 
> important to terminate io occurring during connect/reconnect where the 
> ctrl state should not change. we don't want start_ioerr making it RESETTING.
> 
> This should be reworked.

Like you pointed out, this changes the current behavior for the
CONNECTING state.

Before this change the controller state stays in CONNECTING while all
IOs are aborted. Aborting the IOs causes nvme_fc_create_association()
to fail and reconnect might be attempted again.

The new behavior switches to RESETTING and queues ctrl->ioerr_work.
ioerr_work will abort outstanding IOs, switch back to CONNECTING, and
attempt reconnect.

nvme_fc_error_recovery() ->
  nvme_stop_keep_alive() /* should not make a difference */
  nvme_stop_ctrl()       /* should be okay to run */
  nvme_fc_delete_association() ->
    __nvme_fc_abort_outstanding_ios(ctrl, false)
    nvme_unquiesce_admin_queue()
    nvme_unquiesce_io_queues()
    nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING)
    if (port_state == ONLINE)
      queue_work(ctrl->connect)
    else
      nvme_fc_reconnect_or_delete();

Yes, this is a different behavior. IMO it is simpler to follow and
closer to what other transports do, keeping in mind the async abort
nature of FC.

Aside from it being different, what is wrong with it?

> 
> >   static void
> >   nvme_fc_fcpio_done(struct nvmefc_fcp_req *req)
> >   {
> > @@ -2049,9 +2061,8 @@ nvme_fc_fcpio_done(struct nvmefc_fcp_req *req)
> >   		nvme_fc_complete_rq(rq);
> >   
> >   check_error:
> > -	if (terminate_assoc &&
> > -	    nvme_ctrl_state(&ctrl->ctrl) != NVME_CTRL_RESETTING)
> > -		queue_work(nvme_reset_wq, &ctrl->ioerr_work);
> > +	if (terminate_assoc)
> > +		nvme_fc_start_ioerr_recovery(ctrl, "io error");
> 
> this is ok. the ioerr_recovery will bounce the RESETTING state if it's 
> already in the state. So this is a little cleaner.
> 
> >   }
> >   
> >   static int
> > @@ -2495,39 +2506,6 @@ __nvme_fc_abort_outstanding_ios(struct nvme_fc_ctrl *ctrl, bool start_queues)
> >   		nvme_unquiesce_admin_queue(&ctrl->ctrl);
> >   }
> >   
> > -static void
> > -nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl, char *errmsg)
> > -{
> > -	enum nvme_ctrl_state state = nvme_ctrl_state(&ctrl->ctrl);
> > -
> > -	/*
> > -	 * if an error (io timeout, etc) while (re)connecting, the remote
> > -	 * port requested terminating of the association (disconnect_ls)
> > -	 * or an error (timeout or abort) occurred on an io while creating
> > -	 * the controller.  Abort any ios on the association and let the
> > -	 * create_association error path resolve things.
> > -	 */
> > -	if (state == NVME_CTRL_CONNECTING) {
> > -		__nvme_fc_abort_outstanding_ios(ctrl, true);
> > -		dev_warn(ctrl->ctrl.device,
> > -			"NVME-FC{%d}: transport error during (re)connect\n",
> > -			ctrl->cnum);
> > -		return;
> > -	}
> 
> This logic needs to be preserved. It's no longer part of 
> nvme_fc_start_ioerr_recovery(). Failures during CONNECTING should not be 
> "fenced". They should fail immediately.

I think this is similar to the point above.

> 
> > -
> > -	/* Otherwise, only proceed if in LIVE state - e.g. on first error */
> > -	if (state != NVME_CTRL_LIVE)
> > -		return;
> 
> This was to filter out multiple requests of the reset. I guess that is 
> what happens now in start_ioerr when attempting to set state to 
> RESETTING and already RESETTING.

Yes. In this case nvme_fc_start_ioerr_recovery() will do nothing.

> 
> There is a small difference here in that the existing code avoids doing 
> the ctrl reset if the controller is NEW. start_ioerr will change the 
> ctrl to RESETTING. I'm not sure how much of an impact that is.
> 

I think there is little done while the controller is in the NEW state.
Let me know if I am missing something.

> 
> > -
> > -	dev_warn(ctrl->ctrl.device,
> > -		"NVME-FC{%d}: transport association event: %s\n",
> > -		ctrl->cnum, errmsg);
> > -	dev_warn(ctrl->ctrl.device,
> > -		"NVME-FC{%d}: resetting controller\n", ctrl->cnum);
> 
> I haven't paid much attention, but keeping the transport messages for 
> these cases is very very useful for diagnosis.
> 
> > -
> > -	nvme_reset_ctrl(&ctrl->ctrl);
> > -}
> > -
> >   static enum blk_eh_timer_return nvme_fc_timeout(struct request *rq)
> >   {
> >   	struct nvme_fc_fcp_op *op = blk_mq_rq_to_pdu(rq);
> > @@ -2536,24 +2514,14 @@ static enum blk_eh_timer_return nvme_fc_timeout(struct request *rq)
> >   	struct nvme_fc_cmd_iu *cmdiu = &op->cmd_iu;
> >   	struct nvme_command *sqe = &cmdiu->sqe;
> >   
> > -	/*
> > -	 * Attempt to abort the offending command. Command completion
> > -	 * will detect the aborted io and will fail the connection.
> > -	 */
> >   	dev_info(ctrl->ctrl.device,
> >   		"NVME-FC{%d.%d}: io timeout: opcode %d fctype %d (%s) w10/11: "
> >   		"x%08x/x%08x\n",
> >   		ctrl->cnum, qnum, sqe->common.opcode, sqe->fabrics.fctype,
> >   		nvme_fabrics_opcode_str(qnum, sqe),
> >   		sqe->common.cdw10, sqe->common.cdw11);
> > -	if (__nvme_fc_abort_op(ctrl, op))
> > -		nvme_fc_error_recovery(ctrl, "io timeout abort failed");
> >   
> > -	/*
> > -	 * the io abort has been initiated. Have the reset timer
> > -	 * restarted and the abort completion will complete the io
> > -	 * shortly. Avoids a synchronous wait while the abort finishes.
> > -	 */
> > +	nvme_fc_start_ioerr_recovery(ctrl, "io timeout");
> 
> Why get rid of the abort logic ?
> Note: the error recovery/controller reset is only called when the abort 
> failed.
> 
> I believe you should continue to abort the op.  The fence logic will 
> kick in when the op completes later (along with other io completions). 
> If nothing else, it allows a hw resource to be freed up.

The abort logic in nvme_fc_timeout() is problematic and it does not
play well with the abort initiated from ioerr_work or reset_work. The
problem is that an op aborted from nvme_fc_timeout() is not accounted
for when the controller is reset.

Here is an example scenario.

The first time a request times out it gets aborted and we see this
codepath:

nvme_fc_timeout() ->
  __nvme_fc_abort_op() ->
    atomic_xchg(&op->state, FCPOP_STATE_ABORTED)
      ops->abort()
        return 0;

nvme_fc_timeout() always returns BLK_EH_RESET_TIMER, so the same
request can time out again. If the same request hits the timeout again
then __nvme_fc_abort_op() returns -ECANCELED and
nvme_fc_error_recovery() gets called. Assuming the controller is LIVE,
it will be reset.

nvme_fc_reset_ctrl_work() ->
  nvme_fc_delete_association() ->
    __nvme_fc_abort_outstanding_ios() ->
      nvme_fc_terminate_exchange() ->
        __nvme_fc_abort_op()

__nvme_fc_abort_op() finds that the op is already aborted. As a result
ctrl->iocnt will not be incremented for this op, which means that
nvme_fc_delete_association() will not wait for this op to be aborted.
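
For context, the iocnt accounting I am referring to lives in
__nvme_fc_abort_op(). Roughly, from my reading of the current
drivers/nvme/host/fc.c (simplified, comments mine; a sketch rather
than an exact copy):

static int
__nvme_fc_abort_op(struct nvme_fc_ctrl *ctrl, struct nvme_fc_fcp_op *op)
{
        unsigned long flags;
        int opstate;

        spin_lock_irqsave(&ctrl->lock, flags);
        /* mark the op aborted, remembering its previous state */
        opstate = atomic_xchg(&op->state, FCPOP_STATE_ABORTED);
        if (opstate != FCPOP_STATE_ACTIVE)
                atomic_set(&op->state, opstate);
        else if (test_bit(FCCTRL_TERMIO, &ctrl->flags)) {
                op->flags |= FCOP_FLAGS_TERMIO;
                /* nvme_fc_delete_association() waits for iocnt to drain */
                ctrl->iocnt++;
        }
        spin_unlock_irqrestore(&ctrl->lock, flags);

        /* already aborted, e.g. earlier from nvme_fc_timeout():
         * return without touching iocnt
         */
        if (opstate != FCPOP_STATE_ACTIVE)
                return -ECANCELED;

        ctrl->lport->ops->fcp_abort(&ctrl->lport->localport,
                                    &ctrl->rport->remoteport,
                                    op->queue->lldd_handle, &op->fcp_req);
        return 0;
}

Only the ACTIVE -> ABORTED transition increments ctrl->iocnt, so an op
that nvme_fc_timeout() already aborted is invisible to the accounting
when __nvme_fc_abort_outstanding_ios() runs later.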

I do not think we want this behavior.

To continue the scenario above: the controller switches to CONNECTING
and the request times out again. This time we hit the deadlock
described in [1].

I think the first abort is the cause of the issue here. With this
change we should not hit the scenario described above.

1 - https://lore.kernel.org/all/20250529214928.2112990-1-mkhalfella@purestorage.com/

> 
> 
> >   	return BLK_EH_RESET_TIMER;
> >   }
> >   
> > @@ -3352,6 +3320,26 @@ nvme_fc_reset_ctrl_work(struct work_struct *work)
> >   	}
> >   }
> >   
> > +static void
> > +nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl)
> > +{
> > +	nvme_stop_keep_alive(&ctrl->ctrl);
> 
> Curious, why did the stop_keep_alive() call get added to this ?
> Doesn't hurt.
> 
> I assume it was due to other transports having it as they originally 
> were calling stop_ctrl, but then moved to stop_keep_alive. Shouldn't 
> this be followed by flush_work(&ctrl->ctrl.async_event_work) ?

Yes. I added it because it matches what other transports do.

nvme_fc_error_recovery() ->
  nvme_fc_delete_association() ->
    nvme_fc_abort_aen_ops() ->
      nvme_fc_term_aen_ops() ->
        cancel_work_sync(&ctrl->ctrl.async_event_work);

The above codepath takes care of async_event_work.

> 
> > +	nvme_stop_ctrl(&ctrl->ctrl);
> > +
> > +	/* will block while waiting for io to terminate */
> > +	nvme_fc_delete_association(ctrl);
> > +
> > +	/* Do not reconnect if controller is being deleted */
> > +	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING))
> > +		return;
> > +
> > +	if (ctrl->rport->remoteport.port_state == FC_OBJSTATE_ONLINE) {
> > +		queue_delayed_work(nvme_wq, &ctrl->connect_work, 0);
> > +		return;
> > +	}
> > +
> > +	nvme_fc_reconnect_or_delete(ctrl, -ENOTCONN);
> > +}
> 
> This code and that in nvme_fc_reset_ctrl_work() need to be collapsed 
> into a common helper function invoked by the 2 routines.  Also addresses 
> the missing flush_delayed work in this routine.
> 

Agreed, nvme_fc_error_recovery() and nvme_fc_reset_ctrl_work() have
common code that can be refactored. However, I do not plan to do this
as part of this change. I will take a look after I get the CCR work
done.

> >   
> >   static const struct nvme_ctrl_ops nvme_fc_ctrl_ops = {
> >   	.name			= "fc",
> 
> 
> -- James
> 
> (new email address. can always reach me at james.smart@...adcom.com as well)
> 
