Message-ID: <8487298a-ea00-4c3e-a882-bfdf97021a1f@gmail.com>
Date: Wed, 4 Feb 2026 16:08:12 -0800
From: James Smart <jsmart833426@...il.com>
To: Mohamed Khalfella <mkhalfella@...estorage.com>
Cc: Justin Tee <justin.tee@...adcom.com>,
Naresh Gottumukkala <nareshgottumukkala83@...il.com>,
Paul Ely <paul.ely@...adcom.com>, Chaitanya Kulkarni <kch@...dia.com>,
Christoph Hellwig <hch@....de>, Jens Axboe <axboe@...nel.dk>,
Keith Busch <kbusch@...nel.org>, Sagi Grimberg <sagi@...mberg.me>,
Aaron Dailey <adailey@...estorage.com>,
Randy Jennings <randyj@...estorage.com>,
Dhaval Giani <dgiani@...estorage.com>, Hannes Reinecke <hare@...e.de>,
linux-nvme@...ts.infradead.org, linux-kernel@...r.kernel.org,
jsmart833426@...il.com
Subject: Re: [PATCH v2 12/14] nvme-fc: Decouple error recovery from controller
reset
On 2/3/2026 4:11 PM, Mohamed Khalfella wrote:
> On Tue 2026-02-03 11:19:28 -0800, James Smart wrote:
>> On 1/30/2026 2:34 PM, Mohamed Khalfella wrote:
...
>>>
>>> +static void nvme_fc_start_ioerr_recovery(struct nvme_fc_ctrl *ctrl,
>>> + char *errmsg)
>>> +{
>>> + if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING))
>>> + return;
>>> + dev_warn(ctrl->ctrl.device, "NVME-FC{%d}: starting error recovery %s\n",
>>> + ctrl->cnum, errmsg);
>>> + queue_work(nvme_reset_wq, &ctrl->ioerr_work);
>>> +}
>>> +
>>
>> Disagree with this.
>>
>> The clause in error_recovery around the CONNECTING state is pretty
>> important to terminate io occurring during connect/reconnect where the
>> ctrl state should not change. we don't want start_ioerr making it RESETTING.
>>
>> This should be reworked.
>
> Like you pointed out this changes the current behavior for CONNECTING
> state.
>
> Before this change, as you pointed out the controller state stays in
> CONNECTING while all IOs are aborted. Aborting the IOs causes
> nvme_fc_create_association() to fail and reconnect might be attempted
> again.
> The new behavior switches to RESETTING and queues ctrl->ioerr_work.
> ioerr_work will abort the outstanding IOs, switch back to CONNECTING, and
> attempt to reconnect.
Well, it won't actually switch to RESETTING, as CONNECTING->RESETTING is
not a valid transition. So things will silently stop in
start_ioerr_recovery when the state transition fails (also a reason I
dislike silent state transition failures).
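Purely to illustrate what I mean about not failing silently - a sketch, and
the wording of the extra warn is mine:

static void
nvme_fc_start_ioerr_recovery(struct nvme_fc_ctrl *ctrl, char *errmsg)
{
	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING)) {
		/* e.g. CONNECTING->RESETTING is refused: at least say so
		 * rather than silently dropping the recovery */
		dev_warn(ctrl->ctrl.device,
			 "NVME-FC{%d}: error recovery not started (%s): state change to RESETTING refused\n",
			 ctrl->cnum, errmsg);
		return;
	}

	dev_warn(ctrl->ctrl.device, "NVME-FC{%d}: starting error recovery %s\n",
		 ctrl->cnum, errmsg);
	queue_work(nvme_reset_wq, &ctrl->ioerr_work);
}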
When I look a little further into patch 13, I see the change to FENCING
added. But that state transition will also fail for CONNECTING->FENCING.
It will then fall into the resetting state change, which will silently
fail, and we're stopped. That tells me there was no consideration or
testing of failures while CONNECTING with this patch set. Even if
RESETTING were allowed, it's injecting a new flow into the code paths.
The CONNECTING issue also applies to tcp and rdma transports. I don't
know if they call the error_recovery routines in the same way.
To be honest I'm not sure I remember the original reasons this loop was
put in, but I do remember the pain I went through when writing it and the
number of test cases that were needed to cover it. It may well be
because I couldn't invoke the reset due to the CONNECTING->RESETTING
block. I'm being pedantic as I still feel residual pain from it.
>
> nvme_fc_error_recovery() ->
>   nvme_stop_keep_alive()          /* should not make a difference */
>   nvme_stop_ctrl()                /* should be okay to run */
>   nvme_fc_delete_association() ->
>     __nvme_fc_abort_outstanding_ios(ctrl, false)
>     nvme_unquiesce_admin_queue()
>     nvme_unquiesce_io_queues()
>   nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING)
>   if (port_state == ONLINE)
>     queue_work(ctrl->connect)
>   else
>     nvme_fc_reconnect_or_delete();
>
> Yes, this is a different behavior. IMO it is simpler to follow and
> closer to what the other transports do, keeping in mind the async abort
> nature of FC.
>
> Aside from being different, what is wrong with it?
See above.
...
>>> static int
>>> @@ -2495,39 +2506,6 @@ __nvme_fc_abort_outstanding_ios(struct nvme_fc_ctrl *ctrl, bool start_queues)
>>> nvme_unquiesce_admin_queue(&ctrl->ctrl);
>>> }
>>>
>>> -static void
>>> -nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl, char *errmsg)
>>> -{
>>> - enum nvme_ctrl_state state = nvme_ctrl_state(&ctrl->ctrl);
>>> -
>>> - /*
>>> - * if an error (io timeout, etc) while (re)connecting, the remote
>>> - * port requested terminating of the association (disconnect_ls)
>>> - * or an error (timeout or abort) occurred on an io while creating
>>> - * the controller. Abort any ios on the association and let the
>>> - * create_association error path resolve things.
>>> - */
>>> - if (state == NVME_CTRL_CONNECTING) {
>>> - __nvme_fc_abort_outstanding_ios(ctrl, true);
>>> - dev_warn(ctrl->ctrl.device,
>>> - "NVME-FC{%d}: transport error during (re)connect\n",
>>> - ctrl->cnum);
>>> - return;
>>> - }
>>
>> This logic needs to be preserved. It's no longer part of
>> nvme_fc_start_ioerr_recovery(). Failures during CONNECTING should not be
>> "fenced". They should fail immediately.
>
> I think this is similar to the point above.
Forgetting whether or not the above "works", what I'm pointing out is
that when in CONNECTING I don't believe you should be enacting the
FENCED state and delaying. For CONNECTING, the cleanup should be
immediate with no delay and no CCR attempt. Only LIVE should transition
to FENCED.
Looking at patch 14, fencing_work calls nvme_fence_ctrl() which
unconditionally delays and tries to do CCR. We only want this if LIVE.
I'll comment on that patch.
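Roughly the gate I'd expect in the fencing path (illustrative only, reusing
helpers already in this patch set):

	if (nvme_ctrl_state(&ctrl->ctrl) != NVME_CTRL_LIVE) {
		/* CONNECTING (or anything else): abort the ios and clean up
		 * immediately - no fence delay, no CCR attempt */
		__nvme_fc_abort_outstanding_ios(ctrl, true);
		return;
	}
	/* LIVE: ok to go FENCED, delay, and attempt CCR */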
>> There is a small difference here in that the existing code avoids doing
>> the ctrl reset if the controller is NEW. start_ioerr will change the
>> ctrl to RESETTING. I'm not sure how much of an impact that is.
>>
>
> I think there is little done while controller in NEW state.
> Let me know if I am missing something.
No - I had to update my understanding; I was really out of date. NEW used
to be the state the initial controller create was done under. Everybody
does it now under CONNECTING.
...
>>> static enum blk_eh_timer_return nvme_fc_timeout(struct request *rq)
>>> {
>>> struct nvme_fc_fcp_op *op = blk_mq_rq_to_pdu(rq);
>>> @@ -2536,24 +2514,14 @@ static enum blk_eh_timer_return nvme_fc_timeout(struct request *rq)
>>> struct nvme_fc_cmd_iu *cmdiu = &op->cmd_iu;
>>> struct nvme_command *sqe = &cmdiu->sqe;
>>>
>>> - /*
>>> - * Attempt to abort the offending command. Command completion
>>> - * will detect the aborted io and will fail the connection.
>>> - */
>>> dev_info(ctrl->ctrl.device,
>>> "NVME-FC{%d.%d}: io timeout: opcode %d fctype %d (%s) w10/11: "
>>> "x%08x/x%08x\n",
>>> ctrl->cnum, qnum, sqe->common.opcode, sqe->fabrics.fctype,
>>> nvme_fabrics_opcode_str(qnum, sqe),
>>> sqe->common.cdw10, sqe->common.cdw11);
>>> - if (__nvme_fc_abort_op(ctrl, op))
>>> - nvme_fc_error_recovery(ctrl, "io timeout abort failed");
>>>
>>> - /*
>>> - * the io abort has been initiated. Have the reset timer
>>> - * restarted and the abort completion will complete the io
>>> - * shortly. Avoids a synchronous wait while the abort finishes.
>>> - */
>>> + nvme_fc_start_ioerr_recovery(ctrl, "io timeout");
>>
>> Why get rid of the abort logic ?
>> Note: the error recovery/controller reset is only called when the abort
>> failed.
>>
>> I believe you should continue to abort the op. The fence logic will
>> kick in when the op completes later (along with other io completions).
>> If nothing else, it allows a hw resource to be freed up.
>
> The abort logic from nvme_fc_timeout() is problematic and it does not
> play well with an abort initiated from ioerr_work or reset_work. The
> problem is that an op aborted from nvme_fc_timeout() is not accounted for
> when the controller is reset.
note: I'll wait to be shown otherwise, but if this were true it would have
been horribly broken for a long time.
>
> Here is an example scenario.
>
> The first time a request times out it gets aborted; we see this codepath:
>
> nvme_fc_timeout() ->
>   __nvme_fc_abort_op() ->
>     atomic_xchg(&op->state, FCPOP_STATE_ABORTED)
>     ops->abort()
>     return 0;
there's more than this in the code:
- it changes the op state to ABORTED, saving the old opstate.
- if the opstate wasn't active, it means something else changed it, and it
  restores the old state (e.g. the aborts for the reset may have hit it).
- if it was active (e.g. the aborts from the reset haven't hit it yet), it
  checks the ctrl flag to see if the controller is being reset and
  tracking io termination (the TERMIO flag) and, if so, increments the
  iocnt. So it is "included" in the reset.
- if the old state was active, it then sends the ABTS.
- if the old state wasn't active (we've been here before or the io was
  terminated by the reset), it returns -ECANCELED, which will cause a
  controller reset to be attempted if there's not already one in progress.
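For reference, a rough paraphrase of that flow (from memory, locking elided,
comments mine - not a verbatim copy of __nvme_fc_abort_op()):

	int opstate = atomic_xchg(&op->state, FCPOP_STATE_ABORTED);

	if (opstate != FCPOP_STATE_ACTIVE) {
		/* something else already transitioned it (e.g. the reset's
		 * terminate path got here first): put the old state back */
		atomic_set(&op->state, opstate);
	} else if (test_bit(FCCTRL_TERMIO, &ctrl->flags)) {
		/* controller reset is tracking io termination: count this op
		 * so delete_association waits for it */
		op->flags |= FCOP_FLAGS_TERMIO;
		ctrl->iocnt++;
	}

	if (opstate != FCPOP_STATE_ACTIVE)
		return -ECANCELED;	/* caller escalates to controller reset */

	/* old state was ACTIVE: send the ABTS via the LLDD's ->fcp_abort() */
	return 0;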
>
> nvme_fc_timeout() always returns BLK_EH_RESET_TIMER so the same request
> can timeout again. If the same request hits timeout again then
> __nvme_fc_abort_op() returns -ECANCELED and nvme_fc_error_recovery()
> gets called. Assuming the controller is LIVE it will be reset.
The normal case is that the timeout generates an ABTS. The ABTS usually
completes quickly, with the io completing and the io callback to iodone,
which sees the abort error status and resets the controller. It's very
typical for the ABTS to complete long before the 2nd EH timer fires.
The abnormal case is when the ABTS takes longer to complete than the 2nd
EH timer. Yes, that forces the controller reset. I am aware that some
arrays will delay the ABTS ACC while they terminate the back end, but
there are also frame drop conditions to consider.
If the controller is already resetting, all of the above is largely n/a.
I see no reason to skip the ABTS and just wait for a 2nd EH timer to fire.
>
> nvme_fc_reset_ctrl_work() ->
>   nvme_fc_delete_association() ->
>     __nvme_fc_abort_outstanding_ios() ->
>       nvme_fc_terminate_exchange() ->
>         __nvme_fc_abort_op()
>
> __nvme_fc_abort_op() finds that the op is already aborted. As a result,
> ctrl->iocnt will not be incremented for this op. This means that
> nvme_fc_delete_association() will not wait for this op to be aborted.
see missing code stmt above.
>
> I do not think we want this behavior.
>
> To continue the scenario above. The controller switches to CONNECTING
> and the request times out again. This time we hit the deadlock described
> in [1].
>
> I think the first abort is the cause of the issue here. with this change
> we should not hit the scenario described above.
>
> 1 - https://lore.kernel.org/all/20250529214928.2112990-1-mkhalfella@purestorage.com/
Something else happened here. You can't get to the CONNECTING state unless
all outstanding io was reaped in delete_association. What is also harder
to understand is how there was an io to time out if they've all been
reaped and the queues haven't been restarted. A timeout on one of the ios
used to instantiate/init the controller, maybe, but it shouldn't have been
one of those in the blk layer.
>
>>
>>
>>> return BLK_EH_RESET_TIMER;
>>> }
>>>
>>> @@ -3352,6 +3320,26 @@ nvme_fc_reset_ctrl_work(struct work_struct *work)
>>> }
>>> }
>>>
>>> +static void
>>> +nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl)
>>> +{
>>> + nvme_stop_keep_alive(&ctrl->ctrl);
>>
>> Curious, why did the stop_keep_alive() call get added to this ?
>> Doesn't hurt.
>>
>> I assume it was due to other transports having it as they originally
>> were calling stop_ctrl, but then moved to stop_keep_alive. Shouldn't
>> this be followed by flush_work(&ctrl->ctrl.async_event_work) ?
>
> Yes. I added it because it matches what other transports do.
>
> nvme_fc_error_recovery() ->
>   nvme_fc_delete_association() ->
>     nvme_fc_abort_aen_ops() ->
>       nvme_fc_term_aen_ops() ->
>         cancel_work_sync(&ctrl->ctrl.async_event_work);
>
> The above codepath takes care of async_event_work.
True, but the flush_works were added to the other transports for a reason,
so I'm guessing timing matters. Waiting until the later term_aen call isn't
great. But I also suspect we haven't had an issue before, and since we do
take care of it in the aen routines, it's likely unneeded now. Ok to add
the stop_keep_alive, but if so, we should keep the flush_work as well.
It's also good to look the same as the other transports.
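i.e. roughly the same shape the other transports use at the top of their
error recovery (a sketch, not copied verbatim from tcp/rdma):

	nvme_stop_keep_alive(&ctrl->ctrl);
	flush_work(&ctrl->ctrl.async_event_work);
	nvme_stop_ctrl(&ctrl->ctrl);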
>
>>
>>> + nvme_stop_ctrl(&ctrl->ctrl);
>>> +
>>> + /* will block while waiting for io to terminate */
>>> + nvme_fc_delete_association(ctrl);
>>> +
>>> + /* Do not reconnect if controller is being deleted */
>>> + if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING))
>>> + return;
>>> +
>>> + if (ctrl->rport->remoteport.port_state == FC_OBJSTATE_ONLINE) {
>>> + queue_delayed_work(nvme_wq, &ctrl->connect_work, 0);
>>> + return;
>>> + }
>>> +
>>> + nvme_fc_reconnect_or_delete(ctrl, -ENOTCONN);
>>> +}
>>
>> This code and that in nvme_fc_reset_ctrl_work() need to be collapsed
>> into a common helper function invoked by the 2 routines. That also
>> addresses the missing flush_delayed_work() in this routine.
>>
>
> Agree, nvme_fc_error_recovery() and nvme_fc_reset_ctrl_work() have
> common code that can be refactored. However, I do not plan to do this as
> part of this change. I will take a look after I get the CCR work done.
Don't put it off. You are adding as much code as the refactoring would be.
Just make the change.
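Rough shape of what I'm asking for (the helper name is just illustrative,
the body is lifted from your nvme_fc_error_recovery() above):

static void
nvme_fc_teardown_and_reconnect(struct nvme_fc_ctrl *ctrl)
{
	/* will block while waiting for io to terminate */
	nvme_fc_delete_association(ctrl);

	/* do not reconnect if the controller is being deleted */
	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING))
		return;

	if (ctrl->rport->remoteport.port_state == FC_OBJSTATE_ONLINE)
		queue_delayed_work(nvme_wq, &ctrl->connect_work, 0);
	else
		nvme_fc_reconnect_or_delete(ctrl, -ENOTCONN);
}

with both nvme_fc_reset_ctrl_work() and nvme_fc_error_recovery() calling it.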
-- james