Message-ID: <20260204001552.GJ3729-mkhalfella@purestorage.com>
Date: Tue, 3 Feb 2026 16:15:52 -0800
From: Mohamed Khalfella <mkhalfella@...estorage.com>
To: James Smart <jsmart833426@...il.com>
Cc: Justin Tee <justin.tee@...adcom.com>,
Naresh Gottumukkala <nareshgottumukkala83@...il.com>,
Paul Ely <paul.ely@...adcom.com>,
Chaitanya Kulkarni <kch@...dia.com>, Christoph Hellwig <hch@....de>,
Jens Axboe <axboe@...nel.dk>, Keith Busch <kbusch@...nel.org>,
Sagi Grimberg <sagi@...mberg.me>,
Aaron Dailey <adailey@...estorage.com>,
Randy Jennings <randyj@...estorage.com>,
Dhaval Giani <dgiani@...estorage.com>,
Hannes Reinecke <hare@...e.de>, linux-nvme@...ts.infradead.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2 12/14] nvme-fc: Decouple error recovery from
controller reset
On Tue 2026-02-03 14:49:01 -0800, James Smart wrote:
> On 2/3/2026 11:19 AM, James Smart wrote:
> > On 1/30/2026 2:34 PM, Mohamed Khalfella wrote:
> ...
> >> static void
> >> nvme_fc_fcpio_done(struct nvmefc_fcp_req *req)
> >> {
> >> @@ -2049,9 +2061,8 @@ nvme_fc_fcpio_done(struct nvmefc_fcp_req *req)
> >> nvme_fc_complete_rq(rq);
> >> check_error:
> >> - if (terminate_assoc &&
> >> - nvme_ctrl_state(&ctrl->ctrl) != NVME_CTRL_RESETTING)
> >> - queue_work(nvme_reset_wq, &ctrl->ioerr_work);
> >> + if (terminate_assoc)
> >> + nvme_fc_start_ioerr_recovery(ctrl, "io error");
> >
> > this is ok. the ioerr_recovery will bounce the RESETTING state if it's
> > already in the state. So this is a little cleaner.
>
> What is problematic here is: if the start_ioerr path includes the
> CONNECTING logic that terminates I/Os, it runs in the LLDD's context
> that called this iodone routine. Not good. In the existing code, the
> LLDD context was swapped to the work queue where error_recovery was called.
nvme_fc_start_ioerr_recovery() does not do the work in the LLDD's context.
It only queues ctrl->ioerr_work, which is similar to the existing code. I
responded to the issue with the CONNECTING state in another email.
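
To make the first point concrete, here is a rough sketch of the shape of
the helper I have in mind (illustrative only; the exact body is in this
patch, and the log message below is just an example):

static void
nvme_fc_start_ioerr_recovery(struct nvme_fc_ctrl *ctrl, char *errmsg)
{
	/*
	 * Sketch only, not the patch code. The point is that nothing
	 * heavy happens in the caller's (LLDD's) context: we just log
	 * the reason and queue ctrl->ioerr_work, so the abort/teardown
	 * runs later from the workqueue.
	 */
	dev_warn(ctrl->ctrl.device,
		 "NVME-FC{%d}: starting error recovery: %s\n",
		 ctrl->cnum, errmsg);

	queue_work(nvme_reset_wq, &ctrl->ioerr_work);
}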
>
> >
> >> }
> >> static int
> >> @@ -2495,39 +2506,6 @@ __nvme_fc_abort_outstanding_ios(struct nvme_fc_ctrl *ctrl, bool start_queues)
> >> nvme_unquiesce_admin_queue(&ctrl->ctrl);
> >> }
> >> -static void
> >> -nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl, char *errmsg)
> >> -{
> >> - enum nvme_ctrl_state state = nvme_ctrl_state(&ctrl->ctrl);
> >> -
> >> - /*
> >> - * if an error (io timeout, etc) while (re)connecting, the remote
> >> - * port requested terminating of the association (disconnect_ls)
> >> - * or an error (timeout or abort) occurred on an io while creating
> >> - * the controller. Abort any ios on the association and let the
> >> - * create_association error path resolve things.
> >> - */
> >> - if (state == NVME_CTRL_CONNECTING) {
> >> - __nvme_fc_abort_outstanding_ios(ctrl, true);
> >> - dev_warn(ctrl->ctrl.device,
> >> - "NVME-FC{%d}: transport error during (re)connect\n",
> >> - ctrl->cnum);
> >> - return;
> >> - }
> >
> > This logic needs to be preserved. It's no longer part of
> > nvme_fc_start_ioerr_recovery(). Failures during CONNECTING should not be
> > "fenced". They should fail immediately.
>
> this logic, if left in start_ioerr_recovery
I think it should be okay to rely on the error recovery path to handle
failures that happen while the controller is in CONNECTING state.
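
Roughly, I would expect the ioerr work handler to end up looking something
like the sketch below (my illustration only, assuming ctrl->ioerr_work runs
a handler of this shape; the fallback to nvme_reset_ctrl() is just a
placeholder for whatever the non-CONNECTING recovery does):

static void
nvme_fc_ioerr_work(struct work_struct *work)
{
	struct nvme_fc_ctrl *ctrl =
		container_of(work, struct nvme_fc_ctrl, ioerr_work);

	/*
	 * Preserve the old nvme_fc_error_recovery() behavior for the
	 * CONNECTING case: abort outstanding I/Os and let the
	 * create_association error path resolve things, instead of
	 * fencing the I/Os behind a full reset.
	 */
	if (nvme_ctrl_state(&ctrl->ctrl) == NVME_CTRL_CONNECTING) {
		__nvme_fc_abort_outstanding_ios(ctrl, true);
		dev_warn(ctrl->ctrl.device,
			 "NVME-FC{%d}: transport error during (re)connect\n",
			 ctrl->cnum);
		return;
	}

	/* Placeholder for the normal (non-CONNECTING) recovery path. */
	nvme_reset_ctrl(&ctrl->ctrl);
}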
>
>
> -- james