linux-kernel - Re: [RFC PATCH 13/14] nvme-fc: Use CCR to recover controller that hits an error

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAPpK+O3n2o7bJWorPZgO2eJFqn3+K0ZRwZwaumjC9WnrrPyCSw@mail.gmail.com>
Date: Fri, 19 Dec 2025 17:21:19 -0800
From: Randy Jennings <randyj@...estorage.com>
To: Mohamed Khalfella <mkhalfella@...estorage.com>
Cc: Chaitanya Kulkarni <kch@...dia.com>, Christoph Hellwig <hch@....de>, Jens Axboe <axboe@...nel.dk>, 
	Keith Busch <kbusch@...nel.org>, Sagi Grimberg <sagi@...mberg.me>, 
	Aaron Dailey <adailey@...estorage.com>, John Meneghini <jmeneghi@...hat.com>, 
	Hannes Reinecke <hare@...e.de>, linux-nvme@...ts.infradead.org, linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 13/14] nvme-fc: Use CCR to recover controller that
 hits an error

On Tue, Nov 25, 2025 at 6:13 PM Mohamed Khalfella
<mkhalfella@...estorage.com> wrote:
>
> An alive nvme controller that hits an error will now move to RECOVERING
> state instead of RESETTING state. In RECOVERING state, ctrl->err_work
> will attempt to use cross-controller recovery to terminate inflight IOs
> on the controller. If CCR succeeds, then switch to RESETTING state and
> continue error recovery as usuall by tearing down the controller, and
> attempting reconnect to target. If CCR fails, the behavior of recovery
"usuall" -> "usual"
"attempt reconnecting" -> "attempting to reconnect"

it would read better with "the" added:
"reconnect to the target"

> depends on whether CQT is supported or not. If CQT is supported, switch
> to time-based recovery by holding inflight IOs until it is safe for them
> to be retried. If CQT is not supported proceed to retry requests
> immediately, as the code currently does.

> diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c

> @@ -1862,11 +1862,48 @@ __nvme_fc_fcpop_chk_teardowns(struct nvme_fc_ctrl *ctrl,

> +static int nvme_fc_recover_ctrl(struct nvme_ctrl *ctrl)

> +       queue_delayed_work(nvme_reset_wq, &to_fc_ctrl(ctrl)->ioerr_work, rem);
Just like nvme_rdma_recover_ctrl,
nvme_fc_recover_ctrl is exactly the same as
nvme_tcp_recover_ctrl. Seems like a core.c function
nvme_recover_ctrl could take a delayed work queue,
unifying the code.

>  nvme_fc_ctrl_ioerr_work(struct work_struct *work)
>  {

> +       if (nvme_ctrl_state(&ctrl->ctrl) == NVME_CTRL_RECOVERING) {
> +               if (nvme_fc_recover_ctrl(&ctrl->ctrl))
> +                       return;
> +       }
>
>         nvme_fc_error_recovery(ctrl);
Inside of nvme_fc_error_recovery(), we call nvme_stop_keep_alive().
The state of the controller should not be LIVE while waiting for
recovery, so I do not think we will succeed in sending keep alives,
but I think this should move to before (or inside of)
nvme_fc_recover_ctrl().  You have replaced all the calls to
nvme_fc_error_recovery() with nvme_fc_start_ioerr_recovery(),
so that might be okay.

Sincerely,
Randy Jennings