linux-kernel - Re: [RFC PATCH 07/14] nvme: Add RECOVERING nvme controller state

Message-ID: <20251225171709.GA8129-mkhalfella@purestorage.com>
Date: Thu, 25 Dec 2025 09:17:09 -0800
From: Mohamed Khalfella <mkhalfella@...estorage.com>
To: Sagi Grimberg <sagi@...mberg.me>
Cc: Chaitanya Kulkarni <kch@...dia.com>, Christoph Hellwig <hch@....de>,
	Jens Axboe <axboe@...nel.dk>, Keith Busch <kbusch@...nel.org>,
	Aaron Dailey <adailey@...estorage.com>,
	Randy Jennings <randyj@...estorage.com>,
	John Meneghini <jmeneghi@...hat.com>,
	Hannes Reinecke <hare@...e.de>, linux-nvme@...ts.infradead.org,
	linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 07/14] nvme: Add RECOVERING nvme controller state

On Thu 2025-12-25 15:29:52 +0200, Sagi Grimberg wrote:
> 
> 
> On 26/11/2025 4:11, Mohamed Khalfella wrote:
> > Add NVME_CTRL_RECOVERING as a new controller state to be used when
> > impacted controller is being recovered. A LIVE controller enters
> > RECOVERING state when an IO error is encountered. While recovering
> > inflight IOs will not be canceled if they timeout. These IOs will be
> > canceled after recovery finishes. Also, while recovering a controller
> > can not be reset or deleted. This is intentional because reset or delete
> > will result in canceling inflight IOs. When recovery finishes, the
> > impacted controller transitions from RECOVERING state to RESETTING state.
> > Reset codepath takes care of queues teardown and inflight requests
> > cancellation.
> 
> Is RECOVERING really capturing the nature of this state? Maybe RESETTLING?
> or QUIESCING?

Naming is hard. QUIESCING sounds better; I will rename it to
QUIESCING.

> 
> >
> > Note, there is no transition from RECOVERING to RESETTING added to
> > nvme_change_ctrl_state(). The reason is that user should not be allowed
> > to reset or delete a controller that is being recovered.
> >
> > Add NVME_CTRL_RECOVERED controller flag. This flag is set on a controller
> > about to schedule delayed work for time based recovery.
> >
> > Signed-off-by: Mohamed Khalfella <mkhalfella@...estorage.com>
> > ---
> >   drivers/nvme/host/core.c  | 10 ++++++++++
> >   drivers/nvme/host/nvme.h  |  2 ++
> >   drivers/nvme/host/sysfs.c |  1 +
> >   3 files changed, 13 insertions(+)
> >
> > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> > index aa007a7b9606..f5b84bc327d3 100644
> > --- a/drivers/nvme/host/core.c
> > +++ b/drivers/nvme/host/core.c
> > @@ -574,6 +574,15 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
> >   			break;
> >   		}
> >   		break;
> > +	case NVME_CTRL_RECOVERING:
> > +		switch (old_state) {
> > +		case NVME_CTRL_LIVE:
> > +			changed = true;
> > +			fallthrough;
> > +		default:
> > +			break;
> > +		}
> > +		break;
> 
> That is a strange transition...

Why is it strange?

We transition to the RECOVERING state only if the controller is LIVE.
That is the state in which we expect to have inflight user IOs that need
to be quiesced by CCR. We do not care about inflight requests in other
states.
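
To make the rule concrete, here is a minimal userspace sketch of it. This
is not the kernel code: the trimmed-down enum and the helper names
can_enter_recovering() and check_transitions() are made up for
illustration, and only the "LIVE is the sole source state" rule is
modeled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, trimmed-down set of states; not the kernel's enum. */
enum ctrl_state {
	CTRL_NEW,
	CTRL_LIVE,
	CTRL_RESETTING,
	CTRL_CONNECTING,
	CTRL_RECOVERING,
};

/*
 * Models only the new case from the patch: RECOVERING can be entered
 * from LIVE and from no other state.
 */
static bool can_enter_recovering(enum ctrl_state old_state)
{
	switch (old_state) {
	case CTRL_LIVE:
		return true;
	default:
		return false;
	}
}

/* Illustrative self-check; call it from a test harness or main(). */
void check_transitions(void)
{
	assert(can_enter_recovering(CTRL_LIVE));
	assert(!can_enter_recovering(CTRL_NEW));
	assert(!can_enter_recovering(CTRL_RESETTING));
	assert(!can_enter_recovering(CTRL_CONNECTING));
	assert(!can_enter_recovering(CTRL_RECOVERING));
}
```

In the patch itself the same check is the NVME_CTRL_RECOVERING case in
nvme_change_ctrl_state(), which sets changed = true only when old_state
is NVME_CTRL_LIVE.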