[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aBJJQoOBhaXj7P36@kbusch-mbp>
Date: Wed, 30 Apr 2025 09:01:06 -0700
From: Keith Busch <kbusch@...nel.org>
To: Daniel Wagner <dwagner@...e.de>
Cc: Guenter Roeck <linux@...ck-us.net>, Hannes Reinecke <hare@...e.de>,
Daniel Wagner <wagi@...nel.org>, Jens Axboe <axboe@...nel.dk>,
Christoph Hellwig <hch@....de>, Sagi Grimberg <sagi@...mberg.me>,
James Smart <james.smart@...adcom.com>,
Shinichiro Kawasaki <shinichiro.kawasaki@....com>,
linux-nvme@...ts.infradead.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 1/2] nvme: only allow entering LIVE from CONNECTING state
On Wed, Apr 30, 2025 at 08:43:04AM +0200, Daniel Wagner wrote:
> On Tue, Apr 29, 2025 at 11:42:25AM -0700, Keith Busch wrote:
> > On Tue, Apr 29, 2025 at 11:23:25AM -0700, Guenter Roeck wrote:
> > > On 4/29/25 11:13, Keith Busch wrote:
> > > > On Mon, Apr 28, 2025 at 03:21:18PM +0200, Hannes Reinecke wrote:
> > > > > > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> > > > > > index b502ac07483b..d3c4eacf607f 100644
> > > > > > --- a/drivers/nvme/host/core.c
> > > > > > +++ b/drivers/nvme/host/core.c
> > > > > > @@ -4493,7 +4493,8 @@ static void nvme_fw_act_work(struct work_struct *work)
> > > > > > msleep(100);
> > > > > > }
> > > > > >
> > > > > > - if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_LIVE))
> > > > > > + if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_CONNECTING) ||
> > > > > > + !nvme_change_ctrl_state(ctrl, NVME_CTRL_LIVE))
> > > > > > return;
> > > > > >
> > > > > > nvme_unquiesce_io_queues(ctrl);
> > > > >
> > > > > I would rather have a separate state for firmware activation.
> > > > > (Ab-)using the 'RESETTING' state here has direct implications
> > > > > with the error handler, as for the error handler 'RESETTING'
> > > > > means that the error handler has been scheduled.
> > > > > Which is not true for firmware activation.
> > > >
> > > > But the point of having firmware activation set the state to RESETTING
> > > > was to fence off error handling from trying to schedule a real reset.
> > > > The fw activation work schedules its own recovery if it times out, but
> > > > we don't want any other recovery action or user requested resets to
> > > > proceed while an activation is still pending.
> > >
> > > Not only that; there are various checks against NVME_CTRL_RESETTING
> > > sprinkled through the code. What is the impact of introducing a new state
> > > without handling all those checks ?
> >
> > Good point, bad things will happen if these checks are not updated to
> > know about the new state. For example, nvme-pci will attempt aborting IO
> > or disabling the controller on a timeout instead of restarting the timer
> > as desired.
> >
> > Can we just revert the commit that prevented the RESETTING -> LIVE
> > transtion for now?
>
> Unfortunately, it will break the FC error handling again(*). The
> simplest fix is right above, add the transition from RESETTING to
> CONNECTING and then to LIVE, IMO.
Gotcha, yes, that looks like the simplest fix for the current release
then. We need to be careful with adding new states, so we can revisit
Hannes' suggestion for 6.16 if we really want to split this.
If you send the simple fix as a formal patch, please add my review and
the "Fixes:" tag.
Reviewed-by: Keith Busch <kbusch@...nel.org>
Powered by blists - more mailing lists