Message-ID: <CAPpK+O2FNnyk5V7L83uMXsFg+yVhDztka+UW5-tKsCVA5TgSFg@mail.gmail.com>
Date: Tue, 3 Feb 2026 16:48:25 -0800
From: Randy Jennings <randyj@...estorage.com>
To: Mohamed Khalfella <mkhalfella@...estorage.com>
Cc: Hannes Reinecke <hare@...e.de>, Justin Tee <justin.tee@...adcom.com>,
Naresh Gottumukkala <nareshgottumukkala83@...il.com>, Paul Ely <paul.ely@...adcom.com>,
Chaitanya Kulkarni <kch@...dia.com>, Christoph Hellwig <hch@....de>, Jens Axboe <axboe@...nel.dk>,
Keith Busch <kbusch@...nel.org>, Sagi Grimberg <sagi@...mberg.me>,
Aaron Dailey <adailey@...estorage.com>, Dhaval Giani <dgiani@...estorage.com>,
linux-nvme@...ts.infradead.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2 10/14] nvme-tcp: Use CCR to recover controller that
hits an error
On Tue, Feb 3, 2026 at 1:24 PM Mohamed Khalfella
<mkhalfella@...estorage.com> wrote:
>
> On Tue 2026-02-03 06:34:51 +0100, Hannes Reinecke wrote:
> > On 1/30/26 23:34, Mohamed Khalfella wrote:
> > > An alive nvme controller that hits an error now will move to FENCING
> > > state instead of RESETTING state. ctrl->fencing_work attempts CCR to
> > > terminate inflight IOs. If CCR succeeds, switch to FENCED -> RESETTING
> > > and continue error recovery as usual. If CCR fails, the behavior depends
> > > on whether the subsystem supports CQT or not. If CQT is not supported
> > > then reset the controller immediately as if CCR succeeded in order to
> > > maintain the current behavior. If CQT is supported switch to time-based
> > > > recovery. Schedule ctrl->fenced_work, which resets the controller when
> > > > time-based recovery finishes.
> > >
> > > Either ctrl->err_work or ctrl->reset_work can run after a controller is
> > > > fenced. Flush fencing work when either work runs.
> > >
> > > Signed-off-by: Mohamed Khalfella <mkhalfella@...estorage.com>
...
> > > Here you are calling CCR whenever error recovery is triggered.
> > > This will cause CCR to be sent from a command timeout, which is
> > > technically wrong (CCR should be sent when the KATO timeout expires,
> > > not when a command timeout expires). Both could be vastly different.
> > > So I'd prefer to have CCR sent whenever the KATO timeout triggers, and
> > > leave the current command timeout mechanism in place.

Hannes, it is not correct that CCR should be sent when the KATO timeout
expires rather than when a command timeout expires.

KATO timeout expiring is what happens on the controller, not the host. The
controller behavior is specified because the host has to know what the
controller will do. The host is free to decide that a connection/controller
association should be abandoned whenever the host wants to. The trigger can
be either a timeout on a keep alive (which is not KATO expiring) or a
timeout on any other command.

But once the host has decided to abandon and tear down the connection/
controller association, it has to make sure no requests are still
outstanding on the controller side. That is done either through CCR or
through time-based recovery.

So, if, after a command timeout, the host decides to cancel/abort the
command, but not tear down the association, we should not trigger a
CCR. But, if we are tearing down the connection (and there are pending
commands), we should trigger CCR (and start time-based recovery).
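
As a rough sketch of that decision (the enum and function names below are
made up for illustration and are not taken from the patch), the inputs are
only whether the host is tearing down the association and whether commands
are still pending. What kind of timeout started error handling is
deliberately not an input:

```c
#include <stdbool.h>

/* Hypothetical names for illustration only; not from the patch series. */
enum host_recovery_action {
	HOST_ABORT_CMD_ONLY,	/* cancel/abort the command, keep the association */
	HOST_TEARDOWN_PLAIN,	/* tear down; nothing outstanding to fence */
	HOST_TEARDOWN_FENCE,	/* tear down; trigger CCR and start time-based recovery */
};

/*
 * Pick the recovery action once the host has decided how to respond to an
 * error. Whether the error was a keep-alive timeout or a timeout on any
 * other command does not matter here.
 */
static enum host_recovery_action
host_pick_recovery(bool teardown_association, unsigned int pending_cmds)
{
	if (!teardown_association)
		return HOST_ABORT_CMD_ONLY;
	if (pending_cmds == 0)
		return HOST_TEARDOWN_PLAIN;
	return HOST_TEARDOWN_FENCE;
}
```

For example, host_pick_recovery(true, 3) selects the fencing path, while
host_pick_recovery(false, 3) leaves the association alone and only the
timed-out command is dealt with.
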
Sincerely,
Randy Jennings