Message-ID: <CAPpK+O02_ua4Rz+uggMKpp4gpgm-JugmKCs9CUECsdV9WuVdig@mail.gmail.com>
Date: Tue, 10 Feb 2026 19:44:20 -0800
From: Randy Jennings <randyj@...estorage.com>
To: Mohamed Khalfella <mkhalfella@...estorage.com>
Cc: Hannes Reinecke <hare@...e.de>, Justin Tee <justin.tee@...adcom.com>,
Naresh Gottumukkala <nareshgottumukkala83@...il.com>, Paul Ely <paul.ely@...adcom.com>,
Chaitanya Kulkarni <kch@...dia.com>, Christoph Hellwig <hch@....de>, Jens Axboe <axboe@...nel.dk>,
Keith Busch <kbusch@...nel.org>, Sagi Grimberg <sagi@...mberg.me>,
Aaron Dailey <adailey@...estorage.com>, Dhaval Giani <dgiani@...estorage.com>,
linux-nvme@...ts.infradead.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2 08/14] nvme: Implement cross-controller reset recovery
On Wed, Feb 4, 2026 at 3:24 PM Mohamed Khalfella
<mkhalfella@...estorage.com> wrote:
>
> On Wed 2026-02-04 02:10:48 +0100, Hannes Reinecke wrote:
> > On 2/3/26 21:00, Mohamed Khalfella wrote:
> > > On Tue 2026-02-03 06:19:51 +0100, Hannes Reinecke wrote:
> > >> On 1/30/26 23:34, Mohamed Khalfella wrote:
> > [ .. ]
> > >>> + timeout = nvme_fence_timeout_ms(ictrl);
> > >>> + dev_info(ictrl->device, "attempting CCR, timeout %lums\n", timeout);
> > >>> +
> > >>> + now = jiffies;
> > >>> + deadline = now + msecs_to_jiffies(timeout);
> > >>> + while (time_before(now, deadline)) {
> > >>> + sctrl = nvme_find_ctrl_ccr(ictrl, min_cntlid);
> > >>> + if (!sctrl) {
> > >>> + /* CCR failed, switch to time-based recovery */
> > >>> + return deadline - now;
> > >>> + }
> > >>> +
> > >>> + ret = nvme_issue_wait_ccr(sctrl, ictrl);
> > >>> + if (!ret) {
> > >>> + dev_info(ictrl->device, "CCR succeeded using %s\n",
> > >>> + dev_name(sctrl->device));
> > >>> + nvme_put_ctrl_ccr(sctrl);
> > >>> + return 0;
> > >>> + }
> > >>> +
> > >>> + /* CCR failed, try another path */
> > >>> + min_cntlid = sctrl->cntlid + 1;
> > >>> + nvme_put_ctrl_ccr(sctrl);
> > >>> + now = jiffies;
> > >>> + }
> > >>
> > >> That will spin until 'deadline' is reached if 'nvme_issue_wait_ccr()'
> > >> returns an error. _And_ if the CCR itself runs into a timeout we would
> > >> never have tried another path (which could have succeeded).
> > >
> > > True. We can do one thing at a time in CCR time budget. Either wait for
> > > CCR to succeed or give up early and try another path. It is a trade off.
> > >
> > Yes. But I guess my point here is that we should differentiate between
> > 'CCR failed to be sent' and 'CCR completed with error'.
> > The logic above treats both the same.
> >
> > >>
> > >> I'd rather rework this loop to open-code 'issue_and_wait()' in the loop,
> > >> and only switch to the next controller if the submission of CCR failed.
> > >> Once that is done we can 'just' wait for completion, as a failure there
> > >> will be after KATO timeout anyway and any subsequent CCR would be pointless.
> > >
> > > If I understood this correctly then we will stick with the first sctrl
> > > that accepts the CCR command. We wait for CCR to complete and give up on
> > > fencing ictrl if CCR operation fails or times out. Did I get this correctly?
> > >
> > Yes.
> > If a CCR could be send but the controller failed to process it something
> > very odd is ongoing, and it's extremely questionable whether a CCR to
> > another controller would be succeeding. That's why I would switch to the
> > next available controller if we could not _send_ the CCR, but would
> > rather wait for KATO if CCR processing returned an error.
> >
> > But the main point is that CCR is a way to _shorten_ the interval
> > (until KATO timeout) until we can start retrying commands.
> > If the controller ran into an error during CCR processing chances
> > are that quite some time has elapsed already, and we might as well
> > wait for KATO instead of retrying with yet another CCR.
>
> Got it. I updated the code to do that.
It is not true that a failed CCR means something odd is going on. In a
tightly-coupled storage HA pair, all the NVMe controllers can hopefully
determine the status of the other NVMe controllers. However, I know of
multiple systems (one of which I care about) where the NVMe controllers
may have no way of determining the state of some other NVMe
controllers. In that case, the log page entry indicates that the CCR
might succeed on some other NVMe controller (and in these systems, I
expect they would not be able to be particularly specific about which
one). The failed CCR will complete quickly, so very little time will
have elapsed. It is important for those systems to retry on another
NVMe controller.
Sincerely,
Randy Jennings