[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20260210232553.GR3729-mkhalfella@purestorage.com>
Date: Tue, 10 Feb 2026 15:25:53 -0800
From: Mohamed Khalfella <mkhalfella@...estorage.com>
To: James Smart <jsmart833426@...il.com>
Cc: Justin Tee <justin.tee@...adcom.com>,
Naresh Gottumukkala <nareshgottumukkala83@...il.com>,
Paul Ely <paul.ely@...adcom.com>,
Chaitanya Kulkarni <kch@...dia.com>, Christoph Hellwig <hch@....de>,
Jens Axboe <axboe@...nel.dk>, Keith Busch <kbusch@...nel.org>,
Sagi Grimberg <sagi@...mberg.me>,
Aaron Dailey <adailey@...estorage.com>,
Randy Jennings <randyj@...estorage.com>,
Dhaval Giani <dgiani@...estorage.com>,
Hannes Reinecke <hare@...e.de>, linux-nvme@...ts.infradead.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2 08/14] nvme: Implement cross-controller reset recovery
On Tue 2026-02-10 14:49:15 -0800, James Smart wrote:
> On 2/10/2026 2:27 PM, Mohamed Khalfella wrote:
> > On Tue 2026-02-10 14:09:27 -0800, James Smart wrote:
> >> On 1/30/2026 2:34 PM, Mohamed Khalfella wrote:
> >> ...
> >>> +unsigned long nvme_fence_ctrl(struct nvme_ctrl *ictrl)
> >>> +{
> >>> + unsigned long deadline, now, timeout;
> >>> + struct nvme_ctrl *sctrl;
> >>> + u32 min_cntlid = 0;
> >>> + int ret;
> >>> +
> >>> + timeout = nvme_fence_timeout_ms(ictrl);
> >>> + dev_info(ictrl->device, "attempting CCR, timeout %lums\n", timeout);
> >>> +
> >>> + now = jiffies;
> >>> + deadline = now + msecs_to_jiffies(timeout);
> >>> + while (time_before(now, deadline)) {
> >>
> >> Q: don't we have something to identify the controller's subsystem
> >> supports CCR before we starting selecting controllers and sending CCR ?
> >>
> >> I would think on older devices that don't support it we should be
> >> skipping this loop. The loop could delay the Time-Based delay without
> >> any CCR.
> >
> > I do not think we have something that identifies CCR support at
> > subsystem level. The spec defines CCRL at the controller level. The loop
> > should not that bad. nvme_find_ctrl_ccr() should return NULL if CCR is
> > not supported and nvme_fence_ctrl() will return immediately.
> >
> >>
> >> -- james
> >>
>
> I would think CCRL on the failed controller would be enough to assume
> the subsystem supports it.
ictrl->ccr_limit is a good indication that subsystem supports CCR. I do
not think it is enough though. I say that for two reasons:
- May be this controller does not support CCR but others do on the same
subsystem. There is nothing prevents subsystem from putting a cap of
CCR at subsytem level.
- May be this controller supports CCR command but not now because all
CCR slots are used now. This can happen in the case of cascading
failure.
>
> I'm not worried about the coding on the host is so bad. It's more the
> multiple paths that must have cmds sent to them and getting error
> responses for unknown cmds (should be responded to ok, but you never
> know) as well as creating conditions for other errors where there will
> be no return for it - e.g. other paths losing connectivity while the ccr
> outstanding, etc. yes, they all have to work, but why bother adding
> these flows to an old controller that would never do CCR ?
If nvme_find_ctrl_ccr() returns a source controller to use then we know
the controller supports CCR and does have an available slot to process
this CCR request. I do not see how this code will send CCR request to an
old controller that does not know about CCR command.
I am not fully opposed against using ictrl->ccr_limit to return early. I
do not see the need for it. If you feel strongly about it I can update
nvme_fence_ctrl() to do so.
Powered by blists - more mailing lists