linux-kernel - Re: [PATCH v2 08/14] nvme: Implement cross-controller reset recovery

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAPpK+O37eh5tKFcos61DKH8pYsFtw-mcWLw+ZVQRJKdNtJCp-g@mail.gmail.com>
Date: Tue, 10 Feb 2026 19:33:18 -0800
From: Randy Jennings <randyj@...estorage.com>
To: Mohamed Khalfella <mkhalfella@...estorage.com>
Cc: James Smart <jsmart833426@...il.com>, Justin Tee <justin.tee@...adcom.com>, 
	Naresh Gottumukkala <nareshgottumukkala83@...il.com>, Paul Ely <paul.ely@...adcom.com>, 
	Chaitanya Kulkarni <kch@...dia.com>, Christoph Hellwig <hch@....de>, Jens Axboe <axboe@...nel.dk>, 
	Keith Busch <kbusch@...nel.org>, Sagi Grimberg <sagi@...mberg.me>, 
	Aaron Dailey <adailey@...estorage.com>, Dhaval Giani <dgiani@...estorage.com>, 
	Hannes Reinecke <hare@...e.de>, linux-nvme@...ts.infradead.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2 08/14] nvme: Implement cross-controller reset recovery

On Tue, Feb 10, 2026 at 4:12 PM Mohamed Khalfella
<mkhalfella@...estorage.com> wrote:
>
> On Tue 2026-02-10 15:25:55 -0800, Mohamed Khalfella wrote:
> > On Tue 2026-02-10 14:49:15 -0800, James Smart wrote:
> > > On 2/10/2026 2:27 PM, Mohamed Khalfella wrote:
> > > > On Tue 2026-02-10 14:09:27 -0800, James Smart wrote:
> > > >> On 1/30/2026 2:34 PM, Mohamed Khalfella wrote:
> > > >> ...
> > > >>> +unsigned long nvme_fence_ctrl(struct nvme_ctrl *ictrl)
> > > >>> +{
> > > >>> +       unsigned long deadline, now, timeout;
> > > >>> +       struct nvme_ctrl *sctrl;
> > > >>> +       u32 min_cntlid = 0;
> > > >>> +       int ret;
> > > >>> +
> > > >>> +       timeout = nvme_fence_timeout_ms(ictrl);
> > > >>> +       dev_info(ictrl->device, "attempting CCR, timeout %lums\n", timeout);
> > > >>> +
> > > >>> +       now = jiffies;
> > > >>> +       deadline = now + msecs_to_jiffies(timeout);
> > > >>> +       while (time_before(now, deadline)) {
> > > >>
> > > >> Q: don't we have something to identify the controller's subsystem
> > > >> supports CCR before we starting selecting controllers and sending CCR ?
> > > >>
> > > >> I would think on older devices that don't support it we should be
> > > >> skipping this loop.   The loop could delay the Time-Based delay without
> > > >> any CCR.
> > > >
> > > > I do not think we have something that identifies CCR support at
> > > > subsystem level. The spec defines CCRL at the controller level. The loop
> > > > should not that bad. nvme_find_ctrl_ccr() should return NULL if CCR is
> > > > not supported and nvme_fence_ctrl() will return immediately.
> > > >
> > > >>
> > > >> -- james
> > > >>
> > >
> > > I would think CCRL on the failed controller would be enough to assume
> > > the subsystem supports it.
> >
> > ictrl->ccr_limit is a good indication that subsystem supports CCR. I do
> > not think it is enough though. I say that for two reasons:
> >
> > - May be this controller does not support CCR but others do on the same
> >   subsystem. There is nothing prevents subsystem from putting a cap of
> >   CCR at subsytem level.
This is not a concern.  Controllers should support CCR uniformly in
a subsystem, and it would be unusual for them not to.

> > - May be this controller supports CCR command but not now because all
> >   CCR slots are used now. This can happen in the case of cascading
> >   failure.
This is the concern.  Because the code modifies ccrl as it has ccr cmds
outstanding, checking ictrl may not give an accurate picture if ccr is
supported.


> >
> > >
> > > I'm not worried about the coding on the host is so bad. It's more the
> > > multiple paths that must have cmds sent to them and getting error
> > > responses for unknown cmds (should be responded to ok, but you never
> > > know) as well as creating conditions for other errors where there will
> > > be no return for it - e.g. other paths losing connectivity while the ccr
> > > outstanding, etc. yes, they all have to work, but why bother adding
> > > these flows to an old controller that would never do CCR ?
> >
> > If nvme_find_ctrl_ccr() returns a source controller to use then we know
> > the controller supports CCR and does have an available slot to process
> > this CCR request. I do not see how this code will send CCR request to an
> > old controller that does not know about CCR command.
> >
> > I am not fully opposed against using ictrl->ccr_limit to return early. I
> > do not see the need for it. If you feel strongly about it I can update
> > nvme_fence_ctrl() to do so.
> >
>
> I forgot to mention that ctrl->ccr_limit is initialized from id->ccrl in
> nvme_init_identify(). If this value is greater than zero then we know
> the controller does support CCR. nvme_find_ctrl_ccr() checks for that
> and the returned source controller must support CCR and has a slot
> available for it.