linux-kernel - Re: [PATCH v2 08/14] nvme: Implement cross-controller reset recovery

Message-ID: <d651af57-89aa-43fa-b7a4-6c4315fbd996@suse.de>
Date: Wed, 11 Feb 2026 16:19:19 +0100
From: Hannes Reinecke <hare@...e.de>
To: Randy Jennings <randyj@...estorage.com>,
 Mohamed Khalfella <mkhalfella@...estorage.com>
Cc: Justin Tee <justin.tee@...adcom.com>,
 Naresh Gottumukkala <nareshgottumukkala83@...il.com>,
 Paul Ely <paul.ely@...adcom.com>, Chaitanya Kulkarni <kch@...dia.com>,
 Christoph Hellwig <hch@....de>, Jens Axboe <axboe@...nel.dk>,
 Keith Busch <kbusch@...nel.org>, Sagi Grimberg <sagi@...mberg.me>,
 Aaron Dailey <adailey@...estorage.com>, Dhaval Giani
 <dgiani@...estorage.com>, linux-nvme@...ts.infradead.org,
 linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2 08/14] nvme: Implement cross-controller reset recovery

On 2/11/26 04:44, Randy Jennings wrote:
> On Wed, Feb 4, 2026 at 3:24 PM Mohamed Khalfella
> <mkhalfella@...estorage.com> wrote:
>>
>> On Wed 2026-02-04 02:10:48 +0100, Hannes Reinecke wrote:
>>> On 2/3/26 21:00, Mohamed Khalfella wrote:
>>>> On Tue 2026-02-03 06:19:51 +0100, Hannes Reinecke wrote:
>>>>> On 1/30/26 23:34, Mohamed Khalfella wrote:
>>> [ .. ]
>>>>>> + timeout = nvme_fence_timeout_ms(ictrl);
>>>>>> + dev_info(ictrl->device, "attempting CCR, timeout %lums\n", timeout);
>>>>>> +
>>>>>> + now = jiffies;
>>>>>> + deadline = now + msecs_to_jiffies(timeout);
>>>>>> + while (time_before(now, deadline)) {
>>>>>> +         sctrl = nvme_find_ctrl_ccr(ictrl, min_cntlid);
>>>>>> +         if (!sctrl) {
>>>>>> +                 /* CCR failed, switch to time-based recovery */
>>>>>> +                 return deadline - now;
>>>>>> +         }
>>>>>> +
>>>>>> +         ret = nvme_issue_wait_ccr(sctrl, ictrl);
>>>>>> +         if (!ret) {
>>>>>> +                 dev_info(ictrl->device, "CCR succeeded using %s\n",
>>>>>> +                          dev_name(sctrl->device));
>>>>>> +                 nvme_put_ctrl_ccr(sctrl);
>>>>>> +                 return 0;
>>>>>> +         }
>>>>>> +
>>>>>> +         /* CCR failed, try another path */
>>>>>> +         min_cntlid = sctrl->cntlid + 1;
>>>>>> +         nvme_put_ctrl_ccr(sctrl);
>>>>>> +         now = jiffies;
>>>>>> + }
>>>>>
>>>>> That will spin until 'deadline' is reached if 'nvme_issue_wait_ccr()'
>>>>> returns an error. _And_ if the CCR itself runs into a timeout we would
>>>>> never have tried another path (which could have succeeded).
>>>>
>>>> True. We can do one thing at a time in CCR time budget. Either wait for
>>>> CCR to succeed or give up early and try another path. It is a trade off.
>>>>
>>> Yes. But I guess my point here is that we should differentiate between
>>> 'CCR failed to be sent' and 'CCR completed with error'.
>>> The logic above treats both the same.
>>>
>>>>>
>>>>> I'd rather rework this loop to open-code 'issue_and_wait()' in the loop,
>>>>> and only switch to the next controller if the submission of CCR failed.
>>>>> Once that is done we can 'just' wait for completion, as a failure there
>>>>> will be after KATO timeout anyway and any subsequent CCR would be pointless.
>>>>
>>>> If I understood this correctly then we will stick with the first sctrl
>>>> that accepts the CCR command. We wait for CCR to complete and give up on
>>>> fencing ictrl if CCR operation fails or times out. Did I get this correctly?
>>>>
>>> Yes.
>>> If a CCR could be sent but the controller failed to process it, something
>>> very odd is going on, and it's extremely questionable whether a CCR to
>>> another controller would be succeeding. That's why I would switch to the
>>> next available controller if we could not _send_ the CCR, but would
>>> rather wait for KATO if CCR processing returned an error.
>>>
>>> But the main point is that CCR is a way to _shorten_ the interval
>>> (until KATO timeout) until we can start retrying commands.
>>> If the controller ran into an error during CCR processing chances
>>> are that quite some time has elapsed already, and we might as well
>>> wait for KATO instead of retrying with yet another CCR.
>>
>> Got it. I updated the code to do that.
> It is not true that CCR failing means something odd is going on.  In a
> tightly-coupled storage HA pair, hopefully, all the NVMe controllers
> will be able to figure out the status of the other NVMe controllers.
> However, I know of multiple systems (one of which I care about) where
> the NVMe controllers may have no way of figuring out the state of some
> other NVMe controllers.  In that case, the log page entry indicates
> that the CCR might succeed on some other NVMe controller (and in these
> systems, I expect they would not be able to be particularly specific
> about which one).  Very little time will elapse for that to happen.
> 
> It is important for those systems to have a retry on another NVMe
> controller.
> 
Ah, well; that's me being mainly focused on command timeouts.
If we get an NVMe status back indicating we should retry on
another controller then clearly we should be doing that.
The comment above was primarily aimed at a CCR command for
which we do _not_ get a result back.

Or, to put it another way: as long as we're within the KATO timeout
range we should retry the CCR command on another path.
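
As a rough illustration of that policy, here is a minimal userspace
sketch. It is not the patch code and not kernel API: the enum, the
struct, and ccr_recover() are all hypothetical names, and each attempt
is simulated by a canned result plus the time it consumed. The
assumptions are that a submission failure or a "retry elsewhere" status
moves on to the next controller, that a CCR with no result ends the CCR
phase, and that an attempt which would run past the budget counts as
exhausting it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical outcome of one simulated CCR attempt (not kernel API). */
enum ccr_result {
	CCR_OK,            /* impacted controller was reset */
	CCR_SUBMIT_FAILED, /* command could not be sent on this path */
	CCR_RETRY_OTHER,   /* status came back: try another controller */
	CCR_TIMED_OUT,     /* no result came back at all */
};

struct ccr_attempt {
	enum ccr_result result;
	unsigned long cost_ms; /* simulated time the attempt consumed */
};

/*
 * Try candidate source controllers in order within budget_ms.
 * Returns 0 if a CCR succeeded, -1 if the caller should fall back to
 * time-based recovery. *remaining_ms is what is left of the budget.
 */
int ccr_recover(const struct ccr_attempt *paths, size_t npaths,
		unsigned long budget_ms, unsigned long *remaining_ms)
{
	unsigned long elapsed = 0;
	size_t i;

	assert(remaining_ms != NULL);
	assert(paths != NULL || npaths == 0);

	for (i = 0; i < npaths && elapsed < budget_ms; i++) {
		unsigned long left = budget_ms - elapsed;

		/* Attempt would not finish inside the budget. */
		if (paths[i].cost_ms >= left) {
			*remaining_ms = 0;
			return -1;
		}
		elapsed += paths[i].cost_ms;

		switch (paths[i].result) {
		case CCR_OK:
			*remaining_ms = budget_ms - elapsed;
			return 0;
		case CCR_TIMED_OUT:
			/* No result: stop issuing CCRs, wait out the rest. */
			*remaining_ms = budget_ms - elapsed;
			return -1;
		case CCR_SUBMIT_FAILED:
		case CCR_RETRY_OTHER:
			/* Still within budget: try the next path. */
			continue;
		}
	}

	/* Ran out of candidate controllers or out of time. */
	*remaining_ms = budget_ms - elapsed;
	return -1;
}
```

With a 1000 ms budget, a sequence of "submit failed (1 ms)", "retry
elsewhere (5 ms)", "ok (20 ms)" returns 0 with 974 ms left, while a
first attempt that times out after 400 ms returns -1 with 600 ms left
and never touches the second path.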

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@...e.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich