Message-ID: <CAPpK+O2a8uWa7M-7Kk=2xdhjDjWtBisz7o0yGPpah=iWrQTnNw@mail.gmail.com>
Date: Wed, 16 Apr 2025 17:47:07 -0700
From: Randy Jennings <randyj@...estorage.com>
To: Sagi Grimberg <sagi@...mberg.me>
Cc: Daniel Wagner <dwagner@...e.de>, Mohamed Khalfella <mkhalfella@...estorage.com>,
Daniel Wagner <wagi@...nel.org>, Christoph Hellwig <hch@....de>, Keith Busch <kbusch@...nel.org>,
Hannes Reinecke <hare@...e.de>, John Meneghini <jmeneghi@...hat.com>, linux-nvme@...ts.infradead.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH RFC 3/3] nvme: delay failover by command quiesce timeout
On Wed, Apr 16, 2025 at 3:15 PM Sagi Grimberg <sagi@...mberg.me> wrote:
>
>
> >> CQT comes from the controller, and if it is high, it effectively means
> >> that the controller cannot handle faster failover reliably. So I think
> >> we should leave it as is. It is the vendor's problem.
> > Okay, that is one way to approach it. However, because of the hung
> > task issue, we would be allowing the vendor to panic the initiator
> > with a hung task. Until CCR, and without implementing other checks
> > (for events which might not happen), this hung task would happen on
> > every messy disconnect with that vendor/array.
>
> It's kind of a pick-your-poison situation, I guess.
> We can log an error for controllers that expose an overly long CQT...
That sounds like a good idea.
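A rough sketch of what that could look like at init time (purely
illustrative; the ctrl->cqt field name, the millisecond units, and the
threshold are my invention, not from the patch set):

/* Sketch only: warn when the controller reports a Command Quiesce
 * Time large enough to make failover noticeably slow.  Assumes a
 * ctrl->cqt field in milliseconds on struct nvme_ctrl (from
 * drivers/nvme/host/nvme.h); name, units, and threshold are made up.
 */
#define NVME_CQT_WARN_MS	(30 * 1000)	/* arbitrary 30s cutoff */

static void nvme_warn_long_cqt(struct nvme_ctrl *ctrl)
{
	if (ctrl->cqt > NVME_CQT_WARN_MS)
		dev_warn(ctrl->device,
			 "controller reports long CQT (%u ms); failover will be delayed\n",
			 ctrl->cqt);
}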
> Not sure we'll see a hung task here though; it's not like there is a
> kthread blocking on this, it's a delayed work, so I think the watchdog
> won't complain about it...
That is probably true with this patch set.
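Right: khungtaskd only flags tasks stuck in uninterruptible sleep, and
arming a delayed work never puts a task to sleep. Something of this
shape (illustrative only, all names made up) gives the watchdog nothing
to complain about:

/* Illustrative contrast, not from the patch set: queueing delayed
 * work returns immediately, so no task sits in TASK_UNINTERRUPTIBLE
 * for khungtaskd to flag.  A wait_for_completion() in a kthread, by
 * contrast, could trip the hung-task watchdog.
 */
#include <linux/workqueue.h>
#include <linux/jiffies.h>

static void nvme_failover_expired(struct work_struct *work)
{
	/* fail over the delayed requests once CQT has elapsed */
}
static DECLARE_DELAYED_WORK(nvme_failover_work, nvme_failover_expired);

static void nvme_arm_failover_delay(unsigned int cqt_ms)
{
	/* returns immediately; no kthread sleeps waiting for the delay */
	schedule_delayed_work(&nvme_failover_work,
			      msecs_to_jiffies(cqt_ms));
}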
I believe a controller reset (for instance, one requested through
sysfs) is not supposed to finish until all of the requests are no
longer being processed (at least if it is to have the semantics of a
controller-level reset from the spec). This patch set may not tie the
two together on a disconnected controller, but I think it should.
Also, if reconnection during error recovery is tied to this delay, as
it is in the patches Mohamed posted
(https://lkml.org/lkml/2025/3/24/1136), then other things would be
waiting on error recovery to finish as well. Delaying reconnection in
error recovery until the requests are dead makes a lot of sense to me;
a rough sketch of that ordering is below.
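Roughly the shape I have in mind (purely illustrative; quiesce_deadline
and connect_work are hypothetical field names, not taken from either
patch set):

/* Hypothetical sketch of the ordering argued for above: error
 * recovery does not attempt a reconnect until the CQT-derived
 * deadline has passed, i.e. until outstanding commands on the old
 * association are known dead.  Field names are made up.
 */
static void nvme_reconnect_ctrl_work(struct work_struct *work)
{
	struct nvme_ctrl *ctrl = container_of(to_delayed_work(work),
					      struct nvme_ctrl,
					      connect_work);

	if (time_before(jiffies, ctrl->quiesce_deadline)) {
		/* re-arm instead of reconnecting early */
		queue_delayed_work(nvme_wq, &ctrl->connect_work,
				   ctrl->quiesce_deadline - jiffies);
		return;
	}

	/* requests are dead; safe to reconnect */
}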
Sincerely,
Randy Jennings