Message-ID: <CAPpK+O2a8uWa7M-7Kk=2xdhjDjWtBisz7o0yGPpah=iWrQTnNw@mail.gmail.com>
Date: Wed, 16 Apr 2025 17:47:07 -0700
From: Randy Jennings <randyj@...estorage.com>
To: Sagi Grimberg <sagi@...mberg.me>
Cc: Daniel Wagner <dwagner@...e.de>, Mohamed Khalfella <mkhalfella@...estorage.com>,
Daniel Wagner <wagi@...nel.org>, Christoph Hellwig <hch@....de>, Keith Busch <kbusch@...nel.org>,
Hannes Reinecke <hare@...e.de>, John Meneghini <jmeneghi@...hat.com>, linux-nvme@...ts.infradead.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH RFC 3/3] nvme: delay failover by command quiesce timeout
On Wed, Apr 16, 2025 at 3:15 PM Sagi Grimberg <sagi@...mberg.me> wrote:
>
>
> >> CQT comes from the controller, and if it is high, it effectively means
> >> that the controller cannot handle faster failover reliably. So I think
> >> we should leave it as is. It is the vendor's problem.
> > Okay, that is one way to approach it. However, because of the hung
> > task issue, we would be allowing the vendor to panic the initiator
> > with a hung task. Until CCR, and without implementing other checks
> > (for events which might not happen), this hung task would happen on
> > every messy disconnect with that vendor/array.
>
> It's kind of a pick-your-poison situation, I guess.
> We can log an error for controllers that expose an overly long CQT...
That sounds like a good idea.
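A rough sketch of what that could look like at init time (purely
illustrative; the ctrl->cqt field name, the millisecond units, and the
threshold are my invention, not from the patch set):

/* Sketch only: warn when the controller reports a Command Quiesce
 * Time large enough to make failover noticeably slow.  Assumes a
 * ctrl->cqt field in milliseconds on struct nvme_ctrl (from
 * drivers/nvme/host/nvme.h); name, units, and threshold are made up.
 */
#define NVME_CQT_WARN_MS	(30 * 1000)	/* arbitrary 30s cutoff */

static void nvme_warn_long_cqt(struct nvme_ctrl *ctrl)
{
	if (ctrl->cqt > NVME_CQT_WARN_MS)
		dev_warn(ctrl->device,
			 "controller reports long CQT (%u ms); failover will be delayed\n",
			 ctrl->cqt);
}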
> Not sure we'll see a hung task here though; it's not like there is a
> kthread blocking on this, it's a delayed work, so I think the watchdog
> won't complain about it...
That is probably true with this patch set.
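Right: khungtaskd only flags tasks stuck in uninterruptible sleep, and
arming a delayed work never puts a task to sleep. Something of this
shape (illustrative only, all names made up) gives the watchdog nothing
to complain about:

/* Illustrative contrast, not from the patch set: queueing delayed
 * work returns immediately, so no task sits in TASK_UNINTERRUPTIBLE
 * for khungtaskd to flag.  A wait_for_completion() in a kthread, by
 * contrast, could trip the hung-task watchdog.
 */
#include <linux/workqueue.h>
#include <linux/jiffies.h>

static void nvme_failover_expired(struct work_struct *work)
{
	/* fail over the delayed requests once CQT has elapsed */
}
static DECLARE_DELAYED_WORK(nvme_failover_work, nvme_failover_expired);

static void nvme_arm_failover_delay(unsigned int cqt_ms)
{
	/* returns immediately; no kthread sleeps waiting for the delay */
	schedule_delayed_work(&nvme_failover_work,
			      msecs_to_jiffies(cqt_ms));
}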
I believe a controller reset (for instance, one requested through
sysfs) is not supposed to finish until all of the requests are no
longer being processed (at least if it is to have the semantics of a
controller-level reset from the spec). This patch set may not tie the
two together on a disconnected controller, but I think it should.
Also, if reconnection during error recovery is tied to this delay, as
it is in the patches Mohamed posted
(https://lkml.org/lkml/2025/3/24/1136), then other things would be
waiting on error recovery to finish as well. Delaying reconnection in
error recovery until the requests are dead makes a lot of sense to me;
a rough sketch of that ordering is below.
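Roughly the shape I have in mind (purely illustrative; quiesce_deadline
and connect_work are hypothetical field names, not taken from either
patch set):

/* Hypothetical sketch of the ordering argued for above: error
 * recovery does not attempt a reconnect until the CQT-derived
 * deadline has passed, i.e. until outstanding commands on the old
 * association are known dead.  Field names are made up.
 */
static void nvme_reconnect_ctrl_work(struct work_struct *work)
{
	struct nvme_ctrl *ctrl = container_of(to_delayed_work(work),
					      struct nvme_ctrl,
					      connect_work);

	if (time_before(jiffies, ctrl->quiesce_deadline)) {
		/* re-arm instead of reconnecting early */
		queue_delayed_work(nvme_wq, &ctrl->connect_work,
				   ctrl->quiesce_deadline - jiffies);
		return;
	}

	/* requests are dead; safe to reconnect */
}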
Sincerely,
Randy Jennings