Message-ID: <CAPpK+O2a8uWa7M-7Kk=2xdhjDjWtBisz7o0yGPpah=iWrQTnNw@mail.gmail.com>
Date: Wed, 16 Apr 2025 17:47:07 -0700
From: Randy Jennings <randyj@...estorage.com>
To: Sagi Grimberg <sagi@...mberg.me>
Cc: Daniel Wagner <dwagner@...e.de>, Mohamed Khalfella <mkhalfella@...estorage.com>, 
	Daniel Wagner <wagi@...nel.org>, Christoph Hellwig <hch@....de>, Keith Busch <kbusch@...nel.org>, 
	Hannes Reinecke <hare@...e.de>, John Meneghini <jmeneghi@...hat.com>, linux-nvme@...ts.infradead.org, 
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH RFC 3/3] nvme: delay failover by command quiesce timeout

On Wed, Apr 16, 2025 at 3:15 PM Sagi Grimberg <sagi@...mberg.me> wrote:
>
>
> >> CQT comes from the controller, and if it is high, it effectively
> >> means that the controller cannot handle faster failover reliably.
> >> So I think we should leave it as is. It is the vendor problem.
> > Okay, that is one way to approach it.  However, because of the hung
> > task issue, we would be allowing the vendor to panic the initiator
> > with a hung task.  Until CCR, and without implementing other checks
> > (for events which might not happen), this hung task would happen on
> > every messy disconnect with that vendor/array.
>
> Its kind of pick your poison situation I guess.
> We can log an error for controllers that expose overly long CQT...
That sounds like a good idea.
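
As a rough illustration of that logging idea, here is a minimal
user-space sketch, not code from the patch set.  It assumes CQT is
reported in milliseconds; the 30-second cap and the names
CQT_WARN_LIMIT_MS, cqt_is_excessive() and cqt_check_and_warn() are all
hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical cap: not a value from the NVMe spec or the patch set. */
#define CQT_WARN_LIMIT_MS 30000u

/* True when the reported CQT (assumed to be in ms) is above the limit. */
static bool cqt_is_excessive(uint32_t cqt_ms, uint32_t limit_ms)
{
	return cqt_ms > limit_ms;
}

/*
 * Log an error for a controller that exposes an overly long CQT.
 * The value is only reported, not clamped: failover would still be
 * delayed by the full CQT.  Returns true if an error was logged.
 */
static bool cqt_check_and_warn(const char *ctrl_name, uint32_t cqt_ms)
{
	if (!cqt_is_excessive(cqt_ms, CQT_WARN_LIMIT_MS))
		return false;

	fprintf(stderr, "%s: controller reports long CQT (%u ms > %u ms)\n",
		ctrl_name, (unsigned int)cqt_ms,
		(unsigned int)CQT_WARN_LIMIT_MS);
	return true;
}
```

In the driver, a check like this would presumably use dev_err() or
dev_warn() at the point where the controller identify data is parsed,
rather than fprintf().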

> Not sure we'll see a hung task here tho, its not like there is a
> kthread blocking on this, its a delayed work so I think the watchdog
> won't complain about it...
That is probably true with this patch set.

I believe controller reset (for instance, requested through sysfs) is
not supposed to finish until all the requests are no longer being
processed (at least if it should have the semantics of a controller
level reset from the spec).  This patch set may not tie the two
together on a disconnected controller, but I think it should.  Also,
if reconnection in error recovery is tied to this delay, as it is in
the patches Mohamed posted (https://lkml.org/lkml/2025/3/24/1136),
other things will be waiting for error recovery to finish.  Delaying
reconnection in error recovery until the requests are dead makes a lot
of sense to me.

Sincerely,
Randy Jennings
