lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <81f06615-2037-4177-b38e-d67c8e7ca95b@flourine.local>
Date: Tue, 15 Apr 2025 14:00:47 +0200
From: Daniel Wagner <dwagner@...e.de>
To: Hannes Reinecke <hare@...e.de>
Cc: Daniel Wagner <wagi@...nel.org>, Christoph Hellwig <hch@....de>, 
	Sagi Grimberg <sagi@...mberg.me>, Keith Busch <kbusch@...nel.org>, 
	John Meneghini <jmeneghi@...hat.com>, randyj@...estorage.com, 
	Mohamed Khalfella <mkhalfella@...estorage.com>, linux-nvme@...ts.infradead.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH RFC 3/3] nvme: delay failover by command quiesce timeout

On Tue, Apr 01, 2025 at 11:37:29AM +0200, Hannes Reinecke wrote:
> > --- a/drivers/nvme/host/tcp.c
> > +++ b/drivers/nvme/host/tcp.c
> > @@ -2345,6 +2345,7 @@ static void nvme_tcp_error_recovery_work(struct work_struct *work)
> >   	nvme_stop_keep_alive(ctrl);
> >   	flush_work(&ctrl->async_event_work);
> > +	nvme_schedule_failover(ctrl);
> >   	nvme_tcp_teardown_io_queues(ctrl, false);
> >   	/* unquiesce to fail fast pending requests */
> >   	nvme_unquiesce_io_queues(ctrl);
> > 
> Hmm. Rather not.
> 
> Why do we have to have a separate failover queue?

This RFC plays with the idea to handle the request which timeout on
ctrl level. The main point is to avoid touching every single transport.

An additional failover queue is likely to introduce new problems and
additional complexity. There is no free lunch. So I don't think it's a
good concept afterall.

And as discussed during LSFMM I think it would best to focus on factor
out the common code (the project I wanted to work on for a while...)
from the transports first and then figure out how to get the CQT, CCF
working.

> Can't we simply delay the error recovery by the cqt value?

Yes and no. As we know from our in house testing, it fixes the problem
for customers but it is not spec compliant.

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ