linux-kernel - Re: [PATCH RFC 3/3] nvme: delay failover by command quiesce timeout

Message-ID: <81f06615-2037-4177-b38e-d67c8e7ca95b@flourine.local>
Date: Tue, 15 Apr 2025 14:00:47 +0200
From: Daniel Wagner <dwagner@...e.de>
To: Hannes Reinecke <hare@...e.de>
Cc: Daniel Wagner <wagi@...nel.org>, Christoph Hellwig <hch@....de>, 
	Sagi Grimberg <sagi@...mberg.me>, Keith Busch <kbusch@...nel.org>, 
	John Meneghini <jmeneghi@...hat.com>, randyj@...estorage.com, 
	Mohamed Khalfella <mkhalfella@...estorage.com>, linux-nvme@...ts.infradead.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH RFC 3/3] nvme: delay failover by command quiesce timeout

On Tue, Apr 01, 2025 at 11:37:29AM +0200, Hannes Reinecke wrote:
> > --- a/drivers/nvme/host/tcp.c
> > +++ b/drivers/nvme/host/tcp.c
> > @@ -2345,6 +2345,7 @@ static void nvme_tcp_error_recovery_work(struct work_struct *work)
> >   	nvme_stop_keep_alive(ctrl);
> >   	flush_work(&ctrl->async_event_work);
> > +	nvme_schedule_failover(ctrl);
> >   	nvme_tcp_teardown_io_queues(ctrl, false);
> >   	/* unquiesce to fail fast pending requests */
> >   	nvme_unquiesce_io_queues(ctrl);
> > 
> Hmm. Rather not.
> 
> Why do we have to have a separate failover queue?

This RFC plays with the idea of handling requests that time out at the
ctrl level. The main point is to avoid touching every single transport.

An additional failover queue is likely to introduce new problems and
additional complexity. There is no free lunch. So I don't think it's a
good concept after all.

And as discussed during LSFMM, I think it would be best to focus on
factoring out the common code (the project I wanted to work on for a
while...) from the transports first and then figure out how to get CQT
and CCF working.

> Can't we simply delay the error recovery by the cqt value?

Yes and no. As we know from our in-house testing, it fixes the problem
for customers, but it is not spec-compliant.