Message-ID: <12892dbf-9cdc-48c3-a67d-0c626d54100a@flourine.local>
Date: Wed, 16 Apr 2025 10:51:38 +0200
From: Daniel Wagner <dwagner@...e.de>
To: Sagi Grimberg <sagi@...mberg.me>
Cc: Daniel Wagner <wagi@...nel.org>, Christoph Hellwig <hch@....de>,
Keith Busch <kbusch@...nel.org>, Hannes Reinecke <hare@...e.de>,
John Meneghini <jmeneghi@...hat.com>, randyj@...estorage.com,
Mohamed Khalfella <mkhalfella@...estorage.com>, linux-nvme@...ts.infradead.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH RFC 3/3] nvme: delay failover by command quiesce timeout
On Mon, Apr 14, 2025 at 01:03:57AM +0300, Sagi Grimberg wrote:
> > --- a/drivers/nvme/host/core.c
> > +++ b/drivers/nvme/host/core.c
> > @@ -239,6 +239,7 @@ static void nvme_do_delete_ctrl(struct nvme_ctrl *ctrl)
> > flush_work(&ctrl->reset_work);
> > nvme_stop_ctrl(ctrl);
> > + nvme_flush_failover(ctrl);
>
> This will terminate all pending failover commands - this is the intended
> behavior?
Yes, it will, and I don't think that's the intended behavior. From the
feedback I have gathered so far, it seems the correct way (per the spec)
is to handle each command individually.
FWIW, we have been shipping
https://lore.kernel.org/linux-nvme/20230908100049.80809-3-hare@suse.de/
as a stop-gap solution for a while, and our customer hasn't been able to
reproduce the ghost writes anymore. As far as I can tell, it also fails
over all pending commands.
>
> > nvme_remove_namespaces(ctrl);
> > ctrl->ops->delete_ctrl(ctrl);
> > nvme_uninit_ctrl(ctrl);
> > @@ -1310,6 +1311,19 @@ static void nvme_queue_keep_alive_work(struct nvme_ctrl *ctrl)
> > queue_delayed_work(nvme_wq, &ctrl->ka_work, delay);
> > }
> > +void nvme_schedule_failover(struct nvme_ctrl *ctrl)
> > +{
> > + unsigned long delay;
> > +
> > + if (ctrl->cqt)
> > + delay = msecs_to_jiffies(ctrl->cqt);
> > + else
> > + delay = ctrl->kato * HZ;
>
> This delay was there before?
No, this function returns the additional time the host should wait
before it starts to fail over. This is the CQT value, which per the
protocol applies after the KATO timeout; this extra delay is currently
not present in the nvme subsystem.
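To make this concrete: I'd expect the function to finish by queueing the
delayed failover work, along these lines (untested sketch; the tail of
the function is cut off in the quote above, and failover_work is my
assumed name for the delayed work item backing nvme_failover_work):

void nvme_schedule_failover(struct nvme_ctrl *ctrl)
{
	unsigned long delay;

	/* CQT is reported in milliseconds, KATO in seconds */
	if (ctrl->cqt)
		delay = msecs_to_jiffies(ctrl->cqt);
	else
		delay = ctrl->kato * HZ;

	/* drain ctrl->failover_list once the delay has expired */
	queue_delayed_work(nvme_wq, &ctrl->failover_work, delay);
}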
> > void nvme_failover_req(struct request *req)
> > {
> > struct nvme_ns *ns = req->q->queuedata;
> > + struct nvme_ctrl *ctrl = nvme_req(req)->ctrl;
> > u16 status = nvme_req(req)->status & NVME_SCT_SC_MASK;
> > unsigned long flags;
> > struct bio *bio;
> > + enum nvme_ctrl_state state = nvme_ctrl_state(ctrl);
> > nvme_mpath_clear_current_path(ns);
> > @@ -121,9 +123,53 @@ void nvme_failover_req(struct request *req)
> > blk_steal_bios(&ns->head->requeue_list, req);
> > spin_unlock_irqrestore(&ns->head->requeue_lock, flags);
> > - nvme_req(req)->status = 0;
> > - nvme_end_req(req);
> > - kblockd_schedule_work(&ns->head->requeue_work);
> > + spin_lock_irqsave(&ctrl->lock, flags);
> > + list_add_tail(&req->queuelist, &ctrl->failover_list);
> > + spin_unlock_irqrestore(&ctrl->lock, flags);
>
> So these requests - which we stripped of their bios - are now placed
> on a failover queue?
Yes, this is the idea.
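To spell the idea out: once the delay from nvme_schedule_failover has
expired, the work item drains the list and completes the (bio-less)
requests, roughly like this (a sketch assembled from the quoted hunks,
not the actual patch):

static void nvme_failover_work(struct work_struct *work)
{
	struct nvme_ctrl *ctrl = container_of(to_delayed_work(work),
					      struct nvme_ctrl, failover_work);
	struct request *req, *tmp;
	unsigned long flags;
	LIST_HEAD(failover_list);

	/* splice the pending requests out from under the lock */
	spin_lock_irqsave(&ctrl->lock, flags);
	list_splice_init(&ctrl->failover_list, &failover_list);
	spin_unlock_irqrestore(&ctrl->lock, flags);

	list_for_each_entry_safe(req, tmp, &failover_list, queuelist) {
		struct nvme_ns *ns = req->q->queuedata;

		list_del_init(&req->queuelist);
		/* the bios were already stolen in nvme_failover_req() */
		nvme_req(req)->status = 0;
		nvme_end_req(req);
		/* kick the requeue work so the stolen bios get resubmitted */
		kblockd_schedule_work(&ns->head->requeue_work);
	}
}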
> > +
> > + if (state == NVME_CTRL_DELETING) {
> > + /*
> > + * Requests which fail over in the DELETING state were
> > + * canceled, and blk_mq_tagset_wait_completed_request will
> > + * block until they have been processed. However,
> > + * nvme_failover_work is already stopped at that point. Thus
> > + * schedule a failover; it's still necessary to delay these
> > + * commands by CQT.
> > + */
> > + nvme_schedule_failover(ctrl);
>
> This condition is unclear. Shouldn't any request that fails over do this?
> Something is unclear here.
Yes, the comment is not very clear, sorry about that. I observed that
blk_mq_tagset_wait_completed_request would block when the controller was
already in the DELETING state and requests were canceled, IIRC. The
commands would be queued on the failover queue but never processed by
nvme_failover_work.
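In other words, nvme_flush_failover in the delete path exists so the
delete doesn't have to wait out the full delay; I'd picture it roughly
as (sketch only, assuming the delayed work item from above; I haven't
re-checked how this interacts with the stopped worker):

void nvme_flush_failover(struct nvme_ctrl *ctrl)
{
	/* run any scheduled failover now and wait for the list to drain */
	mod_delayed_work(nvme_wq, &ctrl->failover_work, 0);
	flush_delayed_work(&ctrl->failover_work);
}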
> > + nvme_schedule_failover(&ctrl->ctrl);
> > nvme_rdma_teardown_io_queues(ctrl, shutdown);
> > nvme_quiesce_admin_queue(&ctrl->ctrl);
> > nvme_disable_ctrl(&ctrl->ctrl, shutdown);
> > diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> > index d0023bcfd8a79a193adf2807a24481c8c164a174..3a6c1d3febaf233996e4dcf684793327b5d1412f 100644
> > --- a/drivers/nvme/host/tcp.c
> > +++ b/drivers/nvme/host/tcp.c
> > @@ -2345,6 +2345,7 @@ static void nvme_tcp_error_recovery_work(struct work_struct *work)
> > nvme_stop_keep_alive(ctrl);
> > flush_work(&ctrl->async_event_work);
> > + nvme_schedule_failover(ctrl);
> > nvme_tcp_teardown_io_queues(ctrl, false);
> > /* unquiesce to fail fast pending requests */
> > nvme_unquiesce_io_queues(ctrl);
> >
>
> What is the reason for rdma and tcp not being identical?
Sorry, I can't remember. But I really want to start moving this code
into one place.
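As a direction, the common prologue could live in one fabrics helper
that both transports call before their queue teardown, e.g. (just a
sketch of the idea; the helper name is made up):

/* common error-recovery prologue shared by tcp and rdma */
static void nvme_fabrics_prepare_failover(struct nvme_ctrl *ctrl)
{
	nvme_stop_keep_alive(ctrl);
	flush_work(&ctrl->async_event_work);
	nvme_schedule_failover(ctrl);
	/* the transport then tears down its I/O queues and unquiesces */
}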