Message-ID: <693187ac-9fe2-4ba3-8fcf-e34204fe7247@flourine.local>
Date: Tue, 7 Jan 2025 15:38:38 +0100
From: Daniel Wagner <dwagner@...e.de>
To: Sagi Grimberg <sagi@...mberg.me>
Cc: Daniel Wagner <wagi@...nel.org>, 
	James Smart <james.smart@...adcom.com>, Keith Busch <kbusch@...nel.org>, Christoph Hellwig <hch@....de>, 
	Hannes Reinecke <hare@...e.de>, Paul Ely <paul.ely@...adcom.com>, linux-nvme@...ts.infradead.org, 
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH v3 2/3] nvme: trigger reset when keep alive fails

On Tue, Dec 24, 2024 at 12:31:35PM +0200, Sagi Grimberg wrote:
> > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> > index bfd71511c85f8b1a9508c6ea062475ff51bf27fe..2a07c2c540b26c8cbe886711abaf6f0afbe6c4df 100644
> > --- a/drivers/nvme/host/core.c
> > +++ b/drivers/nvme/host/core.c
> > @@ -1320,6 +1320,12 @@ static enum rq_end_io_ret nvme_keep_alive_end_io(struct request *rq,
> >   		dev_err(ctrl->device,
> >   			"failed nvme_keep_alive_end_io error=%d\n",
> >   				status);
> > +		/*
> > +		 * The driver reports that we lost the connection,
> > +		 * trigger a recovery.
> > +		 */
> > +		if (status == BLK_STS_TRANSPORT)
> > +			nvme_reset_ctrl(ctrl);
> >   		return RQ_END_IO_NONE;
> >   	}
> > 
> 
> A lengthy explanation that results in nvme core behavior that assumes a very
> specific driver behavior.

I tried to explain exactly what is going on so that we can discuss
possible solutions without talking past each other.

In the meantime I have started on a patch set for the TP4129-related
changes in the spec (KATO Corrections and Clarifications). These
changes would also depend on the KATO timeout handler triggering a
reset.

I am fine with dropping this change for now and discussing it in
light of TP4129, if that is what you prefer.

> Isn't the root of the problem that FC is willing to live
> peacefully with a controller
> without any queues/connectivity to it without periodically reconnecting?

The root problem is that the connectivity-loss event gets ignored in
the CONNECTING state during the first connection attempt. For
subsequent reconnect attempts everything works fine.
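
To illustrate (paraphrased from nvme_fc_ctrl_connectivity_loss() in
fc.c, so take the details with a grain of salt): the state switch
only acts on NEW/LIVE and leaves CONNECTING alone, so a loss that
races with the initial connect is silently dropped:

	switch (nvme_ctrl_state(&ctrl->ctrl)) {
	case NVME_CTRL_NEW:
	case NVME_CTRL_LIVE:
		/* Terminate the association via a controller reset. */
		nvme_reset_ctrl(&ctrl->ctrl);
		break;
	case NVME_CTRL_CONNECTING:
		/* Reconnects assumed to be in progress; do nothing. */
		break;
	default:
		break;
	}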

Maybe something like this instead? (untested)

diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index c4cbe3ce81f7..1f1d1d62a978 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -148,6 +148,7 @@ struct nvme_fc_rport {
 #define ASSOC_ACTIVE		0
 #define ASSOC_FAILED		1
 #define FCCTRL_TERMIO		2
+#define CONNECTIVITY_LOST	3

 struct nvme_fc_ctrl {
 	spinlock_t		lock;
@@ -785,6 +786,8 @@ nvme_fc_ctrl_connectivity_loss(struct nvme_fc_ctrl *ctrl)
 		"NVME-FC{%d}: controller connectivity lost. Awaiting "
 		"Reconnect", ctrl->cnum);

+	set_bit(CONNECTIVITY_LOST, &ctrl->flags);
+
 	switch (nvme_ctrl_state(&ctrl->ctrl)) {
 	case NVME_CTRL_NEW:
 	case NVME_CTRL_LIVE:
@@ -3071,6 +3074,8 @@ nvme_fc_create_association(struct nvme_fc_ctrl *ctrl)
 	if (nvme_fc_ctlr_active_on_rport(ctrl))
 		return -ENOTUNIQ;

+	clear_bit(CONNECTIVITY_LOST, &ctrl->flags);
+
 	dev_info(ctrl->ctrl.device,
 		"NVME-FC{%d}: create association : host wwpn 0x%016llx "
 		" rport wwpn 0x%016llx: NQN \"%s\"\n",
@@ -3174,6 +3179,11 @@ nvme_fc_create_association(struct nvme_fc_ctrl *ctrl)

 	changed = nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_LIVE);

+	if (test_bit(CONNECTIVITY_LOST, &ctrl->flags)) {
+		ret = -EIO;
+		goto out_term_aeo_ops;
+	}
+
 	ctrl->ctrl.nr_reconnects = 0;

 	if (changed)
