Message-Id: <20241029-nvme-fc-handle-com-lost-v2-2-5b0d137e2a0a@kernel.org>
Date: Tue, 29 Oct 2024 19:40:12 +0100
From: Daniel Wagner <wagi@...nel.org>
To: James Smart <james.smart@...adcom.com>, Keith Busch <kbusch@...nel.org>,
Christoph Hellwig <hch@....de>, Sagi Grimberg <sagi@...mberg.me>,
Hannes Reinecke <hare@...e.de>, Paul Ely <paul.ely@...adcom.com>
Cc: linux-nvme@...ts.infradead.org, linux-kernel@...r.kernel.org,
Daniel Wagner <wagi@...nel.org>
Subject: [PATCH v2 2/2] nvme: handle connectivity loss in
nvme_set_queue_count
nvme_set_queue_count is called when setting up the IO queues. When
updating the number of queues fails, the function ignores all NVMe
errors. The assumption is that the controller is in a degraded state:
the admin queue is up and running, but the IO queues are not.

In this state it is still possible to issue admin commands to the
controller to mitigate the problem; that's why the controller is
allowed to enter the LIVE state.
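For reference, here is a condensed sketch of the pre-patch error
handling in nvme_set_queue_count (simplified from
drivers/nvme/host/core.c, not the verbatim code):

	status = nvme_set_features(ctrl, NVME_FEAT_NUM_QUEUES, q_count,
			NULL, 0, &result);
	if (status < 0)
		return status;		/* only kernel errors abort */
	if (status > 0) {
		/* any NVMe status is swallowed: assume a degraded
		 * controller and continue without IO queues */
		dev_err(ctrl->device,
			"Could not set queue count (%d)\n", status);
		*count = 0;
		return 0;
	}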
However, by filtering out all errors, it also filters out connectivity
loss events for fabric controllers:
1) nvme nvme10: NVME-FC{10}: create association : ...
2) nvme nvme10: NVME-FC{10}: controller connectivity lost. Awaiting Reconnect
nvme nvme10: queue_size 128 > ctrl maxcmd 32, reducing to maxcmd
3) nvme nvme10: Could not set queue count (880)
nvme nvme10: Failed to configure AEN (cfg 900)
4) nvme nvme10: NVME-FC{10}: controller connect complete
5) nvme nvme10: failed nvme_keep_alive_end_io error=4
A new connection attempt is started (1), and while connecting the host
receives a connectivity loss event (2). (3) is the point where the
connect code observes the problem (status 880 is 0x370, i.e.
NVME_SC_HOST_PATH_ERROR) but ignores it, and the controller enters the
LIVE state at (4). The keep-alive command eventually times out (5),
but again, this type of error is ignored.

Note that the status in nvme_keep_alive_end_io is EINTR; the real
reason (connectivity loss) has already been lost at this point.
Thus, catch the error early in nvme_set_queue_count, where we still
have the exact reason why nvme_set_features failed, and bail out from
there.
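On the caller side, this means the transport's connect path sees the
error and aborts instead of going LIVE. A simplified, hypothetical
sketch of how e.g. nvme-fc's association setup would react (names
abbreviated, not the verbatim code):

	ret = nvme_set_queue_count(&ctrl->ctrl, &nr_io_queues);
	if (ret)
		/*
		 * Creating the association fails; the transport tears
		 * the controller down and schedules a reconnect
		 * instead of entering the LIVE state.
		 */
		return ret;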
Fixes: 9a0be7abb62f ("nvme: refactor set_queue_count")
Signed-off-by: Daniel Wagner <wagi@...nel.org>
---
drivers/nvme/host/core.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 84cb859a911d09dbe71b2f1ac473ae687c4dc687..cc5ed6daf61f6cbc6fdf7b48687e25225bfd9f17 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1664,7 +1664,12 @@ int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count)
 
 	status = nvme_set_features(ctrl, NVME_FEAT_NUM_QUEUES, q_count, NULL, 0,
 			&result);
-	if (status < 0)
+	/*
+	 * It's either a kernel error or the host observed a connection
+	 * loss. In either case it is not possible to communicate with the
+	 * controller, so enter the error code path.
+	 */
+	if (status < 0 || status == NVME_SC_HOST_PATH_ERROR)
 		return status;
 
 	/*
--
2.47.0