Message-Id: <20241029-nvme-fc-handle-com-lost-v2-2-5b0d137e2a0a@kernel.org>
Date: Tue, 29 Oct 2024 19:40:12 +0100
From: Daniel Wagner <wagi@...nel.org>
To: James Smart <james.smart@...adcom.com>, Keith Busch <kbusch@...nel.org>,
Christoph Hellwig <hch@....de>, Sagi Grimberg <sagi@...mberg.me>,
Hannes Reinecke <hare@...e.de>, Paul Ely <paul.ely@...adcom.com>
Cc: linux-nvme@...ts.infradead.org, linux-kernel@...r.kernel.org,
Daniel Wagner <wagi@...nel.org>
Subject: [PATCH v2 2/2] nvme: handle connectivity loss in
nvme_set_queue_count
nvme_set_queue_count is called when setting up the IO queues. When
updating the number of queues fails, the function ignores all NVMe
errors. The assumption is that the controller is in a degraded state:
the admin queue is up and running, but the IO queues are not.

In this state it is still possible to issue admin commands to the
controller to mitigate the problem; that's why the controller is
allowed to enter the LIVE state.
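For reference, here is a condensed sketch of the pre-patch error
handling in nvme_set_queue_count (simplified from
drivers/nvme/host/core.c, not the verbatim code):

	status = nvme_set_features(ctrl, NVME_FEAT_NUM_QUEUES, q_count,
			NULL, 0, &result);
	if (status < 0)
		return status;		/* only kernel errors abort */
	if (status > 0) {
		/* any NVMe status is swallowed: assume a degraded
		 * controller and continue without IO queues */
		dev_err(ctrl->device,
			"Could not set queue count (%d)\n", status);
		*count = 0;
		return 0;
	}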
However, by filtering out all errors, it also filters out connectivity
loss events for fabric controllers:
1) nvme nvme10: NVME-FC{10}: create association : ...
2) nvme nvme10: NVME-FC{10}: controller connectivity lost. Awaiting Reconnect
nvme nvme10: queue_size 128 > ctrl maxcmd 32, reducing to maxcmd
3) nvme nvme10: Could not set queue count (880)
nvme nvme10: Failed to configure AEN (cfg 900)
4) nvme nvme10: NVME-FC{10}: controller connect complete
5) nvme nvme10: failed nvme_keep_alive_end_io error=4
A new connection attempt is started (1), and while connecting the host
receives a connectivity loss event (2). (3) is the point where the
connect code observes the problem (status 880 is 0x370, i.e.
NVME_SC_HOST_PATH_ERROR) but ignores it, and the controller enters the
LIVE state at (4). The keep-alive command eventually times out (5),
but again, this type of error is ignored.

Note that the status in nvme_keep_alive_end_io is EINTR; the real
reason (connectivity loss) has already been lost at this point.
Thus, catch the error early in nvme_set_queue_count, where we still
have the exact reason why nvme_set_features failed, and bail out from
there.
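On the caller side, this means the transport's connect path sees the
error and aborts instead of going LIVE. A simplified, hypothetical
sketch of how e.g. nvme-fc's association setup would react (names
abbreviated, not the verbatim code):

	ret = nvme_set_queue_count(&ctrl->ctrl, &nr_io_queues);
	if (ret)
		/*
		 * Creating the association fails; the transport tears
		 * the controller down and schedules a reconnect
		 * instead of entering the LIVE state.
		 */
		return ret;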
Fixes: 9a0be7abb62f ("nvme: refactor set_queue_count")
Signed-off-by: Daniel Wagner <wagi@...nel.org>
---
drivers/nvme/host/core.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 84cb859a911d09dbe71b2f1ac473ae687c4dc687..cc5ed6daf61f6cbc6fdf7b48687e25225bfd9f17 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1664,7 +1664,12 @@ int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count)
 
 	status = nvme_set_features(ctrl, NVME_FEAT_NUM_QUEUES, q_count, NULL, 0,
 			&result);
-	if (status < 0)
+	/*
+	 * It's either a kernel error or the host observed a connection
+	 * loss. In either case it is not possible to communicate with the
+	 * controller, so enter the error code path.
+	 */
+	if (status < 0 || status == NVME_SC_HOST_PATH_ERROR)
 		return status;
 
 	/*
--
2.47.0