Message-Id: <20241029-nvme-fc-handle-com-lost-v2-0-5b0d137e2a0a@kernel.org>
Date: Tue, 29 Oct 2024 19:40:10 +0100
From: Daniel Wagner <wagi@...nel.org>
To: James Smart <james.smart@...adcom.com>, Keith Busch <kbusch@...nel.org>, 
 Christoph Hellwig <hch@....de>, Sagi Grimberg <sagi@...mberg.me>, 
 Hannes Reinecke <hare@...e.de>, Paul Ely <paul.ely@...adcom.com>
Cc: linux-nvme@...ts.infradead.org, linux-kernel@...r.kernel.org, 
 Daniel Wagner <wagi@...nel.org>
Subject: [PATCH v2 0/2] nvme-fc: fix race with connectivity loss and
 nvme_fc_create_association

We got a bug report that a controller was stuck in the connected state
after an association dropped.

It turns out that nvme_fc_create_association can succeed even though some
of its operations fail. This is intentional, to handle the degraded
controller case where the admin queue is up and running but the I/O
queues are not. In this case the controller still reaches the LIVE state.

Unfortunately, this also swallows a full connectivity loss on fabrics
controllers. Let's address this by no longer filtering out all errors in
nvme_set_queue_count.
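
Concretely, that means no longer treating a connectivity loss like an
ordinary controller error; something along these lines (sketch only, the
actual change is in patch 2/2):

    status = nvme_set_features(ctrl, NVME_FEAT_NUM_QUEUES, q_count,
                               NULL, 0, &result);
    /*
     * Either a kernel error or a lost connection: in both cases we can
     * no longer talk to the controller, so take the error path instead
     * of pretending the controller is merely degraded.
     */
    if (status < 0 || status == NVME_SC_HOST_PATH_ERROR)
            return status;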

I haven't tested this version yet, as it needs a bit of tinkering in my
setup. So the question is: is this a better approach? It would also be
great to hear from Paul whether this works.

In theory, the nvme_set_queue_count call could still succeed and
connectivity could be lost afterwards, just before the controller enters
the LIVE state. In that case the only component that observes the
connectivity loss is the keep-alive handler, which currently does
nothing. I think we should also trigger a reset in this case. What do
you think?
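
Roughly, that would mean something like this in the keep-alive
completion path (illustrative only; the error message and exact
placement are made up, nvme_reset_ctrl is the existing helper):

    if (status) {
            dev_err(ctrl->device,
                    "keep alive failed, resetting controller\n");
            /* connectivity is gone: tear down the association and let
             * the reset/reconnect machinery take over */
            nvme_reset_ctrl(ctrl);
    }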

---
Changes in v2:
  - handle connection lost in nvme_set_queue_count directly
  - collected Reviewed-by tags
  - Link to v1: https://lore.kernel.org/r/20240611190647.11856-1-dwagner@suse.de

---
Daniel Wagner (2):
      nvme-fc: go straight to connecting state when initializing
      nvme: handle connectivity loss in nvme_set_queue_count

 drivers/nvme/host/core.c | 7 ++++++-
 drivers/nvme/host/fc.c   | 3 +--
 2 files changed, 7 insertions(+), 3 deletions(-)
---
base-commit: 5e52f71f858eaff252a47530a5ad5e79309bd415
change-id: 20241029-nvme-fc-handle-com-lost-9b241936809a

Best regards,
-- 
Daniel Wagner <wagi@...nel.org>

