Message-Id: <20241029-nvme-fc-handle-com-lost-v2-0-5b0d137e2a0a@kernel.org>
Date: Tue, 29 Oct 2024 19:40:10 +0100
From: Daniel Wagner <wagi@...nel.org>
To: James Smart <james.smart@...adcom.com>, Keith Busch <kbusch@...nel.org>, 
 Christoph Hellwig <hch@....de>, Sagi Grimberg <sagi@...mberg.me>, 
 Hannes Reinecke <hare@...e.de>, Paul Ely <paul.ely@...adcom.com>
Cc: linux-nvme@...ts.infradead.org, linux-kernel@...r.kernel.org, 
 Daniel Wagner <wagi@...nel.org>
Subject: [PATCH v2 0/2] nvme-fc: fix race with connectivity loss and
 nvme_fc_create_association

We got a bug report that a controller was stuck in the connected state
after an association dropped.

It turns out that nvme_fc_create_association can succeed even though some
of its operations fail. This is intentional, to handle the degraded
controller case where the admin queue is up and running but the I/O
queues are not. In this case the controller still reaches the LIVE state.

Unfortunately, this also swallows a full connectivity loss on fabrics
controllers. Let's address this by no longer filtering out all errors in
nvme_set_queue_count.
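
Concretely, that means no longer treating a connectivity loss like an
ordinary controller error; something along these lines (sketch only, the
actual change is in patch 2/2):

    status = nvme_set_features(ctrl, NVME_FEAT_NUM_QUEUES, q_count,
                               NULL, 0, &result);
    /*
     * Either a kernel error or a lost connection: in both cases we can
     * no longer talk to the controller, so take the error path instead
     * of pretending the controller is merely degraded.
     */
    if (status < 0 || status == NVME_SC_HOST_PATH_ERROR)
            return status;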

I haven't tested this version yet, as it needs a bit of tinkering in my
setup. So the question is: is this a better approach? It would also be
great to hear from Paul whether this works.

In theory, the nvme_set_queue_count call could still succeed and
connectivity could be lost afterwards, just before the controller enters
the LIVE state. In that case the only component that observes the
connectivity loss is the keep-alive handler, which currently does
nothing. I think we should also trigger a reset in this case. What do
you think?
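
Roughly, that would mean something like this in the keep-alive
completion path (illustrative only; the error message and exact
placement are made up, nvme_reset_ctrl is the existing helper):

    if (status) {
            dev_err(ctrl->device,
                    "keep alive failed, resetting controller\n");
            /* connectivity is gone: tear down the association and let
             * the reset/reconnect machinery take over */
            nvme_reset_ctrl(ctrl);
    }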

---
Changes in v2:
  - handle connection lost in nvme_set_queue_count directly
  - collected Reviewed-by tags
  - Link to v1: https://lore.kernel.org/r/20240611190647.11856-1-dwagner@suse.de

---
Daniel Wagner (2):
      nvme-fc: go straight to connecting state when initializing
      nvme: handle connectivity loss in nvme_set_queue_count

 drivers/nvme/host/core.c | 7 ++++++-
 drivers/nvme/host/fc.c   | 3 +--
 2 files changed, 7 insertions(+), 3 deletions(-)
---
base-commit: 5e52f71f858eaff252a47530a5ad5e79309bd415
change-id: 20241029-nvme-fc-handle-com-lost-9b241936809a

Best regards,
-- 
Daniel Wagner <wagi@...nel.org>

