linux-kernel - Re: [PATCH v5 00/14] nvmet-fcloop: track resources via reference counting


Message-ID: <144a314b-75b8-4d0c-ae19-40680cb681d1@flourine.local>
Date: Tue, 6 May 2025 18:36:14 +0200
From: Daniel Wagner <dwagner@...e.de>
To: linux-nvme@...ts.infradead.org
Cc: James Smart <james.smart@...adcom.com>, Christoph Hellwig <hch@....de>, 
	Sagi Grimberg <sagi@...mberg.me>, Chaitanya Kulkarni <kch@...dia.com>, 
	Hannes Reinecke <hare@...e.de>, Keith Busch <kbusch@...nel.org>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v5 00/14] nvmet-fcloop: track resources via reference
 counting

On Wed, Apr 23, 2025 at 03:21:43PM +0200, Daniel Wagner wrote:
> Note blktests nvme/030 test is likely to fail if the
> 70-nvmf-autoconnect.rules is active. In this case two discovery are
> running in parallel and nvme-cli/libnvme gets out of sync. I don't see a
> problem in blktests, but maybe I am just blind:
> 
> nvme/030 (tr=fc) (ensure the discovery generation counter is updated appropriately) [failed]
>     runtime  1.843s  ...  1.719s
>     --- tests/nvme/030.out      2023-08-30 08:39:08.428409596 +0000
>     +++ /tmp/blktests/nodev_tr_fc/nvme/030.out.bad      2025-04-10 10:56:05.146372112 +0000
>     @@ -1,2 +1,6 @@
>      Running nvme/030
>     +Failed to open ctrl nvme1, errno 11
>     +Failed to open ctrl nvme1, errno 11
>     +failed to get discovery log: Bad file descriptor

It turns out that nvme/030 uncovered a bunch of bugs. At first the kernel
returned EAGAIN (the errno 11 above) consistently for a while and I could
easily reproduce it, but after updating something it went away. I think
the EAGAIN was issued because in my test setup the udev rule is active
and triggers a discovery (creating a discovery ctrl) which runs in
parallel with the test, which is also running a discovery. I think the
EAGAIN was always there but it is hard to hit.

I've added a workaround to libnvme to handle EINTR, but after reading up
on signals I came to the conclusion that nvme-cli needs to handle both
EAGAIN and EINTR. The EINTR case might be entered via Ctrl-C, and in
that case we want to terminate the loop. Installing a signal handler in
a library is a no-go from my understanding:

  https://github.com/linux-nvme/nvme-cli/pull/2797
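
To illustrate the split I have in mind, here is a rough sketch, not the
code from the pull request. All names in it (stop_requested,
install_sigint_handler, open_retry) are made up for the example, and the
10 ms back-off is an arbitrary choice. The application owns the SIGINT
handler and only sets a flag; the retry loop retries on EAGAIN and
EINTR, but gives up as soon as the flag is set:

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stddef.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical flag, owned by the application (not by the library). */
static volatile sig_atomic_t stop_requested;

static void on_sigint(int signo)
{
	(void)signo;
	stop_requested = 1;	/* only async-signal-safe work in here */
}

/*
 * Install the SIGINT handler from the application. SA_RESTART is left
 * out on purpose, so that interrupted system calls can fail with EINTR
 * instead of being restarted behind our back.
 */
int install_sigint_handler(void)
{
	struct sigaction sa = { 0 };

	sa.sa_handler = on_sigint;
	sigemptyset(&sa.sa_mask);
	sa.sa_flags = 0;

	return sigaction(SIGINT, &sa, NULL);
}

/*
 * Hypothetical helper: open @path, retrying while the kernel reports
 * EAGAIN or EINTR. Returns a file descriptor on success, or -1 with
 * errno set: EINTR if the user asked to stop, EAGAIN if all attempts
 * were used up, or the original error for anything else.
 */
int open_retry(const char *path, int flags, int max_tries)
{
	/* Arbitrary 10 ms pause between attempts. */
	struct timespec delay = { .tv_sec = 0, .tv_nsec = 10 * 1000 * 1000 };

	assert(path != NULL);

	for (int i = 0; i < max_tries; i++) {
		int fd = open(path, flags);

		if (fd >= 0)
			return fd;
		if (errno != EAGAIN && errno != EINTR)
			return -1;	/* real error, do not retry */
		if (stop_requested) {
			errno = EINTR;	/* Ctrl-C: leave the loop */
			return -1;
		}
		nanosleep(&delay, NULL);
	}

	errno = EAGAIN;
	return -1;
}
```

The point of the split is that the library side never touches signal
dispositions; it only reports EINTR/EAGAIN and the caller decides
whether to retry or to bail out.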

After getting this out of the way, I figured out that nvmet-fc is not
able to handle more than one in-flight async, and there is a nested
locking issue in nvme-fc.

The tests are getting more and more reliable, though I thought I saw a
KASAN report which is no longer reproducing. Yeah, everyone loves
heisenbugs.