Message-ID: <144a314b-75b8-4d0c-ae19-40680cb681d1@flourine.local>
Date: Tue, 6 May 2025 18:36:14 +0200
From: Daniel Wagner <dwagner@...e.de>
To: linux-nvme@...ts.infradead.org
Cc: James Smart <james.smart@...adcom.com>, Christoph Hellwig <hch@....de>,
Sagi Grimberg <sagi@...mberg.me>, Chaitanya Kulkarni <kch@...dia.com>,
Hannes Reinecke <hare@...e.de>, Keith Busch <kbusch@...nel.org>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v5 00/14] nvmet-fcloop: track resources via reference
counting
On Wed, Apr 23, 2025 at 03:21:43PM +0200, Daniel Wagner wrote:
> Note blktests nvme/030 test is likely to fail if the
> 70-nvmf-autoconnect.rules is active. In this case two discovery are
> running in parallel and nvme-cli/libnvme gets out of sync. I don't see a
> problem in blktests, but maybe I am just blind:
>
> nvme/030 (tr=fc) (ensure the discovery generation counter is updated appropriately) [failed]
> runtime 1.843s ... 1.719s
> --- tests/nvme/030.out 2023-08-30 08:39:08.428409596 +0000
> +++ /tmp/blktests/nodev_tr_fc/nvme/030.out.bad 2025-04-10 10:56:05.146372112 +0000
> @@ -1,2 +1,6 @@
> Running nvme/030
> +Failed to open ctrl nvme1, errno 11
> +Failed to open ctrl nvme1, errno 11
> +failed to get discovery log: Bad file descriptor
It turns out that nvme/030 uncovered a bunch of bugs. At first the kernel
returned EAGAIN consistently for a while and I could easily reproduce it,
but after updating something it went away. I think the EAGAIN was issued
because in my test setup the udev rule is active and triggers a discovery
(creating a discovery ctrl) which runs in parallel with the test, which is
also running a discovery. I think the EAGAIN was always there, but it is
hard to hit.
I've added a workaround to libnvme to handle EINTR, but after reading up
on signals I came to the conclusion that nvme-cli needs to handle EAGAIN
and EINTR. The EINTR case might be entered with Ctrl-C, and in this case
we want to terminate the loop. Installing a signal handler in a library
is a no-go from my understanding:
https://github.com/linux-nvme/nvme-cli/pull/2797
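To illustrate the idea (this is only a rough sketch, not the code from the
pull request above): the application installs the SIGINT handler and owns a
stop flag, and the retry loop treats EAGAIN and EINTR as retryable unless
the flag is set. All names here (stop_requested, retry_op, flaky_op,
install_sigint_handler) are made up for the example and are not libnvme or
nvme-cli API; the retry limit and the 1 ms delay are arbitrary assumptions.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical flag owned by the application, not by the library. */
static volatile sig_atomic_t stop_requested;

static void on_sigint(int sig)
{
	(void)sig;
	stop_requested = 1;
}

/* Installed by the application. No SA_RESTART, so blocking calls see EINTR. */
int install_sigint_handler(void)
{
	struct sigaction sa = { 0 };

	sa.sa_handler = on_sigint;
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGINT, &sa, NULL);
}

/* Hypothetical operation type: returns 0 on success, -1 with errno set. */
typedef int (*op_fn)(void *arg);

/*
 * Call op until it succeeds. EAGAIN and EINTR are retried up to
 * max_retries times; a pending stop request (Ctrl-C) ends the loop
 * with errno set to EINTR. Any other error is returned immediately.
 */
int retry_op(op_fn op, void *arg, int max_retries)
{
	const struct timespec delay = { .tv_sec = 0, .tv_nsec = 1000000 };
	int err = EAGAIN;

	for (int attempt = 0; attempt <= max_retries; attempt++) {
		if (stop_requested) {
			errno = EINTR;
			return -1;
		}
		if (op(arg) == 0)
			return 0;
		err = errno;
		if (err != EAGAIN && err != EINTR)
			break;
		nanosleep(&delay, NULL);	/* short back-off before retrying */
	}
	errno = err;
	return -1;
}

/* Demo operation: fails with EAGAIN until *remaining reaches zero. */
int flaky_op(void *arg)
{
	int *remaining = arg;

	if (*remaining > 0) {
		(*remaining)--;
		errno = EAGAIN;
		return -1;
	}
	return 0;
}
```

With something like this the library never touches signal dispositions; it
only reports EINTR/EAGAIN, and the caller decides whether to retry or stop.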
After getting this out of the way, I figured out that nvmet-fc is not
able to handle more than one in-flight async, and there is a nested
locking issue in nvme-fc.
The tests get more and more reliable, though I thought I saw a KASAN
report but now it's not reproducing. Yeah, everyone loves heisenbugs.