Message-ID: <20220331120744.sb4ai6pa2ahtb3c5@carbon.lan>
Date: Thu, 31 Mar 2022 14:07:44 +0200
From: Daniel Wagner <dwagner@...e.de>
To: "Belanger, Martin" <Martin.Belanger@...l.com>
Cc: Oliver O'Halloran <oohall@...il.com>,
Tanjore Suresh <tansuresh@...gle.com>,
Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
"Rafael J . Wysocki" <rafael@...nel.org>,
Christoph Hellwig <hch@....de>,
Sagi Grimberg <sagi@...mberg.me>,
Bjorn Helgaas <bhelgaas@...gle.com>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
"linux-nvme@...ts.infradead.org" <linux-nvme@...ts.infradead.org>,
linux-pci <linux-pci@...r.kernel.org>
Subject: Re: [PATCH v1 1/3] driver core: Support asynchronous driver shutdown
On Wed, Mar 30, 2022 at 02:12:18PM +0000, Belanger, Martin wrote:
> I know this patch is mainly for PCI devices, however, NVMe over Fabrics
> devices can suffer even longer shutdowns. Last September, I reported
> that shutting down an NVMe-oF TCP connection while the network is down
> will result in a 1-minute deadlock. That's because the driver tries to perform
> a proper shutdown by sending commands to the remote target and the
> timeout for unanswered commands is 1-minute. If one needs to shut down
> several NVMe-oF connections, each connection will be shut down sequentially
> taking each 1 minute. Try running "nvme disconnect-all" while the network
> is down and you'll see what I mean. Of course, the KATO is supposed to
> detect when connectivity is lost, but if you have a long KATO (e.g. 2 minutes)
> you will most likely hit this condition.
I've been debugging something similar:
[44888.710527] nvme nvme0: Removing ctrl: NQN "xxx"
[44898.981684] nvme nvme0: failed to send request -32
[44960.982977] nvme nvme0: queue 0: timeout request 0x18 type 4
[44960.983099] nvme nvme0: Property Set error: 881, offset 0x14
Currently testing this patch:
+++ b/drivers/nvme/host/tcp.c
@@ -1103,9 +1103,12 @@ static int nvme_tcp_try_send(struct nvme_tcp_queue *queue)
 	if (ret == -EAGAIN) {
 		ret = 0;
 	} else if (ret < 0) {
+		struct request *rq = blk_mq_rq_from_pdu(queue->request);
+
 		dev_err(queue->ctrl->ctrl.device,
 			"failed to send request %d\n", ret);
-		if (ret != -EPIPE && ret != -ECONNRESET)
+		if ((ret != -EPIPE && ret != -ECONNRESET) ||
+		    rq->cmd_flags & REQ_FAILFAST_DRIVER)
 			nvme_tcp_fail_request(queue->request);
 		nvme_tcp_done_send_req(queue);
 	}
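
To make the changed condition easier to reason about, here is a minimal user-space sketch of just the predicate. It is not kernel code: SKETCH_FAILFAST_DRIVER is a hypothetical stand-in for REQ_FAILFAST_DRIVER (the bit value is made up), and sketch_should_fail_request() is a name invented for this illustration.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's REQ_FAILFAST_DRIVER flag.
 * The bit value is arbitrary and only meaningful inside this sketch. */
#define SKETCH_FAILFAST_DRIVER (1u << 0)

/*
 * Models the patched check in the "ret < 0" branch (so ret is a negative
 * errno other than -EAGAIN). Returns true when the request would be
 * failed immediately instead of being left to time out.
 */
static bool sketch_should_fail_request(int ret, unsigned int cmd_flags)
{
	/* Before the patch, these two errors never failed the request here. */
	bool skipped_before = (ret == -EPIPE || ret == -ECONNRESET);

	/* After the patch, a fail-fast request is failed regardless. */
	return !skipped_before || (cmd_flags & SKETCH_FAILFAST_DRIVER) != 0;
}
```

The -32 in the log above is -EPIPE on Linux, so without the flag check the predicate is false for that request and it waits for the command timeout. With the patch it becomes true, but only if the request actually carries the fail-fast flag, which this sketch takes as given.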