Message-ID: <08f3d804-f94b-4a2f-897b-7fee3411e6fc@suse.de>
Date: Thu, 7 Mar 2024 12:45:17 +0100
From: Hannes Reinecke <hare@...e.de>
To: Sagi Grimberg <sagi@...mberg.me>, Daniel Wagner <dwagner@...e.de>,
James Smart <james.smart@...adcom.com>
Cc: Keith Busch <kbusch@...nel.org>, Christoph Hellwig <hch@....de>,
linux-nvme@...ts.infradead.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v3 0/2] nvme-fabrics: short-circuit connect retries
On 3/7/24 12:30, Sagi Grimberg wrote:
>
>
> On 07/03/2024 12:37, Hannes Reinecke wrote:
>> On 3/7/24 09:00, Sagi Grimberg wrote:
>>>
>>> On 05/03/2024 10:00, Daniel Wagner wrote:
>>>> I've picked up Hannes' DNR patches. In short, they make the transports
>>>> behave the same way when the DNR bit is set on a re-connect attempt. We
>>>> had a discussion on this topic in the past and, if I got this right, we
>>>> all agreed that the host should honor the DNR bit on a connect
>>>> attempt [1]
>>> Umm, I don't recall this being conclusive though. The spec ought to
>>> be clearer here I think.
>>
>> I've asked the NVMexpress fmds group, and the response was pretty
>> unanimous that the DNR bit on connect should be evaluated.
>
> OK.
>
>>
>>>>
>>>> The nvme/045 test case (authentication tests) in blktests is a good
>>>> test case for this after extending it slightly. TCP and RDMA try to
>>>> reconnect with an invalid key over and over again, while loop and FC
>>>> stop after the first fail.
>>>
>>> Who says that invalid key is a permanent failure though?
>>>
>> See the response to the other patchset.
>> 'Invalid key' in this context means that the _client_ evaluated the
>> key as invalid, i.e. the key is unusable for the client.
>> As the key is passed in via the commandline there is no way the client
>> can ever change the value here, and no amount of retry will change
>> things here. That's what we try to fix.
>
> Where is this retried today? I don't see where a connect failure is
> retried, outside of a periodic reconnect.
> Maybe I'm missing what the actual failure is here.

static void nvme_tcp_reconnect_ctrl_work(struct work_struct *work)
{
        struct nvme_tcp_ctrl *tcp_ctrl =
                        container_of(to_delayed_work(work),
                        struct nvme_tcp_ctrl, connect_work);
        struct nvme_ctrl *ctrl = &tcp_ctrl->ctrl;

        ++ctrl->nr_reconnects;

        if (nvme_tcp_setup_ctrl(ctrl, false))
                goto requeue;

        dev_info(ctrl->device, "Successfully reconnected (%d attempt)\n",
                        ctrl->nr_reconnects);

        ctrl->nr_reconnects = 0;

        return;

requeue:
        dev_info(ctrl->device, "Failed reconnect attempt %d\n",

and nvme_tcp_setup_ctrl() returns either a negative errno or an NVMe
status code (which might include the DNR bit).
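
To illustrate the distinction, here is a minimal userspace sketch of how such a return value could be classified. It is not the patch under discussion: the helper name sketch_should_retry_connect() is hypothetical, the 0x4000 mask is assumed to mirror NVME_SC_DNR from include/linux/nvme.h, and treating every negative errno as retryable is an assumption made only for the example.

```c
#include <errno.h>
#include <stdbool.h>

/* Assumption: mirrors the kernel's NVME_SC_DNR (Do Not Retry) status bit. */
#define SKETCH_NVME_SC_DNR 0x4000

/*
 * Hypothetical helper: decide whether a failed connect attempt is worth
 * retrying, given a return value that is either 0 (success), a negative
 * errno, or a positive NVMe status code.
 */
static bool sketch_should_retry_connect(int ret)
{
        if (ret == 0)
                return false;   /* connected, nothing to retry */
        if (ret < 0)
                return true;    /* negative errno: assumed transient here */
        /* positive value: NVMe status, honor the DNR bit if present */
        return !(ret & SKETCH_NVME_SC_DNR);
}
```

With a check along these lines, a status such as (0x4000 | 0x0182) would stop the reconnect loop, while -ECONNREFUSED or a status without DNR would still be requeued.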
Cheers,
Hannes