[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <7dbdce29-621a-4286-a832-2918ec1794e4@grimberg.me>
Date: Thu, 11 Apr 2024 11:37:59 +0300
From: Sagi Grimberg <sagi@...mberg.me>
To: Hannes Reinecke <hare@...e.de>, Daniel Wagner <dwagner@...e.de>
Cc: Christoph Hellwig <hch@....de>, Keith Busch <kbusch@...nel.org>,
James Smart <james.smart@...adcom.com>, linux-nvme@...ts.infradead.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v5 1/6] nvme: authentication error are always
non-retryable
On 11/04/2024 10:11, Hannes Reinecke wrote:
> On 4/10/24 15:50, Sagi Grimberg wrote:
>>
>>
>> On 10/04/2024 15:05, Hannes Reinecke wrote:
>>> On 4/10/24 12:21, Sagi Grimberg wrote:
>>>>
>>>>
>>>> On 10/04/2024 9:52, Daniel Wagner wrote:
>>>>> On Tue, Apr 09, 2024 at 11:26:00PM +0300, Sagi Grimberg wrote:
>>>>>>
>>>>>> On 09/04/2024 12:35, Daniel Wagner wrote:
>>>>>>> From: Hannes Reinecke <hare@...e.de>
>>>>>>>
>>>>>>> Any authentication errors which are generated internally are always
>>>>>>> non-retryable, so use negative error codes to ensure they are not
>>>>>>> retried.
>>>>>> The patch title says that any authentication error is not
>>>>>> retryable, and
>>>>>> the patch body says "authentication errors which are generated
>>>>>> locally
>>>>>> are non-retryable" so which one is it?
>>>>> Forgot to update the commit message. What about:
>>>>>
>>>>> All authentication errors are non-retryable, so use negative error
>>>>> codes to ensure they are not retried.
>>>>>
>>>>> ?
>>>>
>>>> I have a question, what happens if nvmet updated its credentials
>>>> (by the admin) and in the period until the host got his credentials
>>>> updated, it
>>>> happens to disconnect/reconnect. It will see an authentication
>>>> error, so it will not retry and remove the controller altogether?
>>>>
>>>> Sounds like an issue to me.
>>>
>>> Usual thing: we cannot differentiate (on the host side) whether the
>>> current PSK is _about_ to be replaced; how should the kernel
>>> know that the admin will replace the PSK in the next minutes?
>>>
>>> But that really is an issue with the standard. Currently there is no
>>> way how a target could inform the initiator that the credentials have
>>> been updated.
>>
>> I'd say that the sane thing for the host to do in this case is to
>> reconnect
>> until giving up with hope that it may work. This seems like a better
>> approach
>> than to abruptly remove the controller no?
>>
>>>
>>> We would need to define a new status code for this.
>>> In the meantime the safe operations model is to set a lifetime
>>> for each PSK, and ensure that the PSK is updated on both sides
>>> during the lifetime. With that there is a timeframe during which
>>> both PSKs are available (on the target), and the older will expire
>>> automatically once the lifetime limit is reached.
>>
>> That is a good solution, and will also prevent a loss of service until
>> the host credentials are updated as well.
>>
>> But regardless I have a feeling that simply removing the controller upon
>> an authentication error is not the right thing to do here.
>
> Guess what; that's what I tried to do initially. But then Christoph
> objected that we shouldn't generate NVMe status codes internally.
> But if we can't do that then we'll have to invent yet another way to
> return a retryable error, leading to even more band-aid.
> So I am not quite sure how we could achieve that, short of making
> _every_ error retryable...
So this whole thing is that you want to make the host to not reconnect
if the controller
sent a DNR and reconnect otherwise?
What are you returning today if the authentication failed? Am I reading
it right that you are
returning -ECONNREFUSED? I think that for the specific case of
credentials mismatch (that and only
that) you may want to return -EKEYREJECTED. That according to the
documentation (/* Key was rejected by service */)
is specific enough that perhaps we can treat it specially when asking
"should I reconnect?"
Thoughts?
Powered by blists - more mailing lists