Message-ID: <424fe473-8472-42b8-99c7-8b0534771343@flourine.local>
Date: Tue, 15 Apr 2025 14:34:08 +0200
From: Daniel Wagner <dwagner@...e.de>
To: Jiewei Ke <jiewei_ke@....com>
Cc: wagi@...nel.org, hare@...e.de, hch@....de, jmeneghi@...hat.com, 
	kbusch@...nel.org, linux-kernel@...r.kernel.org, linux-nvme@...ts.infradead.org, 
	mkhalfella@...estorage.com, randyj@...estorage.com, sagi@...mberg.me
Subject: Re: [PATCH RFC 3/3] nvme: delay failover by command quiesce timeout

Hi Jiewei,

On Thu, Apr 10, 2025 at 11:44:13PM +0800, Jiewei Ke wrote:
> I just noticed that your patchset addresses a similar issue to the one I'm
> trying to solve with my recently submitted patchset [1]. Compared to your
> approach, mine differs in a few key aspects:
> 
> 1. Only aborted requests are delayed for retry. In the current implementation,
> nvmf_complete_timed_out_request and nvme_cancel_request set the request status
> to NVME_SC_HOST_ABORTED_CMD. These requests are usually already sent to the
> target, but may have timed out or been canceled before a response is received.
> Since the target may still be processing them, the host needs to delay retrying
> to ensure the target has completed or cleaned up the stale requests. On the
> other hand, requests that are not aborted — such as those that never got
> submitted due to no usable path (e.g., from nvme_ns_head_submit_bio), or those
> that already received an ANA error from the target — do not need
> delayed retry.

If I understand you correctly, you are concerned about delaying all
commands, even those which have not been transmitted yet. If there are
no guarantees on ordering, it would be possible to fail over these
commands immediately. That is certainly something which could be
improved in my series.
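
Purely as an illustration (not code from either series;
nvme_defer_req_retry() and nvme_failover_req_now() are made-up helper
names), the split could look roughly like this:

/*
 * Sketch only.  Requests aborted locally after they were handed to the
 * transport carry NVME_SC_HOST_ABORTED_CMD (set by nvme_cancel_request()
 * and nvmf_complete_timed_out_request()); only those have to wait for the
 * target to clean up.  Everything else can fail over right away.
 */
static void nvme_dispose_failed_req(struct request *req)
{
	if (nvme_req(req)->status == NVME_SC_HOST_ABORTED_CMD)
		nvme_defer_req_retry(req);	/* made up: park on the delayed retry list */
	else
		nvme_failover_req_now(req);	/* made up: requeue on another path now */
}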

> 2. The host explicitly disconnects and stops KeepAlive before scheduling the
> delayed retry of requests. This aligns with Section 9.6 "Communication Loss Handling"
> of the NVMe Base Specification 2.1. Once the host disconnects, the target may
> take up to the KATO interval to detect the lost connection and begin cleaning
> up any remaining requests. Retrying too early may still lead to data
> corruption issues.

The host error handler is executed, thus all communication is stopped.
This part hasn't changed, so I'm not sure what you are referring to. The
only thing which changes in this series is when we enqueue the failed
commands again.
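
To make the timing concrete, here is a rough sketch of a KATO-based
delay (retry_kick_work is a made-up field; ctrl->kato, NVME_DEFAULT_KATO
and nvme_wq exist in the driver today):

/*
 * Sketch only: once the association is torn down, wait up to KATO before
 * kicking the parked requests back into the retry path, so the target has
 * had a chance to notice the lost connection and clean up (NVMe Base
 * Specification 2.1, section 9.6).
 */
static void nvme_schedule_retry_kick(struct nvme_ctrl *ctrl)
{
	unsigned int kato = ctrl->kato ? ctrl->kato : NVME_DEFAULT_KATO;

	/* retry_kick_work is hypothetical, it does not exist in the driver */
	queue_delayed_work(nvme_wq, &ctrl->retry_kick_work, kato * HZ);
}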

> @@ -2178,6 +2180,7 @@ static void nvme_rdma_shutdown_ctrl(struct nvme_rdma_ctrl *ctrl, bool shutdown)
>  	nvme_quiesce_admin_queue(&ctrl->ctrl);
>  	nvme_disable_ctrl(&ctrl->ctrl, shutdown);
>  	nvme_rdma_teardown_admin_queue(ctrl, shutdown);
> +	nvme_delay_kick_retry_lists(&ctrl->ctrl); <<< delayed kick of the retry lists after all queues are torn down

Without the kick it hangs. The admin has explicitly removed the ctrl.
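
In other words, on controller removal the parked requests cannot wait
out the delay; any pending kick has to be pushed through immediately,
roughly like this (again only a sketch, retry_kick_work as in the sketch
above):

/*
 * Sketch only: when the controller is going away for good there is nothing
 * to wait for; run any pending delayed kick now so the parked requests do
 * not hang on a removed controller.
 */
static void nvme_kick_retry_lists_now(struct nvme_ctrl *ctrl)
{
	mod_delayed_work(nvme_wq, &ctrl->retry_kick_work, 0);
}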

As I already said, this is an RFC for the sake of figuring out whether
this approach is a good idea. We all agreed that we should first try to
sort this out before introducing a new queue. A new queue would
introduce many problems of its own, like the one from above ('why delay
requests that were never sent?'), or 'what happens when we have several
short disconnects/connects?', ...

BTW, there is already a hack in the disconnect/connect state
transition. Ideally we'd solve this in a more generic manner.

Thanks,
Daniel
