Message-ID: <20250417014642.ljhowsbvkn2waipw@xldev1604-tmpl.dev.purestorage.com>
Date: Wed, 16 Apr 2025 19:46:42 -0600
From: Michael Liang <mliang@...estorage.com>
To: Randy Jennings <randyj@...estorage.com>
Cc: Keith Busch <kbusch@...nel.org>, Jens Axboe <axboe@...nel.dk>,
	Christoph Hellwig <hch@....de>, Sagi Grimberg <sagi@...mberg.me>,
	Mohamed Khalfella <mkhalfella@...estorage.com>,
	linux-nvme@...ts.infradead.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] nvme-tcp: wait socket wmem to drain in queue stop

On Tue, Apr 08, 2025 at 02:07:24PM -0700, Randy Jennings wrote:
> On Fri, Apr 4, 2025 at 10:49 PM Michael Liang <mliang@...estorage.com> wrote:
> >
> > This patch addresses a data corruption issue observed in nvme-tcp during
> > testing.
> >
> > Issue description:
> > In an NVMe native multipath setup, when an I/O timeout occurs, all inflight
> > I/Os are canceled almost immediately after the kernel socket is shut down.
> > These canceled I/Os are reported as host path errors, triggering a failover
> > that succeeds on a different path.
> >
> > However, at this point, the original I/O may still be outstanding in the
> > host's network transmission path (e.g., the NIC’s TX queue). From the
> > user-space app's perspective, the buffer associated with the I/O is considered
> > complete, since the I/O is acked on the different path, and may be reused for new
> > I/O requests.
> >
> > Because nvme-tcp enables zero-copy by default in the transmission path,
> > this can lead to corrupted data being sent to the original target, ultimately
> > causing data corruption.
> >
> > We can reproduce this data corruption by injecting delay on one path and
> > triggering an I/O timeout.
> >
> > To prevent this issue, this change ensures that all inflight transmissions are
> > fully completed from the host's perspective before returning from queue stop.
> > This aligns with the behavior of queue stopping in other NVMe fabric transports.
> >
> > Reviewed-by: Mohamed Khalfella <mkhalfella@...estorage.com>
> > Reviewed-by: Randy Jennings <randyj@...estorage.com>
> > Signed-off-by: Michael Liang <mliang@...estorage.com>
> 
> Through additional testing, we have recreated the corruption with this
> patch.  We had a previous iteration of the patch that ran for some time
> without the corruption, and we convinced ourselves internally that a
> portion of that version should not be needed.  So, unfortunately, it
> looks like this patch is not sufficient to prevent the data
> corruption.  We do believe the issue is still with the zero-copy and
> too-quick retransmission (our tests showed that data present in the
> buffer only while userspace controlled it was transmitted on the
> wire), but we are still investigating.
> 
> Sincerely,
> Randy Jennings
We identified what was missing in the patch. In cases where concurrent I/O
timeouts occur on multiple namespaces under the same controller, there's a
race condition when stopping the same I/O queue. Due to the current patch's
queue liveness check, only the first timeout handler waits, while the others
proceed to fail I/Os immediately.
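
To make the race concrete, here is a rough hypothetical sketch (names like
example_stop_queue_gated() and EXAMPLE_Q_LIVE are invented for illustration;
this is not the nvme-tcp code) of a stop path whose drain wait sits behind a
per-queue liveness bit:

#include <linux/bitops.h>
#include <linux/delay.h>
#include <linux/net.h>
#include <net/sock.h>

enum { EXAMPLE_Q_LIVE = 0 };		/* hypothetical per-queue flag bit */

struct example_queue {			/* hypothetical, for illustration */
	unsigned long	flags;
	struct socket	*sock;
};

static void example_stop_queue_gated(struct example_queue *q)
{
	int timeout_ms = 100;		/* arbitrary bound for the sketch */

	/*
	 * Only the first of two concurrent timeout handlers gets past this
	 * check; the second returns immediately and its I/Os are failed
	 * while the socket may still hold untransmitted zero-copy data.
	 */
	if (!test_and_clear_bit(EXAMPLE_Q_LIVE, &q->flags))
		return;

	kernel_sock_shutdown(q->sock, SHUT_RDWR);

	/* Drain wait reached only by the caller that won the race above. */
	while (sk_wmem_alloc_get(q->sock->sk) && timeout_ms > 0) {
		msleep(2);
		timeout_ms -= 2;
	}
}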

To address this race, we've modified the logic to always wait during queue
stop, regardless of the queue state. This resolves the issue. We'll post an
updated patch shortly.
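
For comparison, an "always wait" variant of the same hypothetical sketch
(types repeated so it stands on its own) keeps the liveness bit for the
shutdown decision but takes every caller through the drain wait:

#include <linux/bitops.h>
#include <linux/delay.h>
#include <linux/net.h>
#include <net/sock.h>

enum { EXAMPLE_Q_LIVE = 0 };		/* hypothetical per-queue flag bit */

struct example_queue {			/* hypothetical, for illustration */
	unsigned long	flags;
	struct socket	*sock;
};

static void example_stop_queue_always_wait(struct example_queue *q)
{
	int timeout_ms = 100;		/* arbitrary bound for the sketch */

	/* The liveness bit still decides who shuts the socket down ... */
	if (test_and_clear_bit(EXAMPLE_Q_LIVE, &q->flags))
		kernel_sock_shutdown(q->sock, SHUT_RDWR);

	/*
	 * ... but every caller waits here, regardless of which timeout
	 * handler won the race above, so nobody fails I/Os while the kernel
	 * socket still reports pending transmit memory.
	 */
	while (sk_wmem_alloc_get(q->sock->sk) && timeout_ms > 0) {
		msleep(2);
		timeout_ms -= 2;
	}
}

How long such a wait should be bounded, and what to do if the bound expires,
is a separate question for the actual patch; the sketch only illustrates the
unconditional-wait idea described above.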

Thanks,
Michael Liang
