Message-ID: <20250417002936.7ezg2dwm44l7xblm@xldev1604-tmpl.dev.purestorage.com>
Date: Wed, 16 Apr 2025 18:29:36 -0600
From: Michael Liang <mliang@...estorage.com>
To: Sagi Grimberg <sagi@...mberg.me>
Cc: Keith Busch <kbusch@...nel.org>, Jens Axboe <axboe@...nel.dk>,
Christoph Hellwig <hch@....de>,
Mohamed Khalfella <mkhalfella@...estorage.com>,
Randy Jennings <randyj@...estorage.com>,
linux-nvme@...ts.infradead.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] nvme-tcp: wait socket wmem to drain in queue stop
On Mon, Apr 14, 2025 at 01:25:05AM +0300, Sagi Grimberg wrote:
>
>
> On 05/04/2025 8:48, Michael Liang wrote:
> > This patch addresses a data corruption issue observed in nvme-tcp during
> > testing.
> >
> > Issue description:
> > In an NVMe native multipath setup, when an I/O timeout occurs, all inflight
> > I/Os are canceled almost immediately after the kernel socket is shut down.
> > These canceled I/Os are reported as host path errors, triggering a failover
> > that succeeds on a different path.
> >
> > However, at this point, the original I/O may still be outstanding in the
> > host's network transmission path (e.g., the NIC's TX queue). From the
> > user-space app's perspective, the buffer associated with the I/O is
> > considered complete once the retried I/O is acknowledged on the other
> > path, and may be reused for new I/O requests.
> >
> > Because nvme-tcp enables zero-copy by default in the transmission path,
> > this can lead to corrupted data being sent to the original target, ultimately
> > causing data corruption.
>
> This is unexpected.
>
> 1. before retrying the command, the host shuts down the socket.
> 2. the host sets sk_lingertime to 0, which means that as soon as the
> socket is shut down, no packet should be transmitted on the socket
> again, zero-copy or not. Perhaps there is something not handled
> correctly with linger=0? Perhaps you should try with
> linger=<some-timeout>?
I did notice that the linger time is explicitly set to 0 in nvme-tcp, but it
doesn't behave as expected in this case, for two main reasons:

1. We invoke socket shutdown, not socket close. Shutdown goes through
tcp_shutdown() in net/ipv4/tcp.c, which changes the socket state and may
send a FIN if needed, but it does not consider the linger setting at all
(see the sketch after this list);
2. Furthermore, while tcp_close() does check the linger time, we
experimented with closing the socket instead of shutting it down, and the
same data corruption persisted.
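
For reference, here is a rough sketch of the relevant call pattern as I
read it -- the helper names below are mine and the placement is
approximate, not verbatim nvme-tcp code; it is only meant to show where
linger does (and does not) get consulted:

#include <linux/net.h>
#include <net/sock.h>

/* Sketch only: helper names are made up for illustration. */
static void sketch_setup_linger(struct socket *sock)
{
	/* nvme-tcp requests linger=0 when the queue socket is set up. */
	sock_no_linger(sock->sk);	/* SOCK_LINGER set, sk_lingertime = 0 */
}

static void sketch_stop_queue(struct socket *sock)
{
	/*
	 * Queue stop uses shutdown, which ends up in tcp_shutdown():
	 * it moves the socket state and may emit a FIN, but it never
	 * looks at SO_LINGER. Only the tcp_close() path consults
	 * sk_lingertime.
	 */
	kernel_sock_shutdown(sock, SHUT_RDWR);
}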

The root cause is that once data has been handed off to the lower-level
device driver for transmission, neither socket shutdown nor close can
cancel it. With further tracing, I saw that the socket may only be freed a
while after close, once the NIC releases the outstanding TX skbs.
sk_wmem_alloc is what tracks the data still outstanding in the lower
layers, so queue stop needs to wait for it to drop to zero.
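
A minimal sketch of what that wait could look like (assuming a socket
pointer like nvme-tcp's queue->sock; the timeout bound and sleep interval
below are arbitrary illustrations, not what the patch uses):

#include <linux/delay.h>
#include <net/sock.h>

static void sketch_wait_wmem_drained(struct socket *sock)
{
	int timeout_ms = 1000;	/* arbitrary upper bound for the sketch */

	/*
	 * sk_wmem_alloc accounts for the skbs queued below us until
	 * their destructors run, so it only drops back to zero once the
	 * NIC has released the outstanding TX skbs.
	 */
	while (sk_wmem_alloc_get(sock->sk) && timeout_ms > 0) {
		msleep(2);
		timeout_ms -= 2;
	}
}

Only once this has drained is it safe to complete the canceled I/O up the
stack and let the user buffer be reused.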