Message-ID: <CANn89iKfLGRaa+GSgaXAmroPG7fu0S_Bb0KnBUKsdqEwBjj6Aw@mail.gmail.com>
Date: Mon, 24 Oct 2022 17:21:49 -0700
From: Eric Dumazet <edumazet@...gle.com>
To: Kamaljit Singh <Kamaljit.Singh1@....com>
Cc: "yoshfuji@...ux-ipv6.org" <yoshfuji@...ux-ipv6.org>,
Niklas Cassel <Niklas.Cassel@....com>,
Damien Le Moal <Damien.LeMoal@....com>,
"davem@...emloft.net" <davem@...emloft.net>,
"kuba@...nel.org" <kuba@...nel.org>,
"pabeni@...hat.com" <pabeni@...hat.com>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: Re: [PATCH v1 1/2] tcp: Fix for stale host ACK when tgt window shrunk
On Mon, Oct 24, 2022 at 3:07 PM Kamaljit Singh <Kamaljit.Singh1@....com> wrote:
>
> Hi Eric,
>
> Please find my inline responses below.
>
> Thanks,
> Kamaljit
>
>
> On Thu, 2022-10-20 at 19:48 -0700, Eric Dumazet wrote:
> > CAUTION: This email originated from outside of Western Digital. Do not click
> > on links or open attachments unless you recognize the sender and know that the
> > content is safe.
> >
> >
> > On Thu, Oct 20, 2022 at 6:01 PM Kamaljit Singh <Kamaljit.Singh1@....com>
> > wrote:
> > > On Thu, 2022-10-20 at 13:45 -0700, Eric Dumazet wrote:
> > > >
> > > > On Thu, Oct 20, 2022 at 11:22 AM Kamaljit Singh <kamaljit.singh1@....com>
> > > > wrote:
> > > > > Under certain congestion conditions, an NVMe/TCP target may be
> > > > > configured
> > > > > to shrink the TCP window in an effort to slow the sender down prior to
> > > > > issuing a more drastic L2 pause or PFC indication. Although the TCP
> > > > > standard discourages implementations from shrinking the TCP window, it
> > > > > also
> > > > > states that TCP implementations must be robust to this occurring. The
> > > > > current Linux TCP layer (in conjunction with the NVMe/TCP host driver)
> > > > > has
> > > > > an issue when the TCP window is shrunk by a target, which causes ACK
> > > > > frames
> > > > > to be transmitted with a “stale” SEQ_NUM or for data frames to be
> > > > > retransmitted by the host.
> > > >
> > > > Linux sends ACK packets, with a legal SEQ number.
> > > >
> > > > The issue is the receiver of such packets, right ?
> > > Not exactly. In certain conditions the ACK pkt being sent by the NVMe/TCP
> > > initiator has an incorrect SEQ-NUM.
> > >
> > > I've attached a .pcapng Network trace for Wireshark. This captures a small
> > > snippet of 4K Writes from 10.10.11.151 to a target at 10.10.11.12 (using
> > > fio).
> > > As you see pkt #2 shows a SEQ-NUM 4097, which is repeated in ACK pkt #12
> > > from
> > > the initiator. This happens right after the target closes the TCP window
> > > (pkts
> > > #7, #8). Pkt #12 should've used a SEQ-NUM of 13033 in continuation from pkt
> > > #11.
> > >
> > > This patch addresses the above scenario (tp->snd_wnd=0) and returns the
> > > correct
> > > SEQ-NUM that is based on tp->snd_nxt. Without this patch the last else path
> > > was
> > > returning tcp_wnd_end(tp), which sent the stale SEQ-NUM.
> > >
> > > Initiator Environment:
> > > - NVMe-oF Initiator: drivers/nvme/host/tcp.c
> > > - NIC driver: mlx5_core (Mellanox, 100G), IP addr 10.10.11.151
> > > - Ubuntu 20.04 LTS, Kernel 5.19.0-rc7 (with above patches 1 & 2 only)
> > >
> > >
> > > > Because as you said receivers should be relaxed about this, especially
> > > > if _they_ decided
> > > > to not respect the TCP standards.
> > > >
> > > > You are proposing to send old ACK, that might be dropped by other stacks.
> > > > On the contrary, I'm proposing to use the expected/correct ACK based on
> > > > tp->snd_nxt.
> >
> > Please take a look at the very lengthy comment at the front of the function.
> >
> > Basically we are in a mode where a value needs to be chosen, and we do
> > not really know which one
> > will be accepted by the buggy peer.
> >
> I'm pasting the source code comment you're referring to here. You're right that
> the comment is very relevant in this case as the TCP window is being shrunk,
> although, I'd politely argue that it's a design choice rather than a bug in our
> hardware target implementation.
>
> /* SND.NXT, if window was not shrunk or the amount of shrunk was less than one
> * window scaling factor due to loss of precision.
> * If window has been shrunk, what should we make? It is not clear at all.
> * Using SND.UNA we will fail to open window, SND.NXT is out of window. :-(
> * Anything in between SND.UNA...SND.UNA+SND.WND also can be already
> * invalid. OK, let's make this for now:
> */
>
> Below, I'm also pasting a plain-text version of the .pcapng, provided earlier as
> an email attachment. Hopefully this makes it easier to refer to the packets as
> you read through my comments. I had to massage the formatting to fit it in this
> email. Data remains the same except for AckNum for pkt#3 which referred to an
> older packet and threw off the formatting.
>
> Initiator = 10.10.11.151 (aka NVMe/TCP host)
> Target = 10.10.11.12
>
> No.  Time         Src IP        Proto     Len   SeqNum  AckNum  WinSize
> 1    0.000000000  10.10.11.151  TCP       4154  1       1       25
> 2    0.000000668  10.10.11.151  TCP       4154  4097    1       25
> 3    0.000039250  10.10.11.12   TCP       64    1       x       16384
> 4    0.000040064  10.10.11.12   TCP       64    1       1       16384
> 5    0.000040951  10.10.11.12   NVMe/TCP  82    1       1       16384
> 6    0.000041009  10.10.11.12   NVMe/TCP  82    25      1       16384
> 7    0.000059422  10.10.11.12   TCP       64    49      4097    0
> 8    0.000060059  10.10.11.12   TCP       64    49      8193    0
> 9    0.000072519  10.10.11.12   TCP       64    49      8193    16384
> 10   0.000074756  10.10.11.151  TCP       4154  8193    1       25
> 11   0.000075089  10.10.11.151  TCP       802   12289   1       25
> 12   0.000089454  10.10.11.151  TCP       64    4097    49      25
> 13   0.000102225  10.10.11.151  TCP       4154  13033   49      25
> 14   0.000102567  10.10.11.151  TCP       4154  17129   49      25
> 15   0.000140273  10.10.11.12   TCP       64    49      13033   16384
> 16   0.000157344  10.10.11.151  TCP       106   21225   49      25
> 17   0.000158580  10.10.11.12   TCP       64    49      13033   0
>
> Packets #7 and #8: Target shrinks window to zero for congestion control
> Packet #9: ~12us later Target expands window back to 16384
>
> [Packet #12] is an ACK packet from the Initiator. Since it does not send data,
> window shrinking should not affect its SEQ-NUM here. Hence, it's probably safe to
> send SND.NXT, as in my patch. In other words, TCP window should be relevant to
> data packets and not to ACK packets. Would you agree?
No, I do not agree.
See the packetdrill test at the end of this email, demonstrating that your
patch would add extra work (a challenge ACK).
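
For readers following along, the sender-side choice being debated can be
modeled in a few lines. The sketch below is Python, not kernel code. It follows
the rule in the source comment quoted above: use SND.NXT unless the window has
shrunk by at least one window-scaling unit, otherwise fall back to
SND.UNA + SND.WND (what the thread calls tcp_wnd_end(tp)). The function and
parameter names are made up for illustration, and the example state
(snd_una=4097, snd_wnd=0, snd_nxt=13033) is an assumption read off the trace.

```python
MOD = 1 << 32  # TCP sequence numbers wrap modulo 2**32


def before(a: int, b: int) -> bool:
    """True if sequence number a is earlier than b, modulo 2**32."""
    return ((a - b) % MOD) >= (1 << 31)


def acceptable_seq(snd_una: int, snd_wnd: int, snd_nxt: int,
                   wscale_ok: bool = False, rcv_wscale: int = 0) -> int:
    """Pick the sequence number for a pure ACK (hypothetical model)."""
    wnd_end = (snd_una + snd_wnd) % MOD  # right edge of the peer's window
    # Window not shrunk below SND.NXT: SND.NXT is safe to use.
    if not before(wnd_end, snd_nxt):
        return snd_nxt
    # Shrunk by less than one scaling unit: treat as loss of precision.
    if wscale_ok and ((snd_nxt - wnd_end) % MOD) < (1 << rcv_wscale):
        return snd_nxt
    # Window really was shrunk: fall back to the right edge of the window.
    return wnd_end


# Assumed state while the target advertises a zero window.
print(acceptable_seq(snd_una=4097, snd_wnd=0, snd_nxt=13033))      # 4097
# Same state once a 16384-byte window has been advertised again.
print(acceptable_seq(snd_una=4097, snd_wnd=16384, snd_nxt=13033))  # 13033
```

Under these assumed values the model reproduces SEQ-NUM 4097 for a pure ACK
built while the window is closed, and SND.NXT once the window reopens.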
>
> [Referring to the SND pointers diagram at this URL
> http://www.tcpipguide.com/free/t_TCPSlidingWindowDataTransferandAcknowledgementMech-2.htm]
>
> This unexpected behavior by the Initiator causes our hardware offloaded Target
> to hand-off control to Firmware slow-path.
This is the bug.
This packet is perfectly normal and should not cause a problem with an offload.
Please contact the vendor to fix this issue.
> If we can keep everything in the
> hardware (fast) path and not invoke Firmware that's when we have the best chance
> of optimal performance.
>
> Running heavy workloads at 100G link rate there can be a million instances of
> such behavior as packet #12 is exhibiting and is very disruptive to fast-path.
>
>
> > You are changing something that has been there forever, risking
> > breaking many other stacks, and/or middleboxes.
> >
> Regardless of how we handle this patch, the fact remains that for any other
> hardware based TCP offloads existing elsewhere they will have the same
> susceptibility as a result of this Linux TCP behavior, even if their congestion
> control mechanism does not match this scenario.
>
> Being fully aware of how ubiquitous the TCP layer is, we tried ways to avoid changing
> it. Early on, in drivers/nvme/host/tcp.c we had even tried
> tcp_sock_set_quickack() instead of PATCH 2/2 but it did not help. If you can
> suggest a better way that could de-risk existing deployments, I'd be more than
> happy to discuss this further and try other solutions. An example could be a TCP
> socket option that would stay disabled by default. We could then use it in the
> NVMe/TCP driver or a userspace accessible method.
>
> > It seems the remote TCP stack is quite buggy, I do not think we want
> > to change something which has never been an issue until today in 2022.
> >
> IMHO, I wouldn't quite characterize it that way. It was a design choice and
> provides one of multiple ways of handling network congestion. It may also be
> possible that there are other implementations affected by this issue.
>
>
> > Also note that packet #13, sent immediately after the ACK is carrying
> > whatever needed values.
> > I do not see why the prior packet (#12) would matter.
> >
> > Please elaborate.
> Packet #13, however, is a data packet sent by the Initiator. This is in direct
> reaction to packet #9 from the Target that expanded the window size back to 16K.
> Even though it correctly uses SND.NXT it does not affect the handling for packet
> #12 in our hardware Target.
>
>
Here is a packetdrill test showing the problem you are adding, since
it seems it is not clear to you.
0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 setsockopt(3, SOL_SOCKET, SO_RCVBUF, [16384], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0
+0 < S 0:0(0) win 2920 <mss 4096,sackOK,nop,nop>
+0 > S. 0:0(0) ack 1 <mss 4096,nop,nop,sackOK>
+.1 < . 1:1(0) ack 1 win 16384
+0 accept(3, ..., ...) = 4
// Note: linux will not shrink the window...
// I think this would require some patch, to emulate a buggy stack
+0 setsockopt(4, SOL_SOCKET, SO_RCVBUF, [1000], 4) = 0
+0 < P. 1:16385(16384) ack 1 win 150
+0 write(4, ..., 100) = 100
+0 > P. 1:101(100) ack 16385 win 0
// OK, what if the ACK carries a sequence in the future ?
// This could happen if the peer sent 18000 bytes while our window was 16384,
// and if the patch from kamaljit.singh1@....com were accepted...
+.1 < . 18001:18001(0) ack 101 win 1000
// Too bad, prior ACK has a sequence in the future
// We send a challenge ACK in an attempt to fix the synchronization issue.
// This would be avoided completely if prior ack was
// "16385:16385(0) ack 101 win 1000"
+0 > . 101:101(0) ack 16385 win 0
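
The receiving side of the test above can be sketched in the same spirit. What
follows is a minimal Python model of the segment acceptability test from the
TCP specification (RFC 793, carried over into RFC 9293). It is not the Linux
implementation, whose checks differ in detail, and the function names are
hypothetical. The values are taken from the script: the local stack has
RCV.NXT = 16385 and advertises a zero window.

```python
MOD = 1 << 32  # TCP sequence numbers wrap modulo 2**32


def in_window(seq: int, rcv_nxt: int, rcv_wnd: int) -> bool:
    """True if RCV.NXT <= seq < RCV.NXT + RCV.WND, modulo 2**32."""
    return ((seq - rcv_nxt) % MOD) < rcv_wnd


def segment_acceptable(seg_seq: int, seg_len: int,
                       rcv_nxt: int, rcv_wnd: int) -> bool:
    """RFC 793 acceptability test for an incoming segment."""
    if seg_len == 0:
        if rcv_wnd == 0:
            # Zero-length segment, zero window: must sit exactly at RCV.NXT.
            return seg_seq == rcv_nxt
        return in_window(seg_seq, rcv_nxt, rcv_wnd)
    if rcv_wnd == 0:
        # Data can never be accepted while the window is closed.
        return False
    last = (seg_seq + seg_len - 1) % MOD
    return (in_window(seg_seq, rcv_nxt, rcv_wnd)
            or in_window(last, rcv_nxt, rcv_wnd))


# Pure ACK carrying a sequence "in the future", as in the test above.
print(segment_acceptable(18001, 0, rcv_nxt=16385, rcv_wnd=0))  # False
# Pure ACK carrying the right edge of the window the peer was given.
print(segment_acceptable(16385, 0, rcv_nxt=16385, rcv_wnd=0))  # True
```

Per the specification, a segment that fails this test is answered with an ACK,
which corresponds to the extra packet on the last line of the script; the ACK
sequenced at 16385 passes the test and needs no such reply.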