netdev - Re: [PATCH v1 1/2] tcp: Fix for stale host ACK when tgt window shrunk

Message-ID: <CANn89iKfLGRaa+GSgaXAmroPG7fu0S_Bb0KnBUKsdqEwBjj6Aw@mail.gmail.com>
Date:   Mon, 24 Oct 2022 17:21:49 -0700
From:   Eric Dumazet <edumazet@...gle.com>
To:     Kamaljit Singh <Kamaljit.Singh1@....com>
Cc:     "yoshfuji@...ux-ipv6.org" <yoshfuji@...ux-ipv6.org>,
        Niklas Cassel <Niklas.Cassel@....com>,
        Damien Le Moal <Damien.LeMoal@....com>,
        "davem@...emloft.net" <davem@...emloft.net>,
        "kuba@...nel.org" <kuba@...nel.org>,
        "pabeni@...hat.com" <pabeni@...hat.com>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: Re: [PATCH v1 1/2] tcp: Fix for stale host ACK when tgt window shrunk

On Mon, Oct 24, 2022 at 3:07 PM Kamaljit Singh <Kamaljit.Singh1@....com> wrote:
>
> Hi Eric,
>
> Please find my inline responses below.
>
> Thanks,
> Kamaljit
>
>
> On Thu, 2022-10-20 at 19:48 -0700, Eric Dumazet wrote:
> > On Thu, Oct 20, 2022 at 6:01 PM Kamaljit Singh <Kamaljit.Singh1@....com>
> > wrote:
> > > On Thu, 2022-10-20 at 13:45 -0700, Eric Dumazet wrote:
> > > > On Thu, Oct 20, 2022 at 11:22 AM Kamaljit Singh <kamaljit.singh1@....com>
> > > > wrote:
> > > > > Under certain congestion conditions, an NVMe/TCP target may be
> > > > > configured
> > > > > to shrink the TCP window in an effort to slow the sender down prior to
> > > > > issuing a more drastic L2 pause or PFC indication.  Although the TCP
> > > > > standard discourages implementations from shrinking the TCP window, it
> > > > > also
> > > > > states that TCP implementations must be robust to this occurring. The
> > > > > current Linux TCP layer (in conjunction with the NVMe/TCP host driver)
> > > > > has
> > > > > an issue when the TCP window is shrunk by a target, which causes ACK
> > > > > frames
> > > > > to be transmitted with a “stale” SEQ_NUM or for data frames to be
> > > > > retransmitted by the host.
> > > >
> > > > Linux sends ACK packets, with a legal SEQ number.
> > > >
> > > > The issue is the receiver of such packets, right ?
> > > Not exactly. In certain conditions the ACK pkt being sent by the NVMe/TCP
> > > initiator has an incorrect SEQ-NUM.
> > >
> > > I've attached a .pcapng Network trace for Wireshark. This captures a small
> > > snippet of 4K Writes from 10.10.11.151 to a target at 10.10.11.12 (using
> > > fio).
> > > As you see pkt #2 shows a SEQ-NUM 4097, which is repeated in ACK pkt #12
> > > from
> > > the initiator. This happens right after the target closes the TCP window
> > > (pkts
> > > #7, #8). Pkt #12 should've used a SEQ-NUM of 13033 in continuation from pkt
> > > #11.
> > >
> > > This patch addresses the above scenario (tp->snd_wnd=0) and returns the
> > > correct
> > > SEQ-NUM that is based on tp->snd_nxt. Without this patch the last else path
> > > was
> > > returning tcp_wnd_end(tp), which sent the stale SEQ-NUM.
> > >
> > > Initiator Environment:
> > > - NVMe-oF Initiator: drivers/nvme/host/tcp.c
> > > - NIC driver: mlx5_core (Mellanox, 100G), IP addr 10.10.11.151
> > > - Ubuntu 20.04 LTS, Kernel 5.19.0-rc7 (with above patches 1 & 2 only)
> > >
> > >
> > > > Because as you said receivers should be relaxed about this, especially
> > > > if _they_ decided
> > > > to not respect the TCP standards.
> > > >
> > > > You are proposing to send old ACK, that might be dropped by other stacks.
> > > On the contrary, I'm proposing to use the expected/correct ACK based on
> > > tp->snd_nxt.
> >
> > Please take a look at the very lengthy comment at the front of the function.
> >
> > Basically we are in a mode where a value needs to be chosen, and we do
> > not really know which one
> > will be accepted by the buggy peer.
> >
> I'm pasting the source code comment you're referring to here. You're right that
> the comment is very relevant in this case as the TCP window is being shrunk,
> although, I'd politely argue that it's a design choice rather than a bug in our
> hardware target implementation.
>
> /* SND.NXT, if window was not shrunk or the amount of shrunk was less than one
>  * window scaling factor due to loss of precision.
>  * If window has been shrunk, what should we make? It is not clear at all.
>  * Using SND.UNA we will fail to open window, SND.NXT is out of window. :-(
>  * Anything in between SND.UNA...SND.UNA+SND.WND also can be already
>  * invalid. OK, let's make this for now:
>  */
>
> Below, I'm also pasting a plain-text version of the .pcapng, provided earlier as
> an email attachment. Hopefully this makes it easier to refer to the packets as
> you read through my comments. I had to massage the formatting to fit it in this
> email. Data remains the same except for AckNum for pkt#3 which referred to an
> older packet and threw off the formatting.
>
> Initiator = 10.10.11.151 (aka NVMe/TCP host)
> Target = 10.10.11.12
>
> No. Time        Src IP          Proto    Len    SeqNum  AckNum  WinSize
> 1   0.000000000 10.10.11.151    TCP      4154   1       1       25
> 2   0.000000668 10.10.11.151    TCP      4154   4097    1       25
> 3   0.000039250 10.10.11.12     TCP      64     1       x       16384
> 4   0.000040064 10.10.11.12     TCP      64     1       1       16384
> 5   0.000040951 10.10.11.12     NVMe/TCP 82     1       1       16384
> 6   0.000041009 10.10.11.12     NVMe/TCP 82     25      1       16384
> 7   0.000059422 10.10.11.12     TCP      64     49      4097    0
> 8   0.000060059 10.10.11.12     TCP      64     49      8193    0
> 9   0.000072519 10.10.11.12     TCP      64     49      8193    16384
> 10  0.000074756 10.10.11.151    TCP      4154   8193    1       25
> 11  0.000075089 10.10.11.151    TCP      802    12289   1       25
> 12  0.000089454 10.10.11.151    TCP      64     4097    49      25
> 13  0.000102225 10.10.11.151    TCP      4154   13033   49      25
> 14  0.000102567 10.10.11.151    TCP      4154   17129   49      25
> 15  0.000140273 10.10.11.12     TCP      64     49      13033   16384
> 16  0.000157344 10.10.11.151    TCP      106    21225   49      25
> 17  0.000158580 10.10.11.12     TCP      64     49      13033   0
>
> Packets #7 and #8: Target shrinks window to zero for congestion control
> Packet #9: ~12us later Target expands window back to 16384
>
> [Packet #12] is an ACK packet from the Initiator. Since it does not send data,
> window shrinking should not affect its SEQ-NUM here. Hence, it's probably safe to
> send SND.NXT, as in my patch. In other words, TCP window should be relevant to
> data packets and not to ACK packets. Would you agree?

No, I do not agree.

See the packetdrill test at the end of this email, which demonstrates that
your patch would add extra work (a challenge ACK).
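
For readers following along, the sender-side choice under discussion can be modelled in a few lines of Python. This is only a simplified sketch of the logic as described in this thread, not the kernel source: it ignores the window-scaling tolerance mentioned in the code comment and 32-bit sequence wraparound. The function names, the exact shape of the patched branch, and the sample values (SND.UNA=4097, SND.WND=0, SND.NXT=13033, loosely based on the trace) are assumptions made for illustration.

```python
def wnd_end(snd_una: int, snd_wnd: int) -> int:
    """Right edge of the window advertised by the peer: SND.UNA + SND.WND."""
    return snd_una + snd_wnd


def acceptable_seq_current(snd_una: int, snd_wnd: int, snd_nxt: int) -> int:
    """Simplified model of the existing behaviour described in the thread.

    Use SND.NXT unless the peer shrank its window below it, in which case
    fall back to the right edge of the (shrunk) window.
    """
    edge = wnd_end(snd_una, snd_wnd)
    if edge >= snd_nxt:
        return snd_nxt
    return edge


def acceptable_seq_patched(snd_una: int, snd_wnd: int, snd_nxt: int) -> int:
    """Assumed model of the proposal: with a zero window, always use SND.NXT."""
    if snd_wnd == 0:
        return snd_nxt
    return acceptable_seq_current(snd_una, snd_wnd, snd_nxt)


if __name__ == "__main__":
    # Illustrative state: peer acked up to 4097, then advertised window 0,
    # while the sender had already transmitted up to 13033.
    print(acceptable_seq_current(4097, 0, 13033))
    print(acceptable_seq_patched(4097, 0, 13033))
```

With a shrunk window the two variants pick different sequence numbers for a pure ACK, and neither is guaranteed to be accepted by every peer, which is the point of the code comment quoted below.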

>
> [Referring to the SND pointers diagram at this URL
> http://www.tcpipguide.com/free/t_TCPSlidingWindowDataTransferandAcknowledgementMech-2.htm]
>
> This unexpected behavior by the Initiator causes our hardware offloaded Target
> to hand-off control to Firmware slow-path.

This is the bug.

This packet is perfectly normal and should not cause a problem with an offload.

Please contact the vendor to fix this issue.

> If we can keep everything in the
> hardware (fast) path and not invoke Firmware that's when we have the best chance
> of optimal performance.
>
> Running heavy workloads at 100G link rate there can be a million instances of
> such behavior as packet #12 is exhibiting and is very disruptive to fast-path.
>
>
> > You are changing something that has been there forever, risking
> > breaking many other stacks, and/or middleboxes.
> >
> Regardless of how we handle this patch, the fact remains that for any other
> hardware based TCP offloads existing elsewhere they will have the same
> susceptibility as a result of this Linux TCP behavior, even if their congestion
> control mechanism does not match this scenario.
>
> Being fully aware of how ubiquitous TCP layer is we tried ways to avoid changing
> it. Early on, in drivers/nvme/host/tcp.c we had even tried
> tcp_sock_set_quickack() instead of PATCH 2/2 but it did not help. If you can
> suggest a better way that could de-risk existing deployments, I'd be more than
> happy to discuss this further and try other solutions. An example could be a TCP
> socket option that would stay disabled by default. We could then use it in the
> NVMe/TCP driver or a userspace accessible method.
>
> > It seems the remote TCP stack is quite buggy, I do not think we want
> > to change something which has never been an issue until today in 2022.
> >
> IMHO, I wouldn't quite characterize it that way. It was a design choice and
> provides one of multiple ways of handling network congestion. It may also be
> possible that there are other implementations affected by this issue.
>
>
> > Also note that packet #13, sent immediately after the ACK is carrying
> > whatever needed values.
> > I do not see why the prior packet (#12) would matter.
> >
> > Please elaborate.
> Packet #13, however, is a data packet sent by the Initiator. This is in direct
> reaction to packet #9 from the Target that expanded the window size back to 16K.
> Even though it correctly uses SND.NXT it does not affect the handling for packet
> #12 in our hardware Target.
>
>

Here is a packetdrill test showing the problem you are adding, since
it seems it is not clear to you.

    0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
   +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
   +0 setsockopt(3, SOL_SOCKET, SO_RCVBUF, [16384], 4) = 0

   +0 bind(3, ..., ...) = 0
   +0 listen(3, 1) = 0

   +0 < S 0:0(0) win 2920 <mss 4096,sackOK,nop,nop>
   +0 > S. 0:0(0) ack 1 <mss 4096,nop,nop,sackOK>
  +.1 < . 1:1(0) ack 1 win 16384
   +0 accept(3, ..., ...) = 4

// Note: linux will not shrink the window...
// I think this would require some patch, to emulate a buggy stack
   +0 setsockopt(4, SOL_SOCKET, SO_RCVBUF, [1000], 4) = 0

   +0 < P. 1:16385(16384) ack 1 win 150
   +0 write(4, ..., 100) = 100
   +0 > P. 1:101(100) ack 16385 win 0

// OK, what if the ACK carries a sequence in the future ?
// This could happen if the peer sent 18000 bytes while our window was > 16384
// and if kamaljit.singh1@....com patch would be accepted...

   +.1 < . 18001:18001(0) ack 101 win 1000

// Too bad, prior ACK has a sequence in the future
// We send a challenge ACK in an attempt to fix the synchronization issue.
// This would be avoided completely if prior ack was
// "16385:16385(0) ack 101 win 1000"
  +0  > . 101:101(0) ack 16385 win 0
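
The receiver-side check behind this test can be sketched in Python. This is a minimal model of the segment acceptability test from the TCP specification (RFC 9293, section 3.10.7.4), not the Linux implementation; the helper names are hypothetical, and the values RCV.NXT=16385 and RCV.WND=0 are taken from the packetdrill script above.

```python
MOD = 1 << 32  # TCP sequence numbers live in a 32-bit space


def in_window(seq: int, rcv_nxt: int, rcv_wnd: int) -> bool:
    """True if RCV.NXT <= seq < RCV.NXT + RCV.WND, modulo 2**32."""
    return ((seq - rcv_nxt) % MOD) < rcv_wnd


def segment_acceptable(seg_seq: int, seg_len: int,
                       rcv_nxt: int, rcv_wnd: int) -> bool:
    """Textbook acceptability test for an incoming segment."""
    if seg_len == 0:
        if rcv_wnd == 0:
            # Zero-length segment, zero window: only RCV.NXT itself is fine.
            return seg_seq == rcv_nxt
        return in_window(seg_seq, rcv_nxt, rcv_wnd)
    if rcv_wnd == 0:
        # Data can never be accepted into a closed window.
        return False
    last = (seg_seq + seg_len - 1) % MOD
    return (in_window(seg_seq, rcv_nxt, rcv_wnd)
            or in_window(last, rcv_nxt, rcv_wnd))


if __name__ == "__main__":
    # Pure ACK with a sequence in the future while our window is closed.
    print(segment_acceptable(18001, 0, rcv_nxt=16385, rcv_wnd=0))
    # Pure ACK carrying exactly RCV.NXT.
    print(segment_acceptable(16385, 0, rcv_nxt=16385, rcv_wnd=0))
```

Under this model the ACK with sequence 18001 fails the check while one with sequence 16385 passes, which matches the extra ACK the script expects in the first case.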