Date:   Mon, 24 Oct 2022 22:07:22 +0000
From:   Kamaljit Singh <Kamaljit.Singh1@....com>
To:     "edumazet@...gle.com" <edumazet@...gle.com>
CC:     "yoshfuji@...ux-ipv6.org" <yoshfuji@...ux-ipv6.org>,
        Niklas Cassel <Niklas.Cassel@....com>,
        Damien Le Moal <Damien.LeMoal@....com>,
        "davem@...emloft.net" <davem@...emloft.net>,
        "kuba@...nel.org" <kuba@...nel.org>,
        "pabeni@...hat.com" <pabeni@...hat.com>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: Re: [PATCH v1 1/2] tcp: Fix for stale host ACK when tgt window shrunk

Hi Eric,

Please find my inline responses below.

Thanks,
Kamaljit


On Thu, 2022-10-20 at 19:48 -0700, Eric Dumazet wrote:
> On Thu, Oct 20, 2022 at 6:01 PM Kamaljit Singh <Kamaljit.Singh1@....com>
> wrote:
> > On Thu, 2022-10-20 at 13:45 -0700, Eric Dumazet wrote:
> > > On Thu, Oct 20, 2022 at 11:22 AM Kamaljit Singh <kamaljit.singh1@....com>
> > > wrote:
> > > > Under certain congestion conditions, an NVMe/TCP target may be configured
> > > > to shrink the TCP window in an effort to slow the sender down prior to
> > > > issuing a more drastic L2 pause or PFC indication.  Although the TCP
> > > > standard discourages implementations from shrinking the TCP window, it also
> > > > states that TCP implementations must be robust to this occurring. The
> > > > current Linux TCP layer (in conjunction with the NVMe/TCP host driver) has
> > > > an issue when the TCP window is shrunk by a target, which causes ACK frames
> > > > to be transmitted with a “stale” SEQ_NUM or for data frames to be
> > > > retransmitted by the host.
> > > 
> > > Linux sends ACK packets, with a legal SEQ number.
> > > 
> > > The issue is the receiver of such packets, right ?
> > Not exactly. In certain conditions the ACK pkt being sent by the NVMe/TCP
> > initiator has an incorrect SEQ-NUM.
> > 
> > I've attached a .pcapng Network trace for Wireshark. This captures a small
> > snippet of 4K Writes from 10.10.11.151 to a target at 10.10.11.12 (using fio).
> > As you see pkt #2 shows a SEQ-NUM 4097, which is repeated in ACK pkt #12 from
> > the initiator. This happens right after the target closes the TCP window
> > (pkts #7, #8). Pkt #12 should've used a SEQ-NUM of 13033 in continuation from
> > pkt #11.
> > 
> > This patch addresses the above scenario (tp->snd_wnd=0) and returns the
> > correct SEQ-NUM that is based on tp->snd_nxt. Without this patch the last
> > else path was returning tcp_wnd_end(tp), which sent the stale SEQ-NUM.
> > 
> > Initiator Environment:
> > - NVMe-oF Initiator: drivers/nvme/host/tcp.c
> > - NIC driver: mlx5_core (Mellanox, 100G), IP addr 10.10.11.151
> > - Ubuntu 20.04 LTS, Kernel 5.19.0-rc7 (with above patches 1 & 2 only)
> > 
> > 
> > > Because as you said receivers should be relaxed about this, especially
> > > if _they_ decided to not respect the TCP standards.
> > > 
> > > You are proposing to send old ACK, that might be dropped by other stacks.
> > On the contrary, I'm proposing to use the expected/correct ACK based on
> > tp->snd_nxt.
> 
> Please take a look at the very lengthy comment at the front of the function.
> 
> Basically we are in a mode where a value needs to be chosen, and we do
> not really know which one
> will be accepted by the buggy peer.
> 
I'm pasting the source code comment you're referring to here. You're right that
the comment is very relevant in this case, as the TCP window is being shrunk,
although I'd politely argue that it's a design choice rather than a bug in our
hardware target implementation.

/* SND.NXT, if window was not shrunk or the amount of shrunk was less than one
 * window scaling factor due to loss of precision.
 * If window has been shrunk, what should we make? It is not clear at all.
 * Using SND.UNA we will fail to open window, SND.NXT is out of window. :-(
 * Anything in between SND.UNA...SND.UNA+SND.WND also can be already
 * invalid. OK, let's make this for now:
 */
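
For reference, this comment sits directly above tcp_acceptable_seq() in
net/ipv4/tcp_output.c, which in 5.19 looks roughly like this (the same lines
appear in the diff context at the end of this mail; the inline comments are
mine):

static inline __u32 tcp_acceptable_seq(const struct sock *sk)
{
        const struct tcp_sock *tp = tcp_sk(sk);

        /* Window not shrunk, or shrunk by less than one window-scale unit:
         * SND.NXT is acceptable.
         */
        if (!before(tcp_wnd_end(tp), tp->snd_nxt) ||
            (tp->rx_opt.wscale_ok &&
             ((tp->snd_nxt - tcp_wnd_end(tp)) < (1 << tp->rx_opt.rcv_wscale))))
                return tp->snd_nxt;
        /* Window was shrunk: fall back to SND.UNA + SND.WND, which collapses
         * to SND.UNA when snd_wnd == 0.
         */
        else
                return tcp_wnd_end(tp);
}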

Below, I'm also pasting a plain-text version of the .pcapng, provided earlier as
an email attachment. Hopefully this makes it easier to refer to the packets as
you read through my comments. I had to massage the formatting to fit it in this
email. The data remains the same, except for the AckNum of pkt #3, which referred
to an older packet and threw off the formatting.

Initiator = 10.10.11.151 (aka NVMe/TCP host)
Target = 10.10.11.12

No. Time        Src IP          Proto    Len    SeqNum  AckNum  WinSize
1   0.000000000 10.10.11.151    TCP      4154   1       1       25
2   0.000000668 10.10.11.151    TCP      4154   4097    1       25
3   0.000039250 10.10.11.12     TCP      64     1       x       16384   
4   0.000040064 10.10.11.12     TCP      64     1       1       16384
5   0.000040951 10.10.11.12     NVMe/TCP 82     1       1       16384
6   0.000041009 10.10.11.12     NVMe/TCP 82     25      1       16384
7   0.000059422 10.10.11.12     TCP      64     49      4097    0
8   0.000060059 10.10.11.12     TCP      64     49      8193    0 
9   0.000072519 10.10.11.12     TCP      64     49      8193    16384 
10  0.000074756 10.10.11.151    TCP      4154   8193    1       25 
11  0.000075089 10.10.11.151    TCP      802    12289   1       25 
12  0.000089454 10.10.11.151    TCP      64     4097    49      25 
13  0.000102225 10.10.11.151    TCP      4154   13033   49      25 
14  0.000102567 10.10.11.151    TCP      4154   17129   49      25 
15  0.000140273 10.10.11.12     TCP      64     49      13033   16384 
16  0.000157344 10.10.11.151    TCP      106    21225   49      25 
17  0.000158580 10.10.11.12     TCP      64     49      13033   0

Packets #7 and #8: Target shrinks window to zero for congestion control
Packet #9: ~12us later Target expands window back to 16384

[Packet #12] is an ACK packet from the Initiator. Since it does not send data,
window shrinking should not affect its SEQ-NUM here. Hence, it's probably safe to
send SND.NXT, as in my patch. In other words, the TCP window should be relevant
to data packets and not to ACK packets. Would you agree?

[Referring to the SND pointers diagram at this URL 
http://www.tcpipguide.com/free/t_TCPSlidingWindowDataTransferandAcknowledgementMech-2.htm]
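
To make the numbers concrete, here is roughly how tcp_acceptable_seq() arrives
at the stale value in pkt #12. The exact snd_una at that instant is my inference
from the trace (it looks like only pkt #7's ACK had been processed at that
point):

/* Approximate initiator state when the pure ACK (pkt #12) is built:
 *
 *   tp->snd_una = 4097    pkt #1's payload (bytes 1..4096) ACKed by pkt #7
 *   tp->snd_wnd = 0       target shrank the window to zero (pkts #7/#8)
 *   tp->snd_nxt = 13033   next new byte to send, following pkt #11
 *
 * tcp_acceptable_seq() takes its final branch and returns
 *
 *   tcp_wnd_end(tp) = tp->snd_una + tp->snd_wnd = 4097 + 0 = 4097
 *
 * which is the "stale" SEQ-NUM seen in pkt #12, instead of tp->snd_nxt = 13033.
 */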

This unexpected behavior by the Initiator causes our hardware-offloaded Target
to hand off control to the firmware slow path. If we can keep everything in the
hardware (fast) path and avoid invoking firmware, we have the best chance of
optimal performance.

When running heavy workloads at 100G link rate, there can be a million instances
of the behavior packet #12 exhibits, which is very disruptive to the fast path.


> You are changing something that has been there forever, risking
> breaking many other stacks, and/or middleboxes.
> 
Regardless of how we handle this patch, the fact remains that any other
hardware-based TCP offload existing elsewhere will have the same susceptibility
as a result of this Linux TCP behavior, even if its congestion control mechanism
does not match this scenario.

Being fully aware of how ubiquitous the TCP layer is, we tried to avoid changing
it. Early on, in drivers/nvme/host/tcp.c, we had even tried
tcp_sock_set_quickack() instead of PATCH 2/2, but it did not help. If you can
suggest a better way that could de-risk existing deployments, I'd be more than
happy to discuss this further and try other solutions. An example could be a TCP
socket option that stays disabled by default; we could then use it from the
NVMe/TCP driver or via a userspace-accessible method.
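
To illustrate the opt-in idea, here is a minimal sketch of the NVMe/TCP host
driver side, assuming a hypothetical tcp_sock_set_shrunk_wnd_ack() helper
modeled on the existing tcp_sock_set_*() family. The helper name, its
semantics, and the wrapper function below are invented purely for discussion,
not an existing API:

/* Hypothetical, disabled-by-default opt-in; tcp_sock_set_shrunk_wnd_ack()
 * does not exist today.
 */
static void nvme_tcp_set_sock_opts(struct sock *sk)
{
        /* Existing helper we already experimented with (did not help): */
        tcp_sock_set_quickack(sk, 1);

        /* Invented opt-in: only sockets that ask for it would get the
         * SND.NXT-based SEQ selection when the peer shrinks its window.
         */
        tcp_sock_set_shrunk_wnd_ack(sk, 1);
}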

> It seems the remote TCP stack is quite buggy, I do not think we want
> to change something which has never been an issue until today in 2022.
> 
IMHO, I wouldn't quite characterize it that way. It was a design choice and
provides one of multiple ways of handling network congestion. There may also be
other implementations affected by this issue.


> Also note that packet #13, sent immediately after the ACK, is carrying
> whatever needed values.
> I do not see why the prior packet (#12) would matter.
> 
> Please elaborate.
Packet #13, however, is a data packet sent by the Initiator. It is a direct
reaction to packet #9 from the Target, which expanded the window size back to
16K. Even though it correctly uses SND.NXT, it does not affect the handling of
packet #12 in our hardware Target.


> 
> 
> > > > It has been observed that processing of these
> > > > “stale” ACKs or data retransmissions impacts NVMe/TCP Write IOPs
> > > > performance.
> > > > 
> > > > Network traffic analysis revealed that SEQ-NUM being used by the host to
> > > > ACK the frame that resized the TCP window had an older SEQ-NUM and not a
> > > > value corresponding to the next SEQ-NUM expected on that connection.
> > > > 
> > > > In such a case, the kernel was using the seq number calculated by
> > > > tcp_wnd_end() as per the code segment below. Since, in this case
> > > > tp->snd_wnd=0, tcp_wnd_end(tp) returns tp->snd_una, which is incorrect
> > > > for the scenario.  The correct seq number that needs to be returned is
> > > > tp->snd_nxt. This fix seems to have fixed the stale SEQ-NUM issue along
> > > > with its performance impact.
> > > > 
> > > >   1271 static inline u32 tcp_wnd_end(const struct tcp_sock *tp)
> > > >   1272 {
> > > >   1273   return tp->snd_una + tp->snd_wnd;
> > > >   1274 }
> > > > 
> > > > Signed-off-by: Kamaljit Singh <kamaljit.singh1@....com>
> > > > ---
> > > >  net/ipv4/tcp_output.c | 3 +++
> > > >  1 file changed, 3 insertions(+)
> > > > 
> > > > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > > > index 11aa0ab10bba..322e061edb72 100644
> > > > --- a/net/ipv4/tcp_output.c
> > > > +++ b/net/ipv4/tcp_output.c
> > > > @@ -100,6 +100,9 @@ static inline __u32 tcp_acceptable_seq(const struct sock *sk)
> > > >             (tp->rx_opt.wscale_ok &&
> > > >              ((tp->snd_nxt - tcp_wnd_end(tp)) < (1 << tp->rx_opt.rcv_wscale))))
> > > >                 return tp->snd_nxt;
> > > > +       else if (!tp->snd_wnd && !sock_flag(sk, SOCK_DEAD) &&
> > > > +                !((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)))
> > > > +               return tp->snd_nxt;
> > > >         else
> > > >                 return tcp_wnd_end(tp);
> > > >  }
> > > > --
> > > > 2.25.1
> > > > 
> > --
> > Thanks,
> > Kamaljit Singh <kamaljit.singh1@....com>

