netdev - Re: [PATCH v1 1/2] tcp: Fix for stale host ACK when tgt window shrunk


Message-ID: <29e89051d65ae93dc5515c59f56bed4e2e5d8e9f.camel@wdc.com>
Date:   Fri, 21 Oct 2022 01:01:47 +0000
From:   Kamaljit Singh <Kamaljit.Singh1@....com>
To:     "edumazet@...gle.com" <edumazet@...gle.com>
CC:     Niklas Cassel <Niklas.Cassel@....com>,
        "davem@...emloft.net" <davem@...emloft.net>,
        Damien Le Moal <Damien.LeMoal@....com>,
        "dsahern@...nel.org" <dsahern@...nel.org>,
        "yoshfuji@...ux-ipv6.org" <yoshfuji@...ux-ipv6.org>,
        "kuba@...nel.org" <kuba@...nel.org>,
        "pabeni@...hat.com" <pabeni@...hat.com>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: Re: [PATCH v1 1/2] tcp: Fix for stale host ACK when tgt window shrunk

On Thu, 2022-10-20 at 13:45 -0700, Eric Dumazet wrote:
> On Thu, Oct 20, 2022 at 11:22 AM Kamaljit Singh <kamaljit.singh1@....com>
> wrote:
> > Under certain congestion conditions, an NVMe/TCP target may be configured
> > to shrink the TCP window in an effort to slow the sender down prior to
> > issuing a more drastic L2 pause or PFC indication.  Although the TCP
> > standard discourages implementations from shrinking the TCP window, it also
> > states that TCP implementations must be robust to this occurring. The
> > current Linux TCP layer (in conjunction with the NVMe/TCP host driver) has
> > an issue when the TCP window is shrunk by a target, which causes ACK frames
> > to be transmitted with a “stale” SEQ_NUM or for data frames to be
> > retransmitted by the host.
> 
> Linux sends ACK packets, with a legal SEQ number.
> 
> The issue is the receiver of such packets, right ?
Not exactly. Under certain conditions the ACK pkt sent by the NVMe/TCP
initiator carries an incorrect SEQ-NUM.

I've attached a .pcapng network trace for Wireshark. It captures a small
snippet of 4K Writes from 10.10.11.151 to a target at 10.10.11.12 (using fio).
As you can see, pkt #2 carries SEQ-NUM 4097, which is repeated in ACK pkt #12
from the initiator. This happens right after the target closes the TCP window
(pkts #7, #8). Pkt #12 should have used SEQ-NUM 13033, continuing from pkt #11.

This patch addresses the above scenario (tp->snd_wnd=0) and returns the correct
SEQ-NUM, which is based on tp->snd_nxt. Without this patch the last else path
returned tcp_wnd_end(tp), which equals tp->snd_una when tp->snd_wnd=0 and so
produced the stale SEQ-NUM.
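
To make the before/after behaviour concrete, here is a minimal userspace sketch
of the selection logic. It is not kernel code: struct model_tp, seq_before(),
model_wnd_end(), model_acceptable_seq() and the "handshake" flag (standing in
for the TCPF_SYN_SENT | TCPF_SYN_RECV state check) are hypothetical stand-ins
for illustration only. The SEQ numbers in the usage note are the ones from the
trace; treating 4097 as snd_una and using a window scale of 7 are assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the few tcp_sock/sock fields involved. */
struct model_tp {
    uint32_t snd_una;    /* oldest unacknowledged sequence number */
    uint32_t snd_nxt;    /* next sequence number to send */
    uint32_t snd_wnd;    /* window advertised by the peer */
    bool     wscale_ok;  /* window scaling negotiated */
    uint8_t  rcv_wscale; /* window scale shift (assumed value in the demo) */
    bool     sock_dead;  /* models sock_flag(sk, SOCK_DEAD) */
    bool     handshake;  /* models state being SYN_SENT or SYN_RECV */
};

/* Wraparound-safe "a is before b" comparison on 32-bit sequence numbers. */
static bool seq_before(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}

/* Same arithmetic as tcp_wnd_end(): right edge of the send window. */
static uint32_t model_wnd_end(const struct model_tp *tp)
{
    return tp->snd_una + tp->snd_wnd;
}

/* Models tcp_acceptable_seq(); 'patched' enables the proposed extra branch. */
static uint32_t model_acceptable_seq(const struct model_tp *tp, bool patched)
{
    uint32_t wnd_end = model_wnd_end(tp);

    /* Existing checks: snd_nxt is inside the window, or close enough to it. */
    if (!seq_before(wnd_end, tp->snd_nxt) ||
        (tp->wscale_ok &&
         (tp->snd_nxt - wnd_end) < (1u << tp->rcv_wscale)))
        return tp->snd_nxt;

    /* Proposed branch: zero window on a live, established connection. */
    if (patched && !tp->snd_wnd && !tp->sock_dead && !tp->handshake)
        return tp->snd_nxt;

    /* Original fallback: with snd_wnd == 0 this is just snd_una. */
    return wnd_end;
}
```

With snd_una=4097, snd_nxt=13033 and snd_wnd=0, model_acceptable_seq(&tp, false)
yields 4097 (the stale value seen in pkt #12), while model_acceptable_seq(&tp,
true) yields 13033.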

Initiator Environment:
- NVMe-oF Initiator: drivers/nvme/host/tcp.c
- NIC driver: mlx5_core (Mellanox, 100G), IP addr 10.10.11.151
- Ubuntu 20.04 LTS, Kernel 5.19.0-rc7 (with above patches 1 & 2 only)


> 
> Because as you said receivers should be relaxed about this, especially
> if _they_ decided
> to not respect the TCP standards.
> 
> You are proposing to send old ACK, that might be dropped by other stacks.
On the contrary, I'm proposing to use the expected/correct SEQ-NUM in the ACK,
based on tp->snd_nxt.


> 
> > It has been observed that processing of these
> > “stale” ACKs or data retransmissions impacts NVMe/TCP Write IOPs
> > performance.
> > 
> > Network traffic analysis revealed that SEQ-NUM being used by the host to
> > ACK the frame that resized the TCP window had an older SEQ-NUM and not a
> > value corresponding to the next SEQ-NUM expected on that connection.
> > 
> > In such a case, the kernel was using the seq number calculated by
> > tcp_wnd_end() as per the code segment below. Since, in this case
> > tp->snd_wnd=0, tcp_wnd_end(tp) returns tp->snd_una, which is incorrect for
> > the scenario.  The correct seq number that needs to be returned is
> > tp->snd_nxt. This fix seems to have fixed the stale SEQ-NUM issue along
> > with its performance impact.
> > 
> >   static inline u32 tcp_wnd_end(const struct tcp_sock *tp)
> >   {
> >     return tp->snd_una + tp->snd_wnd;
> >   }
> > 
> > Signed-off-by: Kamaljit Singh <kamaljit.singh1@....com>
> > ---
> >  net/ipv4/tcp_output.c | 3 +++
> >  1 file changed, 3 insertions(+)
> > 
> > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > index 11aa0ab10bba..322e061edb72 100644
> > --- a/net/ipv4/tcp_output.c
> > +++ b/net/ipv4/tcp_output.c
> > @@ -100,6 +100,9 @@ static inline __u32 tcp_acceptable_seq(const struct sock *sk)
> >             (tp->rx_opt.wscale_ok &&
> >              ((tp->snd_nxt - tcp_wnd_end(tp)) < (1 << tp->rx_opt.rcv_wscale))))
> >                 return tp->snd_nxt;
> > +       else if (!tp->snd_wnd && !sock_flag(sk, SOCK_DEAD) &&
> > +                !((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)))
> > +               return tp->snd_nxt;
> >         else
> >                 return tcp_wnd_end(tp);
> >  }
> > --
> > 2.25.1
> > 
-- 
Thanks,
Kamaljit Singh <kamaljit.singh1@....com>

Download attachment "AckWithStaleSeqNum.pcapng" of type "application/x-pcapng" (23028 bytes)