lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Date:   Thu, 20 Oct 2022 19:48:36 -0700
From:   Eric Dumazet <edumazet@...gle.com>
To:     Kamaljit Singh <Kamaljit.Singh1@....com>
Cc:     Niklas Cassel <Niklas.Cassel@....com>,
        "davem@...emloft.net" <davem@...emloft.net>,
        Damien Le Moal <Damien.LeMoal@....com>,
        "dsahern@...nel.org" <dsahern@...nel.org>,
        "yoshfuji@...ux-ipv6.org" <yoshfuji@...ux-ipv6.org>,
        "kuba@...nel.org" <kuba@...nel.org>,
        "pabeni@...hat.com" <pabeni@...hat.com>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: Re: [PATCH v1 1/2] tcp: Fix for stale host ACK when tgt window shrunk

On Thu, Oct 20, 2022 at 6:01 PM Kamaljit Singh <Kamaljit.Singh1@....com> wrote:
>
> On Thu, 2022-10-20 at 13:45 -0700, Eric Dumazet wrote:
> > CAUTION: This email originated from outside of Western Digital. Do not click
> > on links or open attachments unless you recognize the sender and know that the
> > content is safe.
> >
> >
> > On Thu, Oct 20, 2022 at 11:22 AM Kamaljit Singh <kamaljit.singh1@....com>
> > wrote:
> > > Under certain congestion conditions, an NVMe/TCP target may be configured
> > > to shrink the TCP window in an effort to slow the sender down prior to
> > > issuing a more drastic L2 pause or PFC indication.  Although the TCP
> > > standard discourages implementations from shrinking the TCP window, it also
> > > states that TCP implementations must be robust to this occurring. The
> > > current Linux TCP layer (in conjunction with the NVMe/TCP host driver) has
> > > an issue when the TCP window is shrunk by a target, which causes ACK frames
> > > to be transmitted with a “stale” SEQ_NUM or for data frames to be
> > > retransmitted by the host.
> >
> > Linux sends ACK packets, with a legal SEQ number.
> >
> > The issue is the receiver of such packets, right ?
> Not exactly. In certain conditions the ACK pkt being sent by the NVMe/TCP
> initiator has an incorrect SEQ-NUM.
>
> I've attached a .pcapng Network trace for Wireshark. This captures a small
> snippet of 4K Writes from 10.10.11.151 to a target at 10.10.11.12 (using fio).
> As you see pkt #2 shows a SEQ-NUM 4097, which is repeated in ACK pkt #12 from
> the initiator. This happens right after the target closes the TCP window (pkts
> #7, #8). Pkt #12 should've used a SEQ-NUM of 13033 in continuation from pkt #11.
>
> This patch addresses the above scenario (tp->snd_wnd=0) and returns the correct
> SEQ-NUM that is based on tp->snd_nxt. Without this patch the last else path was
> returning tcp_wnd_end(tp), which sent the stale SEQ-NUM.
>
> Initiator Environment:
> - NVMe-oF Initiator: drivers/nvme/host/tcp.c
> - NIC driver: mlx5_core (Mellanox, 100G), IP addr 10.10.11.151
> - Ubuntu 20.04 LTS, Kernel 5.19.0-rc7 (with above patches 1 & 2 only)
>
>
> >
> > Because as you said receivers should be relaxed about this, especially
> > if _they_ decided
> > to not respect the TCP standards.
> >
> > You are proposing to send old ACK, that might be dropped by other stacks.
> On the contrary, I'm proposing to use the expected/correct ACK based on tp-
> >snd_nxt.

Please take a look at the very lengthy comment at the front of the function.

Basically we are in a mode where a value needs to be chosen, and we do
not really know which one
will be accepted by the buggy peer.

You are changing something that has been there forever, risking
breaking many other stacks, and/or middleboxes.

It seems the remote TCP stack is quite buggy, I do not think we want
to change something which has never been an issue until today in 2022.

Also not that packet #13, sent immediately after the ACK is carrying
whatever needed values.
I do not see why the prior packet (#12) would matter.

Please elaborate.



>
>
> >
> > It has been observed that processing of these
> > > “stale” ACKs or data retransmissions impacts NVMe/TCP Write IOPs
> > > performance.
> > >
> > > Network traffic analysis revealed that SEQ-NUM being used by the host to
> > > ACK the frame that resized the TCP window had an older SEQ-NUM and not a
> > > value corresponding to the next SEQ-NUM expected on that connection.
> > >
> > > In such a case, the kernel was using the seq number calculated by
> > > tcp_wnd_end() as per the code segment below. Since, in this case
> > > tp->snd_wnd=0, tcp_wnd_end(tp) returns tp->snd_una, which is incorrect for
> > > the scenario.  The correct seq number that needs to be returned is
> > > tp->snd_nxt. This fix seems to have fixed the stale SEQ-NUM issue along
> > > with its performance impact.
> > >
> > >   1271 static inline u32 tcp_wnd_end(const struct tcp_sock *tp)
> > >   1272 {
> > >   1273   return tp->snd_una + tp->snd_wnd;
> > >   1274 }
> > >
> > > Signed-off-by: Kamaljit Singh <kamaljit.singh1@....com>
> > > ---
> > >  net/ipv4/tcp_output.c | 3 +++
> > >  1 file changed, 3 insertions(+)
> > >
> > > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > > index 11aa0ab10bba..322e061edb72 100644
> > > --- a/net/ipv4/tcp_output.c
> > > +++ b/net/ipv4/tcp_output.c
> > > @@ -100,6 +100,9 @@ static inline __u32 tcp_acceptable_seq(const struct sock
> > > *sk)
> > >             (tp->rx_opt.wscale_ok &&
> > >              ((tp->snd_nxt - tcp_wnd_end(tp)) < (1 << tp-
> > > >rx_opt.rcv_wscale))))
> > >                 return tp->snd_nxt;
> > > +       else if (!tp->snd_wnd && !sock_flag(sk, SOCK_DEAD) &&
> > > +                !((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)))
> > > +               return tp->snd_nxt;
> > >         else
> > >                 return tcp_wnd_end(tp);
> > >  }
> > > --
> > > 2.25.1
> > >
> --
> Thanks,
> Kamaljit Singh <kamaljit.singh1@....com>

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ