Message-ID: <29e89051d65ae93dc5515c59f56bed4e2e5d8e9f.camel@wdc.com>
Date: Fri, 21 Oct 2022 01:01:47 +0000
From: Kamaljit Singh <Kamaljit.Singh1@....com>
To: "edumazet@...gle.com" <edumazet@...gle.com>
CC: Niklas Cassel <Niklas.Cassel@....com>,
"davem@...emloft.net" <davem@...emloft.net>,
Damien Le Moal <Damien.LeMoal@....com>,
"dsahern@...nel.org" <dsahern@...nel.org>,
"yoshfuji@...ux-ipv6.org" <yoshfuji@...ux-ipv6.org>,
"kuba@...nel.org" <kuba@...nel.org>,
"pabeni@...hat.com" <pabeni@...hat.com>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: Re: [PATCH v1 1/2] tcp: Fix for stale host ACK when tgt window shrunk
On Thu, 2022-10-20 at 13:45 -0700, Eric Dumazet wrote:
> On Thu, Oct 20, 2022 at 11:22 AM Kamaljit Singh <kamaljit.singh1@....com>
> wrote:
> > Under certain congestion conditions, an NVMe/TCP target may be configured
> > to shrink the TCP window in an effort to slow the sender down prior to
> > issuing a more drastic L2 pause or PFC indication. Although the TCP
> > standard discourages implementations from shrinking the TCP window, it also
> > states that TCP implementations must be robust to this occurring. The
> > current Linux TCP layer (in conjunction with the NVMe/TCP host driver) has
> > an issue when the TCP window is shrunk by a target, which causes ACK frames
> > to be transmitted with a “stale” SEQ_NUM or for data frames to be
> > retransmitted by the host.
>
> Linux sends ACK packets, with a legal SEQ number.
>
> The issue is the receiver of such packets, right ?
Not exactly. Under certain conditions the ACK pkt sent by the NVMe/TCP
initiator carries an incorrect SEQ-NUM.

I've attached a .pcapng network trace for Wireshark. It captures a small
snippet of 4K Writes (using fio) from 10.10.11.151 to a target at 10.10.11.12.
As you can see, pkt #2 has SEQ-NUM 4097, which is repeated in ACK pkt #12 from
the initiator. This happens right after the target closes the TCP window (pkts
#7, #8). Pkt #12 should have used SEQ-NUM 13033, continuing from pkt #11.

This patch addresses the above scenario (tp->snd_wnd=0) and returns the correct
SEQ-NUM, which is based on tp->snd_nxt. Without this patch the final else path
returned tcp_wnd_end(tp), which produced the stale SEQ-NUM.

Initiator Environment:
- NVMe-oF Initiator: drivers/nvme/host/tcp.c
- NIC driver: mlx5_core (Mellanox, 100G), IP addr 10.10.11.151
- Ubuntu 20.04 LTS, Kernel 5.19.0-rc7 (with above patches 1 & 2 only)
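
To make the two return paths concrete, here is a rough user-space sketch of
the selection logic in tcp_acceptable_seq(), with and without the proposed
branch. It is not kernel code. struct mock_tcp, seq_before(), wnd_end(),
acceptable_seq() and the "patched" flag are hypothetical names made up for
this illustration, and the "dead" and "syn_state" booleans only stand in for
the sock_flag() and sk_state checks in the patch. It also assumes that
snd_una equals the 4097 seen in the trace and that the scale shift is small
enough for the wscale tolerance check not to apply.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the few struct tcp_sock fields involved. */
struct mock_tcp {
    uint32_t snd_una;    /* oldest unacknowledged sequence number */
    uint32_t snd_nxt;    /* next sequence number to be sent */
    uint32_t snd_wnd;    /* window advertised by the peer */
    bool     wscale_ok;  /* window scaling was negotiated */
    uint8_t  rcv_wscale; /* scale shift used in the tolerance check */
    bool     dead;       /* stands in for sock_flag(sk, SOCK_DEAD) */
    bool     syn_state;  /* stands in for TCP_SYN_SENT / TCP_SYN_RECV */
};

/* Wrap-safe "seq1 < seq2", the same idea as the kernel's before(). */
static bool seq_before(uint32_t seq1, uint32_t seq2)
{
    return (int32_t)(seq1 - seq2) < 0;
}

/* Mirrors tcp_wnd_end(): right edge of the peer's advertised window. */
static uint32_t wnd_end(const struct mock_tcp *tp)
{
    return tp->snd_una + tp->snd_wnd;
}

/* Sequence number chosen for an outgoing ACK; patched=true adds the
 * zero-window branch proposed in this patch. */
static uint32_t acceptable_seq(const struct mock_tcp *tp, bool patched)
{
    if (!seq_before(wnd_end(tp), tp->snd_nxt) ||
        (tp->wscale_ok &&
         (tp->snd_nxt - wnd_end(tp)) < (1u << tp->rcv_wscale)))
        return tp->snd_nxt;
    if (patched && !tp->snd_wnd && !tp->dead && !tp->syn_state)
        return tp->snd_nxt;
    return wnd_end(tp);
}
```

With snd_una=4097, snd_nxt=13033, snd_wnd=0 and an assumed rcv_wscale of 7,
acceptable_seq() returns 4097 when patched is false and 13033 when it is true,
which matches pkt #12 in the trace and the value it should have carried.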
>
> Because as you said receivers should be relaxed about this, especially
> if _they_ decided
> to not respect the TCP standards.
>
> You are proposing to send old ACK, that might be dropped by other stacks.
On the contrary, I'm proposing to use the expected/correct ACK based on
tp->snd_nxt.
>
> It has been observed that processing of these
> > “stale” ACKs or data retransmissions impacts NVMe/TCP Write IOPs
> > performance.
> >
> > Network traffic analysis revealed that SEQ-NUM being used by the host to
> > ACK the frame that resized the TCP window had an older SEQ-NUM and not a
> > value corresponding to the next SEQ-NUM expected on that connection.
> >
> > In such a case, the kernel was using the seq number calculated by
> > tcp_wnd_end() as per the code segment below. Since, in this case
> > tp->snd_wnd=0, tcp_wnd_end(tp) returns tp->snd_una, which is incorrect for
> > the scenario. The correct seq number that needs to be returned is
> > tp->snd_nxt. This fix seems to have fixed the stale SEQ-NUM issue along
> > with its performance impact.
> >
> > static inline u32 tcp_wnd_end(const struct tcp_sock *tp)
> > {
> > 	return tp->snd_una + tp->snd_wnd;
> > }
> >
> > Signed-off-by: Kamaljit Singh <kamaljit.singh1@....com>
> > ---
> > net/ipv4/tcp_output.c | 3 +++
> > 1 file changed, 3 insertions(+)
> >
> > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > index 11aa0ab10bba..322e061edb72 100644
> > --- a/net/ipv4/tcp_output.c
> > +++ b/net/ipv4/tcp_output.c
> > @@ -100,6 +100,9 @@ static inline __u32 tcp_acceptable_seq(const struct sock *sk)
> >  	    (tp->rx_opt.wscale_ok &&
> >  	     ((tp->snd_nxt - tcp_wnd_end(tp)) < (1 << tp->rx_opt.rcv_wscale))))
> >  		return tp->snd_nxt;
> > +	else if (!tp->snd_wnd && !sock_flag(sk, SOCK_DEAD) &&
> > +		 !((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)))
> > +		return tp->snd_nxt;
> >  	else
> >  		return tcp_wnd_end(tp);
> >  }
> > --
> > 2.25.1
> >
--
Thanks,
Kamaljit Singh <kamaljit.singh1@....com>
Download attachment "AckWithStaleSeqNum.pcapng" of type "application/x-pcapng" (23028 bytes)