netdev - Re: [PATCH net 2/2] tcp: fix delayed ACKs for MSS boundary condition

Message-ID: <CAMaK5_gz=B5wJhaC5MtgwiQi9Tm8fkhLdiWQLz9DX+jf0S7P=Q@mail.gmail.com>
Date: Thu, 28 Sep 2023 16:53:38 +0800
From: Xin Guo <guoxin0309@...il.com>
To: Neal Cardwell <ncardwell.sw@...il.com>
Cc: David Miller <davem@...emloft.net>, Jakub Kicinski <kuba@...nel.org>, 
	Eric Dumazet <edumazet@...gle.com>, netdev@...r.kernel.org, 
	Neal Cardwell <ncardwell@...gle.com>, Yuchung Cheng <ycheng@...gle.com>
Subject: Re: [PATCH net 2/2] tcp: fix delayed ACKs for MSS boundary condition

Hi Neal,
I cannot understand the phrases "if an app reads > 1*MSS data", "If an app
reads < 1*MSS data" and "if an app reads exactly 1*MSS of data" in the
commit message.
In my view, they should read: "if an app reads and the received data is
> 1*MSS", "If an app reads and the received data is < 1*MSS" and "if an app
reads and the received data is exactly 1*MSS".

Regards
Guo Xin

On Wed, Sep 27, 2023 at 23:15, Neal Cardwell <ncardwell.sw@...il.com> wrote:
>
> From: Neal Cardwell <ncardwell@...gle.com>
>
> This commit fixes poor delayed ACK behavior that can cause poor TCP
> latency in a particular boundary condition: when an application makes
> a TCP socket write that is an exact multiple of the MSS size.
>
> The problem is that there is a painful boundary discontinuity in the
> current delayed ACK behavior. With the current delayed ACK behavior,
> we have:
>
> (1) If an app reads > 1*MSS data, tcp_cleanup_rbuf() ACKs immediately
>     because of:
>
>      tp->rcv_nxt - tp->rcv_wup > icsk->icsk_ack.rcv_mss ||
>
> (2) If an app reads < 1*MSS data and either (a) app is not ping-pong or
>     (b) we received two packets <1*MSS, then tcp_cleanup_rbuf() ACKs
>     immediately because of:
>
>      ((icsk->icsk_ack.pending & ICSK_ACK_PUSHED2) ||
>       ((icsk->icsk_ack.pending & ICSK_ACK_PUSHED) &&
>        !inet_csk_in_pingpong_mode(sk))) &&
>
> (3) *However*: if an app reads exactly 1*MSS of data,
>     tcp_cleanup_rbuf() does not send an immediate ACK. This is true
>     even if the app is not ping-pong and the 1*MSS of data had the PSH
>     bit set, suggesting the sending application completed an
>     application write.
>
> Thus if the app is not ping-pong, we have this painful case where
> >1*MSS gets an immediate ACK, and <1*MSS gets an immediate ACK, but a
> write whose last skb is an exact multiple of 1*MSS can get a 40ms
> delayed ACK. This means that any app that transfers data in one
> direction and takes care to align write size or packet size with MSS
> can suffer this problem. With receive zero copy making 4KB MSS values
> more common, it is becoming more common to have application writes
> naturally align with MSS, and more applications are likely to
> encounter this delayed ACK problem.
>
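
[Illustration] The three cases listed above can be sketched as a small
user-space model. This is not the kernel code: the struct, the MODEL_*
flag values and every model_* name are hypothetical stand-ins for the
fields quoted in the commit message, and the 1000-byte MSS is an arbitrary
example value. The "read_all" field models the "app reads all of the data"
condition that the commit message mentions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for ICSK_ACK_PUSHED / ICSK_ACK_PUSHED2.
 * The numeric values are illustrative only. */
enum { MODEL_ACK_PUSHED = 1 << 0, MODEL_ACK_PUSHED2 = 1 << 1 };

struct model_rx {
	unsigned int unacked;  /* models tp->rcv_nxt - tp->rcv_wup */
	unsigned int rcv_mss;  /* models icsk->icsk_ack.rcv_mss */
	unsigned int pending;  /* models icsk->icsk_ack.pending */
	bool pingpong;         /* models inet_csk_in_pingpong_mode(sk) */
	bool read_all;         /* the app has read all queued data */
};

/* Returns true when the modelled receiver would ACK immediately. */
static bool model_time_to_ack(const struct model_rx *rx)
{
	/* Case (1): more than one MSS of unACKed data. */
	if (rx->unacked > rx->rcv_mss)
		return true;
	/* The remaining clauses only apply once the app has read everything. */
	if (!rx->read_all)
		return false;
	/* Case (2b): two sub-MSS packets were received. */
	if (rx->pending & MODEL_ACK_PUSHED2)
		return true;
	/* Case (2a): one sub-MSS packet and the app is not ping-pong. */
	return (rx->pending & MODEL_ACK_PUSHED) && !rx->pingpong;
}

/* Walks the three cases with an example MSS of 1000 bytes. */
static void model_three_cases(void)
{
	struct model_rx more  = { 1500, 1000, 0, false, true };
	struct model_rx less  = { 500, 1000, MODEL_ACK_PUSHED, false, true };
	struct model_rx exact = { 1000, 1000, 0, false, true };

	assert(model_time_to_ack(&more));    /* > 1*MSS: immediate ACK */
	assert(model_time_to_ack(&less));    /* < 1*MSS: immediate ACK */
	assert(!model_time_to_ack(&exact));  /* == 1*MSS: ACK is delayed */
}
```

In this model the exact-MSS case falls through every clause because no
PUSHED bit was recorded for it, which is the discontinuity the commit
message describes.
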
> The fix in this commit is to refine the delayed ACK heuristics with a
> simple check: immediately ACK a received 1*MSS skb with PSH bit set if
> the app reads all data. Why? If an skb has a len of exactly 1*MSS and
> has the PSH bit set then it is likely the end of an application
> write. So more data may not be arriving soon, and yet the data sender
> may be waiting for an ACK if cwnd-bound or using TX zero copy. Thus we
> set ICSK_ACK_PUSHED in this case so that tcp_cleanup_rbuf() will send
> an ACK immediately if the app reads all of the data and is not
> ping-pong. Note that this logic is also executed for the case where
> len > MSS, but in that case this logic does not matter (and does not
> hurt) because tcp_cleanup_rbuf() will always ACK immediately if the
> app reads data and there is more than an MSS of unACKed data.
>
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> Signed-off-by: Neal Cardwell <ncardwell@...gle.com>
> Reviewed-by: Yuchung Cheng <ycheng@...gle.com>
> Reviewed-by: Eric Dumazet <edumazet@...gle.com>
> ---
>  net/ipv4/tcp_input.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 06fe1cf645d5a..8afb0950a6979 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -253,6 +253,19 @@ static void tcp_measure_rcv_mss(struct sock *sk, const struct sk_buff *skb)
>                 if (unlikely(len > icsk->icsk_ack.rcv_mss +
>                                    MAX_TCP_OPTION_SPACE))
>                         tcp_gro_dev_warn(sk, skb, len);
> +               /* If the skb has a len of exactly 1*MSS and has the PSH bit
> +                * set then it is likely the end of an application write. So
> +                * more data may not be arriving soon, and yet the data sender
> +                * may be waiting for an ACK if cwnd-bound or using TX zero
> +                * copy. So we set ICSK_ACK_PUSHED here so that
> +                * tcp_cleanup_rbuf() will send an ACK immediately if the app
> +                * reads all of the data and is not ping-pong. If len > MSS
> +                * then this logic does not matter (and does not hurt) because
> +                * tcp_cleanup_rbuf() will always ACK immediately if the app
> +                * reads data and there is more than an MSS of unACKed data.
> +                */
> +               if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_PSH)
> +                       icsk->icsk_ack.pending |= ICSK_ACK_PUSHED;
>         } else {
>                 /* Otherwise, we make more careful check taking into account,
>                  * that SACKs block is variable.
> --
> 2.42.0.515.g380fc7ccd1-goog
>
>
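
[Illustration] The effect of the added hunk can be sketched in user space
as follows. This is a simplified model, not the kernel implementation: the
struct, MODEL_ACK_PUSHED and the model_* functions are hypothetical names,
and the model assumes the hunk sits in the branch that handles segments
with len >= rcv_mss (the diff context does not show that condition, but
the new comment implies it). Only the PUSHED clause of the
tcp_cleanup_rbuf() check is modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for ICSK_ACK_PUSHED; the value is illustrative. */
#define MODEL_ACK_PUSHED 0x1u

struct model_ack {
	unsigned int rcv_mss;  /* models icsk->icsk_ack.rcv_mss */
	unsigned int pending;  /* models icsk->icsk_ack.pending */
};

/* Models the added lines: a full-sized segment carrying PSH is marked
 * PUSHED. Assumption: the enclosing branch is taken for len >= rcv_mss. */
static void model_measure_rcv_mss(struct model_ack *a, unsigned int len,
				  bool psh)
{
	if (len >= a->rcv_mss && psh)
		a->pending |= MODEL_ACK_PUSHED;
}

/* Models only the PUSHED clause of the read-time ACK decision. */
static bool model_ack_on_read(const struct model_ack *a, bool read_all,
			      bool pingpong)
{
	return read_all && (a->pending & MODEL_ACK_PUSHED) && !pingpong;
}

/* An exact-MSS segment with PSH now triggers an immediate ACK once the
 * app has read everything, unless the connection is in ping-pong mode. */
static void model_patched_case(void)
{
	struct model_ack a = { 1000, 0 };

	model_measure_rcv_mss(&a, 1000, true);
	assert(model_ack_on_read(&a, true, false));
	assert(!model_ack_on_read(&a, true, true));
}
```

Without the PSH bit the model leaves "pending" untouched, so a full-sized
segment in the middle of a longer write is still subject to the usual
delayed ACK behavior.
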