linux-kernel - Re: [PATCH net-next] net: tcp: Updating MSS, when the sending window is smaller than MSS.

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CANn89iKnT5Ebk5vovFJKHY4Fe7ERkN3ak_Nkqyc=vXL=VWvyPg@mail.gmail.com>
Date:   Tue, 8 Jun 2021 17:56:52 +0200
From:   Eric Dumazet <edumazet@...gle.com>
To:     Shuyi Cheng <chengshuyi@...ux.alibaba.com>
Cc:     David Miller <davem@...emloft.net>,
        Jakub Kicinski <kuba@...nel.org>,
        Hideaki YOSHIFUJI <yoshfuji@...ux-ipv6.org>,
        David Ahern <dsahern@...nel.org>,
        netdev <netdev@...r.kernel.org>,
        LKML <linux-kernel@...r.kernel.org>,
        kernel-janitors@...r.kernel.org,
        Mao Wenan <wenan.mao@...ux.alibaba.com>
Subject: Re: [PATCH net-next] net: tcp: Updating MSS, when the sending window
 is smaller than MSS.

On Tue, Jun 8, 2021 at 4:26 AM Shuyi Cheng <chengshuyi@...ux.alibaba.com> wrote:
>
>
>
> On 6/7/21 7:34 PM, Eric Dumazet wrote:
> > On Mon, Jun 7, 2021 at 11:35 AM Shuyi Cheng
> > <chengshuyi@...ux.alibaba.com> wrote:
> >>
> >> When the lo network card is used for communication, the tcp server
> >> reduces the size of the receiving buffer, causing the tcp client
> >> to have a delay of 200ms. Examples are as follows:
> >>
> >> Suppose that the MTU of the lo network card is 65536, and the tcp server
> >> sets the receive buffer size to 42KB. According to the
> >> tcp_bound_to_half_wnd function, the MSS of the server and client is
> >> 21KB. Then, the tcp server sets the buffer size of the connection to
> >> 16KB. At this time, the MSS of the server is 8KB, and the MSS of the
> >> client is still 21KB. But it will cause the client to fail to send the
> >> message, that is, tcp_write_xmit fails. Mainly because tcp_snd_wnd_test
> >> failed, and then entered the zero window detection phase, resulting in a
> >> 200ms delay.
> >>
> >> Therefore, we mainly modify two places. One is the tcp_current_mss
> >> function. When the sending window is smaller than the current mss, mss
> >> needs to be updated. The other is the tcp_bound_to_half_wnd function.
> >> When the sending window is smaller than the current mss, the mss value
> >> should be calculated according to the current sending window, not
> >> max_window.
> >>
> >> Signed-off-by: Shuyi Cheng <chengshuyi@...ux.alibaba.com>
> >> ---
> >>   include/net/tcp.h     | 11 ++++++++---
> >>   net/ipv4/tcp_output.c |  3 ++-
> >>   2 files changed, 10 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/include/net/tcp.h b/include/net/tcp.h
> >> index e668f1b..fcdef16 100644
> >> --- a/include/net/tcp.h
> >> +++ b/include/net/tcp.h
> >> @@ -641,6 +641,11 @@ static inline void tcp_clear_xmit_timers(struct sock *sk)
> >>   static inline int tcp_bound_to_half_wnd(struct tcp_sock *tp, int pktsize)
> >>   {
> >>          int cutoff;
> >> +       int window;
> >> +
> >> +       window = tp->max_window;
> >> +       if (tp->snd_wnd && tp->snd_wnd < pktsize)
> >> +               window = tp->snd_wnd;
> >>
> >>          /* When peer uses tiny windows, there is no use in packetizing
> >>           * to sub-MSS pieces for the sake of SWS or making sure there
> >> @@ -649,10 +654,10 @@ static inline int tcp_bound_to_half_wnd(struct tcp_sock *tp, int pktsize)
> >>           * On the other hand, for extremely large MSS devices, handling
> >>           * smaller than MSS windows in this way does make sense.
> >>           */
> >> -       if (tp->max_window > TCP_MSS_DEFAULT)
> >> -               cutoff = (tp->max_window >> 1);
> >> +       if (window > TCP_MSS_DEFAULT)
> >> +               cutoff = (window >> 1);
> >>          else
> >> -               cutoff = tp->max_window;
> >> +               cutoff = window;
> >>
> >>          if (cutoff && pktsize > cutoff)
> >>                  return max_t(int, cutoff, 68U - tp->tcp_header_len);
> >> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> >> index bde781f..88dcdf2 100644
> >> --- a/net/ipv4/tcp_output.c
> >> +++ b/net/ipv4/tcp_output.c
> >> @@ -1833,7 +1833,8 @@ unsigned int tcp_current_mss(struct sock *sk)
> >>
> >>          if (dst) {
> >>                  u32 mtu = dst_mtu(dst);
> >> -               if (mtu != inet_csk(sk)->icsk_pmtu_cookie)
> >> +               if (mtu != inet_csk(sk)->icsk_pmtu_cookie ||
> >> +                   (tp->snd_wnd && tp->snd_wnd < mss_now))
> >>                          mss_now = tcp_sync_mss(sk, mtu);
> >
> >
> > I do not think we want to add code in fast path only to cope with
> > pathological user choices.
> >
>
> Thank you very much for your reply!
>
> I very much agree with your point of view. However, the kernel currently
> accepts the unreasonable RCVBUF set by the developer, that is, the set
> RCVBUF is smaller than MSS. Perhaps, we can avoid setting RCVBUF to be
> smaller than MSS in the sock_setsockopt function. What do you think?

I think this is not trivial to make the limit being MSS dependent,
because SO_RCVBUF can be set before connection is attempted.
(So the MSS is not yet known)

I would rather simply add a sysctl to set a floor for TCP SO_RCVBUF,
and default it to a reasonable value (128KB ?)

The sysctl might be needed for some crazy environments (one million
TCP flows on a host with only 8 GB of memory for example...)

>
> Thanks.
>
> > Maybe it is time to accept the fact that setting the receive buffer
> > size to 42KB is dumb nowadays.
> >
> > TCP needs to have at least two buffers in flight to avoid 'strange' stalls.
> >
> > If loopback MTU is 64KB  (and TSO/GRO sizes also can reach 64K), maybe
> > we need to ensure a minimal rcvbuf.
> >
> > If applications set 42KB window size, they get what they asked for :
> > damn slow TCP communications.
> >
> > We do not want to make them happy, while increasing cpu costs for 99%
> > of other uses cases which are not trying to make TCP miserable.
> >
> > I would rather add a sysctl or something to ensure rcvbuf has a
> > minimal sane value, instead of risking subtle TCP regressions.
> >
> > In 2021, we should not limit ourselves to memory constraints that were
> > common 40 years ago when TCP was invented.
> >
> > References :
> > commit 9eb5bf838d06aa6ddebe4aca6b5cedcf2eb53b86 net: sock: fix
> > TCP_SKB_MIN_TRUESIZE
> > commit eea86af6b1e18d6fa8dc959e3ddc0100f27aff9f net: sock: adapt
> > SOCK_MIN_RCVBUF and SOCK_MIN_SNDBUF
> >