lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <PH7PR21MB31160E283AF982AE4D675BC3CA54A@PH7PR21MB3116.namprd21.prod.outlook.com>
Date:   Mon, 12 Jun 2023 22:05:38 +0000
From:   Haiyang Zhang <haiyangz@...rosoft.com>
To:     Kuniyuki Iwashima <kuniyu@...zon.com>
CC:     "ncardwell@...gle.com" <ncardwell@...gle.com>,
        "atenart@...nel.org" <atenart@...nel.org>,
        "bagasdotme@...il.com" <bagasdotme@...il.com>,
        "corbet@....net" <corbet@....net>,
        "davem@...emloft.net" <davem@...emloft.net>,
        "dsahern@...nel.org" <dsahern@...nel.org>,
        "edumazet@...gle.com" <edumazet@...gle.com>,
        "kuba@...nel.org" <kuba@...nel.org>,
        KY Srinivasan <kys@...rosoft.com>,
        "linux-doc@...r.kernel.org" <linux-doc@...r.kernel.org>,
        "linux-hyperv@...r.kernel.org" <linux-hyperv@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "liushixin2@...wei.com" <liushixin2@...wei.com>,
        "maheshb@...gle.com" <maheshb@...gle.com>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        "olaf@...fle.de" <olaf@...fle.de>,
        "pabeni@...hat.com" <pabeni@...hat.com>,
        "simon.horman@...igine.com" <simon.horman@...igine.com>,
        "soheil@...gle.com" <soheil@...gle.com>,
        "stephen@...workplumber.org" <stephen@...workplumber.org>,
        Tim Gardner <tim.gardner@...onical.com>,
        "vkuznets@...hat.com" <vkuznets@...hat.com>,
        "weiwan@...gle.com" <weiwan@...gle.com>,
        "ycheng@...gle.com" <ycheng@...gle.com>,
        "ykaliuta@...hat.com" <ykaliuta@...hat.com>
Subject: RE: [PATCH net-next] tcp: Make pingpong threshold tunable



> -----Original Message-----
> From: Kuniyuki Iwashima <kuniyu@...zon.com>
> Sent: Friday, June 9, 2023 3:54 PM
> To: Haiyang Zhang <haiyangz@...rosoft.com>
> Cc: ncardwell@...gle.com; atenart@...nel.org; bagasdotme@...il.com;
> corbet@....net; davem@...emloft.net; dsahern@...nel.org;
> edumazet@...gle.com; kuba@...nel.org; kuniyu@...zon.com; KY
> Srinivasan <kys@...rosoft.com>; linux-doc@...r.kernel.org; linux-
> hyperv@...r.kernel.org; linux-kernel@...r.kernel.org;
> liushixin2@...wei.com; maheshb@...gle.com; netdev@...r.kernel.org;
> olaf@...fle.de; pabeni@...hat.com; simon.horman@...igine.com;
> soheil@...gle.com; stephen@...workplumber.org; Tim Gardner
> <tim.gardner@...onical.com>; vkuznets@...hat.com; weiwan@...gle.com;
> ycheng@...gle.com; ykaliuta@...hat.com
> Subject: Re: [PATCH net-next] tcp: Make pingpong threshold tunable
> 
> [You don't often get email from kuniyu@...zon.com. Learn why this is
> important at https://aka.ms/LearnAboutSenderIdentification ]
> 
> From: Neal Cardwell <ncardwell@...gle.com>
> Date: Fri, 9 Jun 2023 15:16:00 -0400
> > On Fri, Jun 9, 2023 at 12:26 PM Haiyang Zhang <haiyangz@...rosoft.com>
> wrote:
> >
> > Regarding the patch title:
> > > [PATCH net-next] tcp: Make pingpong threshold tunable
> >
> > There are many ways to make something tunable these days, including
> > BPF, setsockopt(), and sysctl. :-) This patch only uses sysctl. Please
> > consider a more clear/specific title, like:
> >
> >    [PATCH net-next] tcp: set pingpong threshold via sysctl
> >
> > > TCP pingpong threshold is 1 by default. But some applications, like SQL DB
> > > may prefer a higher pingpong threshold to activate delayed acks in quick
> > > ack mode for better performance.
> > >
> > > The pingpong threshold and related code were changed to 3 in the year
> > > 2019, and reverted to 1 in the year 2022.
> >
> > Please include the specific commit, like:
> >
> > The pingpong threshold and related code were changed to 3 in the year
> >  2019 in:
> >    commit 4a41f453bedf ("tcp: change pingpong threshold to 3")
> > and reverted to 1 in the year 2022 in:
> >   commit 4d8f24eeedc5 ("Revert "tcp: change pingpong threshold to 3"")
> >
> > Then please make sure to use scripts/checkpatch.pl on your resulting
> > patch to check the formatting of the commit references, among other
> > things.
> >
> > > There is no single value that
> > > fits all applications.
> > >
> > > Add net.core.tcp_pingpong_thresh sysctl tunable,
> >
> > For consistency, TCP sysctls should be in net.ipv4 rather than
> > net.core. Yes, that is awkward, given IPv6 support. But consistency is
> > very important here. :-)
> >
> > > so it can be tuned for
> > > optimal performance based on the application needs.
> > >
> > > Signed-off-by: Haiyang Zhang <haiyangz@...rosoft.com>
> > > ---
> > >  Documentation/admin-guide/sysctl/net.rst |  8 ++++++++
> > >  include/net/inet_connection_sock.h       | 14 +++++++++++---
> > >  net/core/sysctl_net_core.c               |  9 +++++++++
> > >  net/ipv4/tcp.c                           |  2 ++
> > >  net/ipv4/tcp_output.c                    | 17 +++++++++++++++--
> > >  5 files changed, 45 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/Documentation/admin-guide/sysctl/net.rst
> b/Documentation/admin-guide/sysctl/net.rst
> > > index 4877563241f3..16f54be9461f 100644
> > > --- a/Documentation/admin-guide/sysctl/net.rst
> > > +++ b/Documentation/admin-guide/sysctl/net.rst
> > > @@ -413,6 +413,14 @@ historical importance.
> > >
> > >  Default: 0
> > >
> > > +tcp_pingpong_thresh
> > > +-------------------
> > > +
> > > +TCP pingpong threshold is 1 by default, but some application may need a
> higher
> > > +threshold for optimal performance.
> > > +
> > > +Default: 1, min: 1, max: 3
> >
> > If we want to make this tunable, it seems sad to make the max 3. I'd
> > suggest making the max 255, since we have 8 bits of space anyway in
> > the inet_csk(sk)->icsk_ack.pingpong field.
> >
> > > +
> > >  2. /proc/sys/net/unix - Parameters for Unix domain sockets
> > >  ----------------------------------------------------------
> > >
> > > diff --git a/include/net/inet_connection_sock.h
> b/include/net/inet_connection_sock.h
> > > index c2b15f7e5516..e84e33ddae49 100644
> > > --- a/include/net/inet_connection_sock.h
> > > +++ b/include/net/inet_connection_sock.h
> > > @@ -324,11 +324,11 @@ void inet_csk_update_fastreuse(struct
> inet_bind_bucket *tb,
> > >
> > >  struct dst_entry *inet_csk_update_pmtu(struct sock *sk, u32 mtu);
> > >
> > > -#define TCP_PINGPONG_THRESH    1
> > > +extern int tcp_pingpong_thresh;
> >
> > To match most TCP sysctls, this should be per-namespace, rather than
> global.
> 
> Also, please change int to u8.
> 
> 
> >
> > Please follow a recent example by Eric, perhaps:
> >  65466904b015f6eeb9225b51aeb29b01a1d4b59c
> >   tcp: adjust TSO packet sizes based on min_rtt
> >
> >
> > >
> > >  static inline void inet_csk_enter_pingpong_mode(struct sock *sk)
> > >  {
> > > -       inet_csk(sk)->icsk_ack.pingpong = TCP_PINGPONG_THRESH;
> > > +       inet_csk(sk)->icsk_ack.pingpong = tcp_pingpong_thresh;
> > >  }
> >
> >   inet_csk(sk)->icsk_ack.pingpong =  sock_net(sk)-
> >sysctl_tcp_pingpong_thresh;
> 
> Let's use READ_ONCE(sock_net(sk)->sysctl_tcp_pingpong_thresh).
> Same for other sysctl reads.
> 
> 
> >
> > >  static inline void inet_csk_exit_pingpong_mode(struct sock *sk)
> > > @@ -338,7 +338,15 @@ static inline void
> inet_csk_exit_pingpong_mode(struct sock *sk)
> > >
> > >  static inline bool inet_csk_in_pingpong_mode(struct sock *sk)
> > >  {
> > > -       return inet_csk(sk)->icsk_ack.pingpong >= TCP_PINGPONG_THRESH;
> > > +       return inet_csk(sk)->icsk_ack.pingpong >= tcp_pingpong_thresh;
> > > +}
> >
> > Again, sock_net(sk)->sysctl_tcp_pingpong_thresh rather than
> tcp_pingpong_thresh.
> >
> > > +static inline void inet_csk_inc_pingpong_cnt(struct sock *sk)
> > > +{
> > > +       struct inet_connection_sock *icsk = inet_csk(sk);
> > > +
> > > +       if (icsk->icsk_ack.pingpong < U8_MAX)
> > > +               icsk->icsk_ack.pingpong++;
> > >  }
> > >
> > >  static inline bool inet_csk_has_ulp(struct sock *sk)
> > > diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
> > > index 782273bb93c2..b5253567f2bd 100644
> > > --- a/net/core/sysctl_net_core.c
> > > +++ b/net/core/sysctl_net_core.c
> > > @@ -653,6 +653,15 @@ static struct ctl_table net_core_table[] = {
> >
> > Again, in net.ipv4, not net.core.
> >
> > >                 .proc_handler   = proc_dointvec_minmax,
> > >                 .extra1         = SYSCTL_ZERO,
> > >         },
> > > +       {
> > > +               .procname       = "tcp_pingpong_thresh",
> > > +               .data           = &tcp_pingpong_thresh,
> > > +               .maxlen         = sizeof(int),
> > > +               .mode           = 0644,
> > > +               .proc_handler   = proc_dointvec_minmax,
> > > +               .extra1         = SYSCTL_ONE,
> > > +               .extra2         = SYSCTL_THREE,
> >
> > Please make the max U8_MAX to allow more flexibility (since we have 8
> > bits of space anyway in the inet_csk(sk)->icsk_ack.pingpong field).
> 
> Please use proc_dou8vec_minmax(), then you can drop .extra2.
> 
>                 .maxlen         = sizeof(u8),
>                 .mode           = 0644,
>                 .proc_handler   = proc_dou8vec_minmax,
>                 .extra1         = SYSCTL_ONE,
> 
> Thanks,
> Kuniyuki
> 
> >
> > > +       },
> > >         { }
> > >  };
> > >
> > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > > index 53b7751b68e1..dcd143193d41 100644
> > > --- a/net/ipv4/tcp.c
> > > +++ b/net/ipv4/tcp.c
> > > @@ -308,6 +308,8 @@ EXPORT_SYMBOL(tcp_have_smc);
> > >  struct percpu_counter tcp_sockets_allocated
> ____cacheline_aligned_in_smp;
> > >  EXPORT_SYMBOL(tcp_sockets_allocated);
> > >
> > > +int tcp_pingpong_thresh __read_mostly = 1;
> > > +
> >
> > Again, per-network-namespace. You will need to initialize the
> > per-netns value in tcp_sk_init(). Again, see Eric's
> > 65466904b015f6eeb9225b51aeb29b01a1d4b59c commit for an example.
> >
> > >   * TCP splice context
> > >   */
> > > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > > index cfe128b81a01..576d21621778 100644
> > > --- a/net/ipv4/tcp_output.c
> > > +++ b/net/ipv4/tcp_output.c
> > > @@ -167,12 +167,25 @@ static void tcp_event_data_sent(struct
> tcp_sock *tp,
> > >         if (tcp_packets_in_flight(tp) == 0)
> > >                 tcp_ca_event(sk, CA_EVENT_TX_START);
> > >
> > > +       /* If tcp_pingpong_thresh > 1, and
> > > +        * this is the first data packet sent in response to the
> > > +        * previous received data,
> > > +        * and it is a reply for ato after last received packet,
> > > +        * increase pingpong count.
> > > +        */
> > > +       if (tcp_pingpong_thresh > 1 &&
> > > +           before(tp->lsndtime, icsk->icsk_ack.lrcvtime) &&
> > > +           (u32)(now - icsk->icsk_ack.lrcvtime) < icsk->icsk_ack.ato)
> > > +               inet_csk_inc_pingpong_cnt(sk);
> > > +
> >
> > Introducing this new code re-introduces a bug fixed in 4d8f24eeedc5.
> > As that commit description noted:
> >
> >     This to-be-reverted commit was meant to apply a stricter rule for the
> >     stack to enter pingpong mode. However, the condition used to check for
> >     interactive session "before(tp->lsndtime, icsk->icsk_ack.lrcvtime)" is
> >     jiffy based and might be too coarse, which delays the stack entering
> >     pingpong mode.
> >     We revert this patch so that we no longer use the above condition to
> >     determine interactive session,
> >
> > >         tp->lsndtime = now;
> > >
> > > -       /* If it is a reply for ato after last received
> > > +       /* If tcp_pingpong_thresh == 1, and
> >
> > Please remove the "If tcp_pingpong_thresh == 1, and" part, since this
> > is the correct code path no matter the value of the threshold.
> >
> > > +        * it is a reply for ato after last received
> > >          * packet, enter pingpong mode.
> > >          */
> > > -       if ((u32)(now - icsk->icsk_ack.lrcvtime) < icsk->icsk_ack.ato)
> > > +       if (tcp_pingpong_thresh == 1 &&
> >
> > Please remove the "if (tcp_pingpong_thresh == 1 &&" part, since this
> > is the correct code path no matter the value of the threshold.
> >
> > > +           (u32)(now - icsk->icsk_ack.lrcvtime) < icsk->icsk_ack.ato)
> > >                 inet_csk_enter_pingpong_mode(sk);
> >
> > Please make this call inet_csk_inc_pingpong_cnt(), since this is the
> > correct code path no matter the value of the threshold.

Thank Neal and Kuniyuki for the detailed reviews! I will update the code
accordingly. 

- Haiyang

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ