[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CANn89iLeRbVWxBOXRRmLNqiF8Lz3-iykCzkVE=_JR7GEycyhSw@mail.gmail.com>
Date: Tue, 18 Nov 2025 08:10:10 -0800
From: Eric Dumazet <edumazet@...gle.com>
To: Neal Cardwell <ncardwell@...gle.com>, Rick Jones <jonesrick@...gle.com>
Cc: "David S . Miller" <davem@...emloft.net>, Jakub Kicinski <kuba@...nel.org>,
Paolo Abeni <pabeni@...hat.com>, Simon Horman <horms@...nel.org>,
Kuniyuki Iwashima <kuniyu@...gle.com>, netdev@...r.kernel.org, eric.dumazet@...il.com
Subject: Re: [PATCH net-next 2/2] tcp: add net.ipv4.tcp_rtt_threshold sysctl
On Tue, Nov 18, 2025 at 6:04 AM Neal Cardwell <ncardwell@...gle.com> wrote:
>
> On Mon, Nov 17, 2025 at 8:28 AM Eric Dumazet <edumazet@...gle.com> wrote:
> >
> > This is a follow up of commit aa251c84636c ("tcp: fix too slow
> > tcp_rcvbuf_grow() action") which brought again the issue that I tried
> > to fix in commit 65c5287892e9 ("tcp: fix sk_rcvbuf overshoot")
> >
> > We also recently increased tcp_rmem[2] to 32 MB in commit 572be9bf9d0d
> > ("tcp: increase tcp_rmem[2] to 32 MB")
> >
> > Idea of this patch is to not let tcp_rcvbuf_grow() grow sk->sk_rcvbuf
> > too fast for small RTT flows. If sk->sk_rcvbuf is too big, this can
> > force NIC driver to not recycle pages from the page pool, and also
> > can cause cache evictions for DDIO enabled cpus/NIC, as receivers
> > are usually slower than senders.
> >
> > Add net.ipv4.tcp_rtt_threshold sysctl, set by default to 1000 usec (1 ms)
> > If RTT if smaller than the sysctl value, use the RTT/tcp_rtt_threshold
> > ratio to control sk_rcvbuf inflation.
> >
> > Signed-off-by: Eric Dumazet <edumazet@...gle.com>
> > ---
> > Documentation/networking/ip-sysctl.rst | 10 ++++++++++
> > .../net_cachelines/netns_ipv4_sysctl.rst | 1 +
> > include/net/netns/ipv4.h | 1 +
> > net/core/net_namespace.c | 2 ++
> > net/ipv4/sysctl_net_ipv4.c | 9 +++++++++
> > net/ipv4/tcp_input.c | 18 ++++++++++++++----
> > net/ipv4/tcp_ipv4.c | 1 +
> > 7 files changed, 38 insertions(+), 4 deletions(-)
> >
> > diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
> > index 2bae61be18593a8111a83d9f034517e4646eb653..ce2a223e17a61b40fc35b2528c8ee4cf8f750993 100644
> > --- a/Documentation/networking/ip-sysctl.rst
> > +++ b/Documentation/networking/ip-sysctl.rst
> > @@ -673,6 +673,16 @@ tcp_moderate_rcvbuf - BOOLEAN
> >
> > Default: 1 (enabled)
> >
> > +tcp_rtt_threshold - INTEGER
> > + rcvbuf autotuning can over estimate final socket rcvbuf, which
> > + can lead to cache trashing for high throughput flows.
> > +
> > + For small RTT flows (below tcp_rtt_threshold usecs), we can relax
> > + rcvbuf growth: Few additional ms to reach the final (and smaller)
> > + rcvbuf is a good tradeoff.
> > +
> > + Default : 1000 (1 ms)
>
> Thanks, Eric! The logic of this code looks good to me.
>
> For the name of the sysctl, perhaps we can pick something more
> specific than "tcp_rtt_threshold", to clarify to the reader what the
> RTT threshold is used for?
I forgot Rick Jones suggested tcp_autotune_rtt_threshold .
>
> And if the name is more specific about what the threshold is for, then
> in the future if we want RTT thresholds for other behavior (e.g.,
> tuning the tcp_tso_rtt_log code or other future RTT-based
> mechanisms?), then it will be easier to add those future RTT
> thresholds without the names seeming confusing and error-prone?
>
> With the existing "tcp_moderate_rcvbuf" sysctl in mind, how about a
> name like "tcp_rcvbuf_low_rtt"?
I am not good at names, maybe we should reconcile RIck suggestion with yours ?
>
> Then the description in ip-sysctl.rst could read as something like:
> "We can relax rcvbuf growth for low RTT flows (with RTT below
> tcp_rcvbuf_low_rtt usecs):".
>
> WDYT?
> neal
Powered by blists - more mailing lists