[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CADVnQyk4=w-Qbw24na3_09SRfbP3w+W9trzhd2vOLfeVui8-BA@mail.gmail.com>
Date: Tue, 18 Nov 2025 09:04:14 -0500
From: Neal Cardwell <ncardwell@...gle.com>
To: Eric Dumazet <edumazet@...gle.com>
Cc: "David S . Miller" <davem@...emloft.net>, Jakub Kicinski <kuba@...nel.org>,
Paolo Abeni <pabeni@...hat.com>, Simon Horman <horms@...nel.org>,
Kuniyuki Iwashima <kuniyu@...gle.com>, netdev@...r.kernel.org, eric.dumazet@...il.com
Subject: Re: [PATCH net-next 2/2] tcp: add net.ipv4.tcp_rtt_threshold sysctl
On Mon, Nov 17, 2025 at 8:28 AM Eric Dumazet <edumazet@...gle.com> wrote:
>
> This is a follow up of commit aa251c84636c ("tcp: fix too slow
> tcp_rcvbuf_grow() action") which brought again the issue that I tried
> to fix in commit 65c5287892e9 ("tcp: fix sk_rcvbuf overshoot")
>
> We also recently increased tcp_rmem[2] to 32 MB in commit 572be9bf9d0d
> ("tcp: increase tcp_rmem[2] to 32 MB")
>
> Idea of this patch is to not let tcp_rcvbuf_grow() grow sk->sk_rcvbuf
> too fast for small RTT flows. If sk->sk_rcvbuf is too big, this can
> force NIC driver to not recycle pages from the page pool, and also
> can cause cache evictions for DDIO enabled cpus/NIC, as receivers
> are usually slower than senders.
>
> Add net.ipv4.tcp_rtt_threshold sysctl, set by default to 1000 usec (1 ms)
> If RTT if smaller than the sysctl value, use the RTT/tcp_rtt_threshold
> ratio to control sk_rcvbuf inflation.
>
> Signed-off-by: Eric Dumazet <edumazet@...gle.com>
> ---
> Documentation/networking/ip-sysctl.rst | 10 ++++++++++
> .../net_cachelines/netns_ipv4_sysctl.rst | 1 +
> include/net/netns/ipv4.h | 1 +
> net/core/net_namespace.c | 2 ++
> net/ipv4/sysctl_net_ipv4.c | 9 +++++++++
> net/ipv4/tcp_input.c | 18 ++++++++++++++----
> net/ipv4/tcp_ipv4.c | 1 +
> 7 files changed, 38 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
> index 2bae61be18593a8111a83d9f034517e4646eb653..ce2a223e17a61b40fc35b2528c8ee4cf8f750993 100644
> --- a/Documentation/networking/ip-sysctl.rst
> +++ b/Documentation/networking/ip-sysctl.rst
> @@ -673,6 +673,16 @@ tcp_moderate_rcvbuf - BOOLEAN
>
> Default: 1 (enabled)
>
> +tcp_rtt_threshold - INTEGER
> + rcvbuf autotuning can over estimate final socket rcvbuf, which
> + can lead to cache trashing for high throughput flows.
> +
> + For small RTT flows (below tcp_rtt_threshold usecs), we can relax
> + rcvbuf growth: Few additional ms to reach the final (and smaller)
> + rcvbuf is a good tradeoff.
> +
> + Default : 1000 (1 ms)
Thanks, Eric! The logic of this code looks good to me.
For the name of the sysctl, perhaps we can pick something more
specific than "tcp_rtt_threshold", to clarify to the reader what the
RTT threshold is used for?
And if the name is more specific about what the threshold is for, then
in the future if we want RTT thresholds for other behavior (e.g.,
tuning the tcp_tso_rtt_log code or other future RTT-based
mechanisms?), then it will be easier to add those future RTT
thresholds without the names seeming confusing and error-prone?
With the existing "tcp_moderate_rcvbuf" sysctl in mind, how about a
name like "tcp_rcvbuf_low_rtt"?
Then the description in ip-sysctl.rst could read as something like:
"We can relax rcvbuf growth for low RTT flows (with RTT below
tcp_rcvbuf_low_rtt usecs):".
WDYT?
neal
Powered by blists - more mailing lists