Message-ID: <CANn89iJpinbbrU2YxiWNQa9b2vQ035A75g00hWFLce-CAJi9hA@mail.gmail.com>
Date: Wed, 5 Nov 2025 08:37:48 -0800
From: Eric Dumazet <edumazet@...gle.com>
To: Neal Cardwell <ncardwell@...gle.com>
Cc: "David S . Miller" <davem@...emloft.net>, Jakub Kicinski <kuba@...nel.org>,
Paolo Abeni <pabeni@...hat.com>, Simon Horman <horms@...nel.org>,
Kuniyuki Iwashima <kuniyu@...gle.com>, netdev@...r.kernel.org, eric.dumazet@...il.com
Subject: Re: [PATCH net-next] tcp: add net.ipv4.tcp_comp_sack_rtt_percent
On Wed, Nov 5, 2025 at 6:41 AM Neal Cardwell <ncardwell@...gle.com> wrote:
>
> On Wed, Nov 5, 2025 at 4:38 AM Eric Dumazet <edumazet@...gle.com> wrote:
> >
> > TCP SACK compression was added in 2018 in commit
> > 5d9f4262b7ea ("tcp: add SACK compression").
> >
> > It works great for WAN flows (with large RTT).
> > Wifi in particular gets a significant boost _when_ ACKs are suppressed.
> >
> > Add a new sysctl so that we can tune the very conservative 5% value
> > that has been used so far in this formula, so that small-RTT flows
> > can also benefit from this feature:
> >
> >     delay = min(5% of SRTT, 1 ms)
> >
> > This patch adds the new tcp_comp_sack_rtt_percent sysctl
> > to ease experiments and tuning.
> >
> > Given that we cap the delay to 1ms (tcp_comp_sack_delay_ns sysctl),
> > set the default value to 100.
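> >
> > As a rough sketch (hypothetical variable names, since the tcp_input.c
> > hunk is trimmed from this quote), the receiver-side delay computation
> > becomes something like:
> >
> >	/* srtt_ns: smoothed RTT converted to nanoseconds. */
> >	percent = READ_ONCE(net->ipv4.sysctl_tcp_comp_sack_rtt_percent);
> >	delay = min_t(unsigned long,
> >		      READ_ONCE(net->ipv4.sysctl_tcp_comp_sack_delay_ns),
> >		      div_u64((u64)srtt_ns * percent, 100));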
> >
> > Signed-off-by: Eric Dumazet <edumazet@...gle.com>
> > ---
> > Documentation/networking/ip-sysctl.rst | 13 +++++++++++--
> > include/net/netns/ipv4.h | 1 +
> > net/ipv4/sysctl_net_ipv4.c | 9 +++++++++
> > net/ipv4/tcp_input.c | 26 ++++++++++++++++++--------
> > net/ipv4/tcp_ipv4.c | 1 +
> > 5 files changed, 40 insertions(+), 10 deletions(-)
> >
> > diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
> > index 7cd35bfd39e68c5b2650eb9d0fbb76e34aed3f2b..ebc11f593305bf87e7d4ad4d50ef085b22aef7da 100644
> > --- a/Documentation/networking/ip-sysctl.rst
> > +++ b/Documentation/networking/ip-sysctl.rst
> > @@ -854,9 +854,18 @@ tcp_sack - BOOLEAN
> >
> > Default: 1 (enabled)
> >
> > +tcp_comp_sack_rtt_percent - INTEGER
> > + Percentage of SRTT used for the compressed SACK feature.
> > + See tcp_comp_sack_nr, tcp_comp_sack_delay_ns, tcp_comp_sack_slack_ns.
> > +
> > + Possible values : 1 - 1000
> > +
> > + Default : 100 %
>
> Overall the patch looks great to me, but for the default value I would
> suggest 33% rather than 100%.
>
> AFAICT, basically, in data center environments with RTT < 1ms, if
> tcp_comp_sack_rtt_percent is 100% and we allow the data receiver to
> wait a whole SRTT before it sends an ACK, then the data sender is
> likely to have fully used up its cwnd at the point that the receiver
> finally sends the ACK at the end of the SRTT. That means that for the
> entire time that the ACK is traveling from the data receiver to the
> data sender, the data sender has no permission (from congestion
> control) to send. So the "pipe" (data sender -> data receiver network
> path) will have an idle bubble for a long time while the data sender
> is waiting for the ACK. In these cases, the system would lose
> pipelining and would end up in a "stop and wait" mode.
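>
> To put rough, made-up numbers on it: with SRTT = 90us and a 100%
> delay, the receiver may hold the SACK for up to ~90us after the last
> data segment arrives, and that ACK then needs another ~45us (one way)
> to reach the sender; a cwnd-limited sender can thus sit idle for a
> large fraction of each round trip. With 33%, an ACK leaves roughly
> every 30us, so the sender receives fresh cwnd credit about 3 times
> per SRTT and the pipe stays full.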
>
> The rationale for 33% is basically to try to facilitate pipelining,
> where there are always at least 3 ACKs and 3 GSO/TSO skbs per SRTT, so
> that the path can maintain a budget for 3 full-sized GSO/TSO skbs "in
> flight" at all times:
>
> + 1 skb in the qdisc waiting to be sent by the NIC next
> + 1 skb being sent by the NIC (being serialized by the NIC out onto the wire)
> + 1 skb being received and aggregated by the receiver machine's
> aggregation mechanism (some combination of LRO, GRO, and sack
> compression)
>
> Note that this is basically the same magic number (3) and the same
> rationales as:
>
> (a) tcp_tso_should_defer() ensuring that we defer sending data for no
> longer than cwnd/tcp_tso_win_divisor (where tcp_tso_win_divisor = 3),
> and
> (b) bbr_quantization_budget() ensuring that cwnd is at least 3 GSO/TSO
> skbs to maintain pipelining and full throughput at low RTTs
>
> It is also similar in spirit to your 2014 patch that limits GSO skbs
> to half the cwnd, again to help maintain pipelining:
>
> tcp: limit GSO packets to half cwnd
> d649a7a81f3b5bacb1d60abd7529894d8234a666
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d649a7a81f3b5bacb1d60abd7529894d8234a666
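>
> (For reference, the tcp_tso_win_divisor check in tcp_tso_should_defer()
> looks roughly like this -- paraphrased from net/ipv4/tcp_output.c, not
> an exact quote:)
>
>	win_divisor = READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_tso_win_divisor);
>	if (win_divisor) {
>		u32 chunk = min(tp->snd_wnd, tcp_snd_cwnd(tp) * tp->mss_cache);
>
>		/* If at least 1/win_divisor (1/3 by default) of the cwnd
>		 * is available, send now instead of deferring further.
>		 */
>		chunk /= win_divisor;
>		if (limit >= chunk)
>			goto send_now;
>	}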
>
> WDYT?
This all makes sense, Neal, thank you!
I was mainly focused on reordering issues more than on actual drops :)