netdev - Re: [PATCH net-next 7/8] tcp: stronger sk

Message-ID: <CANn89i+G+46d_sruU-ezOSJJU0SONaN6-GDyXAOg2BVSN9Px1w@mail.gmail.com>
Date: Fri, 19 Dec 2025 11:12:57 +0100
From: Eric Dumazet <edumazet@...gle.com>
To: Christian Ebner <c.ebner@...xmox.com>
Cc: "David S . Miller" <davem@...emloft.net>, Jakub Kicinski <kuba@...nel.org>, 
	Paolo Abeni <pabeni@...hat.com>, Neal Cardwell <ncardwell@...gle.com>, 
	Simon Horman <horms@...nel.org>, Kuniyuki Iwashima <kuniyu@...gle.com>, 
	Willem de Bruijn <willemb@...gle.com>, netdev@...r.kernel.org, eric.dumazet@...il.com, 
	lkolbe@...iuswillert.com
Subject: Re: [PATCH net-next 7/8] tcp: stronger sk_rcvbuf checks

On Fri, Dec 19, 2025 at 11:00 AM Christian Ebner <c.ebner@...xmox.com> wrote:
>
> On 12/19/25 9:45 AM, Eric Dumazet wrote:
> > On Fri, Dec 19, 2025 at 9:23 AM Eric Dumazet <edumazet@...gle.com> wrote:
> >>
> >> On Thu, Dec 18, 2025 at 3:58 PM Christian Ebner <c.ebner@...xmox.com> wrote:
> >>>
> >>> On 12/18/25 2:19 PM, Eric Dumazet wrote:
> >>>> On Thu, Dec 18, 2025 at 1:28 PM Christian Ebner <c.ebner@...xmox.com> wrote:
> >>>>>
> >>>>> Hi Eric,
> >>>>>
> >>>>> thank you for your reply!
> >>>>>
> >>>>> On 12/18/25 11:10 AM, Eric Dumazet wrote:
> >>>>>> Can you give us (on receive side) : cat /proc/sys/net/ipv4/tcp_rmem
> >>>>>
> >>>>> Affected users report they have the respective kernel defaults set, so:
> >>>>> - "4096 131072 6291456"  for v6.17 builds
> >>>>> - "4096 131072 33554432" with the bumped max value of 32M for v6.18 builds
> >>>>>
> >>>>>> It seems your application is enforcing a small SO_RCVBUF ?
> >>>>>
> >>>>> No, we can exclude that since the output of `ss -tim` shows the default
> >>>>> buffer size after the connection is established, growing up to the max
> >>>>> value during traffic (backups being performed).
> >>>>>
> >>>>
> >>>> The trace you provided seems to show a very different picture ?
> >>>>
> >>>> [::ffff:10.xx.xx.aa]:8007
> >>>>          [::ffff:10.xx.xx.bb]:55554
> >>>>             skmem:(r0,rb7488,t0,tb332800,f0,w0,o0,bl0,d20) cubic
> >>>> wscale:10,10 rto:201 rtt:0.085/0.015 ato:40 mss:8948 pmtu:9000
> >>>> rcvmss:7168 advmss:8948 cwnd:10 bytes_sent:937478 bytes_acked:937478
> >>>> bytes_received:1295747055 segs_out:301010 segs_in:162410
> >>>> data_segs_out:1035 data_segs_in:161588 send 8.42Gbps lastsnd:3308
> >>>> lastrcv:191 lastack:191 pacing_rate 16.7Gbps delivery_rate 2.74Gbps
> >>>> delivered:1036 app_limited busy:437ms rcv_rtt:207.551 rcv_space:96242
> >>>> rcv_ssthresh:903417 minrtt:0.049 rcv_ooopack:23 snd_wnd:142336 rcv_wnd:7168
> >>>>
> >>>> rb7488 would suggest the application has played with a very small SO_RCVBUF,
> >>>> or some memory allocation constraint (memcg ?)
> >>>
> >>> Thanks for the hint where to look, however we checked that the process is
> >>> not memory constrained and the host has no memory pressure.
> >>>
> >>> Also `strace -f -e socket,setsockopt -p $(pidof proxmox-backup-proxy)`
> >>> shows no syscalls which would change the socket buffer size (though this
> >>> still needs to be double checked by affected users for completeness).
> >>>
> >>> Further, the stalls most often happen mid transfer, starting with the
> >>> expected throughput, and may even recover from the stall after some
> >>> time, continuing at regular speed again.
> >>>
> >>>
> >>> Status update for v6.18
> >>> -----------------------
> >>>
> >>> In the meantime, a user reported 2 stale connections with running kernel
> >>> 6.18+416dd649f3aa
> >>>
> >>> The tcpdump pattern looks slightly different, here we got repeating
> >>> sequences of:
> >>> ```
> >>> 224     5.407981        10.xx.xx.bb     10.xx.xx.aa     TCP     4162    40068 → 8007 [PSH, ACK]
> >>> Seq=106497 Ack=1 Win=3121 Len=4096 TSval=3198115973 TSecr=3048094015
> >>> 225     5.408064        10.xx.xx.aa     10.xx.xx.bb     TCP     66      8007 → 40068 [ACK] Seq=1
> >>> Ack=110593 Win=4 Len=0 TSval=3048094223 TSecr=3198115973
> >>> ```
> >>>
> >>> The perf trace for `tcp:tcp_rcvbuf_grow` came back empty while in stale
> >>> state, tracing with:
> >>> ```
> >>> perf record -a -e tcp:tcp_rcv_space_adjust,tcp:tcp_rcvbuf_grow
> >>> perf script
> >>> ```
> >>> produced some output as shown below, so it seems that tcp_rcvbuf_grow()
> >>> is never called in that case, while tcp_rcv_space_adjust() is.
> >>
> >> Autotuning is not enabled for your case, somehow the application is
> >> not behaving as expected,
>
> Is there a way for us to check if autotuning is enabled for the TCP
> connection at this point in time? Some tracepoint to identify it being
> deactivated?

tcp_rcv_space_adjust() has a tracepoint.

You can also use bpftrace to collect more fields from TCP sockets.

If trace_tcp_rcvbuf_grow() is not called, then the application drains
its receive queue too slowly for autotuning to kick in, or the sender
is limited.


>
> >> so maybe you have to change tcp_rmem[2] if a driver is allocating
> >> order-2 pages for the 9K frames.
>
> Same here, is there a way for us to check this? Note however that we
> could not identify a specific NIC/driver to cause the behavior, it
> appears for various vendor models.

I don't have this issue using regular tcp_stream tests and 9K traffic.
Can you try standard programs instead of in-house ones ?
(netperf, neper, iperf3...)

Use a bpftrace program to gather tp->scaling_ratio:

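bpftrace -e '
k:tcp_rcv_space_adjust {
  // arg0 is the struct sock * passed to tcp_rcv_space_adjust()
  $sk = (struct sock *)arg0;
  // only look at sockets with a small receive buffer
  if ($sk->sk_rcvbuf > 20000) { return ; }
  $tp = (struct tcp_sock *)arg0;
  // histogram of scaling_ratio values seen on those sockets
  @scaling[$tp->scaling_ratio] = count();
}
'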
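
For reading the @scaling histogram, here is a minimal sketch of how
scaling_ratio is understood to turn receive buffer space into usable
window. It assumes the kernel's fixed-point convention, where the ratio
is expressed in 1/256 units (so 128 would mean 50%); that convention and
the shift value are assumptions to verify against include/net/tcp.h for
the kernel in use. The name win_from_space is hypothetical and only
used for illustration.

```python
# Sketch only: assumes tp->scaling_ratio is a fixed-point fraction in
# 1/256 units; verify against include/net/tcp.h for your kernel version.
RMEM_TO_WIN_SCALE = 8  # assumed shift (ratio / 256)


def win_from_space(space: int, scaling_ratio: int) -> int:
    """Approximate window bytes usable out of `space` bytes of rcvbuf."""
    return (space * scaling_ratio) >> RMEM_TO_WIN_SCALE


if __name__ == "__main__":
    # Compare a few ratios against the default tcp_rmem[1] of 131072.
    for ratio in (64, 128, 192):
        print(ratio, win_from_space(131072, ratio))
```

Under these assumptions, a low ratio means a large share of sk_rcvbuf
goes to per-skb overhead rather than payload, which is the situation
that could arise if a driver allocates order-2 pages for 9K frames.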


>
> >
> > I meant to say : change tcp_rmem[1]
> >
> > echo "4096 262144 33554432" >/proc/sys/net/ipv4/tcp_rmem
>
> Okay, thanks for the suggestion, let me get back to you with results if
> this changes anything.
>
>
> >> You have not given what was on the sender side (Linux or other stack ?)
>
> Clients are all Linux hosts, running kernel versions 6.8, 6.14 or 6.17.
> No other TCP stacks.
>
> Best regards,
> Christian Ebner
>