lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20250715183335.860529-1-kuniyu@google.com>
Date: Tue, 15 Jul 2025 18:33:12 +0000
From: Kuniyuki Iwashima <kuniyu@...gle.com>
To: fan.yu9@....com.cn
Cc: davem@...emloft.net, dsahern@...nel.org, edumazet@...gle.com, 
	he.peilin@....com.cn, horms@...nel.org, jiang.kun2@....com.cn, 
	kuba@...nel.org, kuniyu@...gle.com, linux-kernel@...r.kernel.org, 
	linux-trace-kernel@...r.kernel.org, ncardwell@...gle.com, 
	netdev@...r.kernel.org, pabeni@...hat.com, qiu.yutan@....com.cn, 
	tu.qiang35@....com.cn, wang.yaxin@....com.cn, xu.xin16@....com.cn, 
	yang.yang29@....com.cn
Subject: Re: [PATCH net-next v5] tcp: trace retransmit failures in tcp_retransmit_skb

[ updated my email address, please use this one for v6+ ]

From: <fan.yu9@....com.cn>
Date: Tue, 15 Jul 2025 23:52:49 +0800 (CST)
> From: Fan Yu <fan.yu9@....com.cn>
> 
> Background
> ==========
> When TCP retransmits a packet due to missing ACKs, the
> retransmission may fail for various reasons (e.g., packets
> stuck in driver queues, receiver zero windows, or routing issues).
> 
> The original tcp_retransmit_skb tracepoint:
> &apos;commit e086101b150a ("tcp: add a tracepoint for tcp retransmission")&apos;

Looks like your email client converts "'" into "&apos;" and
corrupts the diff, see below.


> lacks visibility into these failure causes, making production
> diagnostics difficult.
> 
> Solution
> ========
> Adds the retval("err") to the tcp_retransmit_skb tracepoint.
> Enables users to know why some tcp retransmission failed and
> users can filter retransmission failures by retval.
> 
> Compatibility description
> =========================
> This patch extends the tcp_retransmit_skb tracepoint
> by adding a new "err" field at the end of its
> existing structure (within TP_STRUCT__entry). The
> compatibility implications are detailed as follows:
> 
> 1) Structural compatibility for legacy user-space tools
> 
> Legacy tools/BPF programs accessing existing fields
> (by offset or name) can still work without modification
> or recompilation.The new field is appended to the end,
> preserving original memory layout.
> 
> 2) Note: semantic changes
> 
> The original tracepoint primarily only focused on
> successfully retransmitted packets. With this patch,
> the tracepoint now can figure out packets that may
> terminate early due to specific reasons. For accurate
> statistics, users should filter using "err" to
> distinguish outcomes.
> 
> Before patched:
> # cat /sys/kernel/debug/tracing/events/tcp/tcp_retransmit_skb/format
> field:const void * skbaddr; offset:8; size:8; signed:0;
> field:const void * skaddr; offset:16; size:8; signed:0;
> field:int state; offset:24; size:4; signed:1;
> field:__u16 sport; offset:28; size:2; signed:0;
> field:__u16 dport; offset:30; size:2; signed:0;
> field:__u16 family; offset:32; size:2; signed:0;
> field:__u8 saddr[4]; offset:34; size:4; signed:0;
> field:__u8 daddr[4]; offset:38; size:4; signed:0;
> field:__u8 saddr_v6[16]; offset:42; size:16; signed:0;
> field:__u8 daddr_v6[16]; offset:58; size:16; signed:0;
> 
> print fmt: "skbaddr=%p skaddr=%p family=%s sport=%hu dport=%hu saddr=%pI4 daddr=%pI4 saddrv6=%pI6c daddrv6=%pI6c state=%s"
> 
> After patched:
> # cat /sys/kernel/debug/tracing/events/tcp/tcp_retransmit_skb/format
> field:const void * skbaddr; offset:8; size:8; signed:0;
> field:const void * skaddr; offset:16; size:8; signed:0;
> field:int state; offset:24; size:4; signed:1;
> field:__u16 sport; offset:28; size:2; signed:0;
> field:__u16 dport; offset:30; size:2; signed:0;
> field:__u16 family; offset:32; size:2; signed:0;
> field:__u8 saddr[4]; offset:34; size:4; signed:0;
> field:__u8 daddr[4]; offset:38; size:4; signed:0;
> field:__u8 saddr_v6[16]; offset:42; size:16; signed:0;
> field:__u8 daddr_v6[16]; offset:58; size:16; signed:0;
> field:int err;  offset:76;      size:4; signed:1;
> 
> print fmt: "skbaddr=%p skaddr=%p family=%s sport=%hu dport=%hu saddr=%pI4 daddr=%pI4 saddrv6=%pI6c daddrv6=%pI6c state=%s err=%d"
> 
> Change Log
> =========
> v4->v5:
> Some fixes according to
> https://lore.kernel.org/all/20250715072058.12f343bb@kernel.org/
> 1. Instead of introducing new TCP_RETRANS_* enums, directly
> passing the retval to the tracepoint.
> 
> v3->v4:
> Some fixes according to
> https://lore.kernel.org/all/CANn89i+JGSt=_CtWfhDXypWW-34a6SoP3RAzWQ9B9VL4+PHjDw@mail.gmail.com/
> 1. Consolidate ENOMEMs into a unified TCP_RETRANS_NOMEM.
> 
> v2->v3:
> Some fixes according to
> https://lore.kernel.org/all/CANn89iJvyYjiweCESQL8E-Si7M=gosYvh1BAVWwAWycXW8GSdg@mail.gmail.com/
> 1. Rename "quit_reason" to "result". Also, keep "key=val" format concise(no space in vals).
> 
> v1->v2:
> Some fixes according to
> https://lore.kernel.org/all/CANn89iK-6kT-ZUpNRMjPY9_TkQj-dLuKrDQtvO1140q4EUsjFg@mail.gmail.com/
> 1.Rename TCP_RETRANS_QUIT_UNDEFINED to TCP_RETRANS_ERR_DEFAULT.
> 2.Added detailed compatibility consequences section.
> 
> Suggested-by: Jakub Kicinski <kuba@...nel.org>
> Suggested-by: Eric Dumazet <edumazet@...gle.com>
> Co-developed-by: xu xin <xu.xin16@....com.cn>
> Signed-off-by: xu xin <xu.xin16@....com.cn>
> Signed-off-by: Fan Yu <fan.yu9@....com.cn>
> ---

Please add changelog here so it will disappear when the patch is merged.


> include/trace/events/tcp.h | 27 ++++++++--------------
> net/ipv4/tcp_output.c      | 46 ++++++++++++++++++++++++--------------
> 2 files changed, 38 insertions(+), 35 deletions(-)
> 
> diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h
> index 54e60c6009e3..9d2c36c6a0ed 100644
> --- a/include/trace/events/tcp.h
> +++ b/include/trace/events/tcp.h
> @@ -13,17 +13,11 @@
> #include <linux/sock_diag.h>
> #include <net/rstreason.h>
> 
> -/*
> - * tcp event with arguments sk and skb
> - *
> - * Note: this class requires a valid sk pointer; while skb pointer could
> - *       be NULL.
> - */
> -DECLARE_EVENT_CLASS(tcp_event_sk_skb,
> +TRACE_EVENT(tcp_retransmit_skb,
> 
> -	TP_PROTO(const struct sock *sk, const struct sk_buff *skb),
> +	TP_PROTO(const struct sock *sk, const struct sk_buff *skb, int err),
> 
> -	TP_ARGS(sk, skb),
> +	TP_ARGS(sk, skb, err),
> 
>  	TP_STRUCT__entry(
>  		__field(const void *, skbaddr)
> @@ -36,6 +30,7 @@ DECLARE_EVENT_CLASS(tcp_event_sk_skb,
>  		__array(__u8, daddr, 4)
>  		__array(__u8, saddr_v6, 16)
>  		__array(__u8, daddr_v6, 16)
> +		__field(int, err)
>  	),
> 
>  	TP_fast_assign(
> @@ -58,21 +53,17 @@ DECLARE_EVENT_CLASS(tcp_event_sk_skb,
> 
>  		TP_STORE_ADDRS(__entry, inet->inet_saddr, inet->inet_daddr,
>  			sk->sk_v6_rcv_saddr, sk->sk_v6_daddr);
> +
> +		__entry->err = err;
>  	),
> 
> -	TP_printk("skbaddr=%p skaddr=%p family=%s sport=%hu dport=%hu saddr=%pI4 daddr=%pI4 saddrv6=%pI6c daddrv6=%pI6c state=%s",
> +	TP_printk("skbaddr=%p skaddr=%p family=%s sport=%hu dport=%hu saddr=%pI4 daddr=%pI4 saddrv6=%pI6c daddrv6=%pI6c state=%s err=%d",
>  		__entry->skbaddr, __entry->skaddr,
>  		show_family_name(__entry->family),
>  		__entry->sport, __entry->dport, __entry->saddr, __entry->daddr,
>  		__entry->saddr_v6, __entry->daddr_v6,
> -		  show_tcp_state_name(__entry->state))
> -);
> -
> -DEFINE_EVENT(tcp_event_sk_skb, tcp_retransmit_skb,
> -
> -	TP_PROTO(const struct sock *sk, const struct sk_buff *skb),
> -
> -	TP_ARGS(sk, skb)
> +		  show_tcp_state_name(__entry->state),
> +		  __entry->err)
> );
> 
> #undef FN
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index b616776e3354..c31e164693d5 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -3330,8 +3330,10 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb, int segs)
>  	if (icsk->icsk_mtup.probe_size)
>  		icsk->icsk_mtup.probe_size = 0;
> 
> -	if (skb_still_in_host_queue(sk, skb))
> -		return -EBUSY;
> +	if (skb_still_in_host_queue(sk, skb)) {
> +		err = -EBUSY;
> +		goto out;
> +	}
> 
> start:
>  	if (before(TCP_SKB_CB(skb)->seq, tp->snd_una)) {
> @@ -3342,14 +3344,19 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb, int segs)
>  		}
>  		if (unlikely(before(TCP_SKB_CB(skb)->end_seq, tp->snd_una))) {
>  			WARN_ON_ONCE(1);
> -			return -EINVAL;
> +			err = -EINVAL;
> +			goto out;
> +		}
> +		if (tcp_trim_head(sk, skb, tp->snd_una - TCP_SKB_CB(skb)->seq)) {
> +			err = -ENOMEM;
> +			goto out;
>  		}
> -		if (tcp_trim_head(sk, skb, tp->snd_una - TCP_SKB_CB(skb)->seq))
> -			return -ENOMEM;
>  	}
> 
> -	if (inet_csk(sk)->icsk_af_ops->rebuild_header(sk))
> -		return -EHOSTUNREACH; /* Routing failure or similar. */
> +	if (inet_csk(sk)->icsk_af_ops->rebuild_header(sk)) {
> +		err = -EHOSTUNREACH; /* Routing failure or similar. */
> +		goto out;
> +	}
> 
>  	cur_mss = tcp_current_mss(sk);
>  	avail_wnd = tcp_wnd_end(tp) - TCP_SKB_CB(skb)->seq;
> @@ -3360,8 +3367,10 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb, int segs)
>  	* our retransmit of one segment serves as a zero window probe.
>  	*/
>  	if (avail_wnd <= 0) {
> -		if (TCP_SKB_CB(skb)->seq != tp->snd_una)
> -			return -EAGAIN;
> +		if (TCP_SKB_CB(skb)->seq != tp->snd_una) {
> +			err = -EAGAIN;
> +			goto out;
> +		}
>  		avail_wnd = cur_mss;
>  	}
> 
> @@ -3373,11 +3382,15 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb, int segs)
>  	}
>  	if (skb->len > len) {
>  		if (tcp_fragment(sk, TCP_FRAG_IN_RTX_QUEUE, skb, len,
> -				 cur_mss, GFP_ATOMIC))
> -			return -ENOMEM; /* We&apos;ll try again later. */
> +				 cur_mss, GFP_ATOMIC)) {
> +			err = -ENOMEM;  /* We&apos;ll try again later. */

This chunk is corrupted by "'" conversion.
Please check your email client config.



> +			goto out;
> +		}
>  	} else {
> -		if (skb_unclone_keeptruesize(skb, GFP_ATOMIC))
> -			return -ENOMEM;
> +		if (skb_unclone_keeptruesize(skb, GFP_ATOMIC)) {
> +			err = -ENOMEM;
> +			goto out;
> +		}
> 
>  		diff = tcp_skb_pcount(skb);
>  		tcp_set_skb_tso_segs(skb, cur_mss);
> @@ -3431,17 +3444,16 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb, int segs)
>  		tcp_call_bpf_3arg(sk, BPF_SOCK_OPS_RETRANS_CB,
>  				TCP_SKB_CB(skb)->seq, segs, err);
> 
> -	if (likely(!err)) {
> -		trace_tcp_retransmit_skb(sk, skb);
> -	} else if (err != -EBUSY) {
> +	if (err != -EBUSY)
>  		NET_ADD_STATS(sock_net(sk), LINUX_MIB_TCPRETRANSFAIL, segs);

Now this is counted when !err.


> -	}
> 
>  	/* To avoid taking spuriously low RTT samples based on a timestamp
>  	* for a transmit that never happened, always mark EVER_RETRANS
>  	*/
>  	TCP_SKB_CB(skb)->sacked |= TCPCB_EVER_RETRANS;
> 
> +out:
> +	trace_tcp_retransmit_skb(sk, skb, err);
>  	return err;
> }
> 
> --
> 2.25.1
> 

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ