Message-ID: <20250714164625.788f7044@kernel.org>
Date: Mon, 14 Jul 2025 16:46:25 -0700
From: Jakub Kicinski <kuba@...nel.org>
To: <fan.yu9@....com.cn>
Cc: <edumazet@...gle.com>, <kuniyu@...zon.com>, <ncardwell@...gle.com>,
 <davem@...emloft.net>, <netdev@...r.kernel.org>,
 <linux-kernel@...r.kernel.org>, <linux-trace-kernel@...r.kernel.org>,
 <yang.yang29@....com.cn>, <xu.xin16@....com.cn>, <tu.qiang35@....com.cn>,
 <jiang.kun2@....com.cn>
Subject: Re: [PATCH net-next v4] tcp: extend tcp_retransmit_skb tracepoint
 with failure reasons

On Thu, 10 Jul 2025 10:01:38 +0800 (CST) fan.yu9@....com.cn wrote:
> Background
> ==========
> When TCP retransmits a packet due to missing ACKs, the
> retransmission may fail for various reasons (e.g., packets
> stuck in driver queues, sequence errors, or routing issues).
> 
> The original tcp_retransmit_skb tracepoint:
> 'commit e086101b150a ("tcp: add a tracepoint for tcp retransmission")'
> lacks visibility into these failure causes, making production
> diagnostics difficult.
> 
> Solution
> ========
> Add a "result" field to the tcp_retransmit_skb tracepoint,
> enumerating explicit failure cases:
> TCP_RETRANS_ERR_DEFAULT (retransmit terminated unexpectedly)
> TCP_RETRANS_IN_HOST_QUEUE (packet still queued in driver)
> TCP_RETRANS_END_SEQ_ERROR (invalid end sequence)
> TCP_RETRANS_NOMEM (out of memory during retransmit)
> TCP_RETRANS_ROUTE_FAIL (routing failure)
> TCP_RETRANS_RCV_ZERO_WINDOW (closed receiver window)

Have you tried using this, or analyzed which of these reasons
actually make sense to add? I'd venture a guess that IN_HOST_QUEUE
will dominate in the datacenter. Maybe RCV_ZERO_WINDOW can happen.
Tracing ENOMEM is a waste of time, and so is this:

 		if (unlikely(before(TCP_SKB_CB(skb)->end_seq, tp->snd_una))) {
            >>>>>	WARN_ON_ONCE(1);  <<<<<<<<
-			return -EINVAL;
+			result = TCP_RETRANS_END_SEQ_ERROR;
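
For readers unfamiliar with the check in the hunk above, here is a minimal
user-space sketch of the wraparound-safe comparison that before() performs,
and of how the proposed result code would be derived from it. This is an
illustration only, not the patch's code: seq_before(), check_end_seq(),
seq_check_demo() and RETRANS_OK are hypothetical names, and the enum values
are assumed. Only TCP_RETRANS_END_SEQ_ERROR appears in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values; only TCP_RETRANS_END_SEQ_ERROR is named in the patch. */
enum retrans_result {
	RETRANS_OK = 0,			/* hypothetical "no error" code */
	TCP_RETRANS_END_SEQ_ERROR,	/* skb ends before snd_una */
};

/*
 * True if seq1 precedes seq2 in 32-bit sequence space. The signed
 * difference keeps the comparison correct across wraparound.
 */
static bool seq_before(uint32_t seq1, uint32_t seq2)
{
	return (int32_t)(seq1 - seq2) < 0;
}

/* Hypothetical helper: classify an skb's end_seq against snd_una. */
static enum retrans_result check_end_seq(uint32_t end_seq, uint32_t snd_una)
{
	return seq_before(end_seq, snd_una) ? TCP_RETRANS_END_SEQ_ERROR
					    : RETRANS_OK;
}

/* Small self-check covering the plain and the wrapped case. */
static void seq_check_demo(void)
{
	assert(check_end_seq(100, 200) == TCP_RETRANS_END_SEQ_ERROR);
	assert(check_end_seq(200, 100) == RETRANS_OK);
	/* end_seq has wrapped past zero, but still lies after snd_una */
	assert(check_end_seq(0x10u, 0xfffffff0u) == RETRANS_OK);
}
```

In the kernel this branch already fires WARN_ON_ONCE(), so the condition is
reported without any tracepoint field.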
-- 
pw-bot: cr