[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250715123532715Qrm78AnS47Fztgj_NXs20@zte.com.cn>
Date: Tue, 15 Jul 2025 12:35:32 +0800 (CST)
From: <fan.yu9@....com.cn>
To: <kuba@...nel.org>
Cc: <edumazet@...gle.com>, <kuniyu@...zon.com>, <ncardwell@...gle.com>,
<davem@...emloft.net>, <netdev@...r.kernel.org>,
<linux-kernel@...r.kernel.org>, <linux-trace-kernel@...r.kernel.org>,
<yang.yang29@....com.cn>, <xu.xin16@....com.cn>,
<tu.qiang35@....com.cn>, <jiang.kun2@....com.cn>, <pabeni@...hat.com>,
<horms@...nel.org>
Subject: Re: [PATCH net-next v4] tcp: extend tcp_retransmit_skb tracepoint with failure reasons
> > Solution
> > ========
> > Adds a "result" field to the tcp_retransmit_skb tracepoint,
> > enumerating with explicit failure cases:
> > TCP_RETRANS_ERR_DEFAULT (retransmit terminate unexpectedly)
> > TCP_RETRANS_IN_HOST_QUEUE (packet still queued in driver)
> > TCP_RETRANS_END_SEQ_ERROR (invalid end sequence)
> > TCP_RETRANS_NOMEM (retransmit no memory)
> > TCP_RETRANS_ROUTE_FAIL (routing failure)
> > TCP_RETRANS_RCV_ZERO_WINDOW (closed receiver window)
>
> Have you tried to use this or perform some analysis of which of these
> reasons actually make sense to add? I'd venture a guess that
> IN_HOST_QUEUE will dominate in datacenter. Maybe RCV_ZERO_WINDOW
> can happen. Tracing ENOMEM is a waste of time, so is this:
Hi Jakub,
Thanks for the feedback. This patch was motivated by a issue
where TCP retransmissions were silently failing, and we eventually
traced the problem to packets stuck in the driver queue(IN_HOST_QUEUE).
However, the current tcp_retransmit_skb tracepoint lacks visibility
into why retransmissions fail. Our goal is to to make tcp_retransmit_skb
more useful by not only tracking retransmission attempts but also their
outcomes. This aligns with the tracepoint’s original purpose but adds
actionable diagnostics.
>
> if (unlikely(before(TCP_SKB_CB(skb)->end_seq, tp->snd_una))) {
> >>>>> WARN_ON_ONCE(1); <<<<<<<<
> - return -EINVAL;
> + result = TCP_RETRANS_END_SEQ_ERROR;
I agree that some of the result types (e.g., ENOMEM, END_SEQ_ERROR)
may be redundant or unlikely in practice. If we focus only on the most
critical cases, would the following subset be more acceptable?
- TCP_RETRANS_FAIL_QUEUED (packet stuck in host/driver queue)
- TCP_RETRANS_FAIL_ZERO_WINDOW (receiver window closed)
- TCP_RETRANS_FAIL_ROUTE (routing issues)
- TCP_RETRANS_FAIL_DEFAULT (catch-all for unexpected failures)
Best regards.
Original
From: JakubKicinski <kuba@...nel.org>
To: 范雨10344752;
Cc: edumazet@...gle.com <edumazet@...gle.com>;kuniyu@...zon.com <kuniyu@...zon.com>;ncardwell@...gle.com <ncardwell@...gle.com>;davem@...emloft.net <davem@...emloft.net>;netdev@...r.kernel.org <netdev@...r.kernel.org>;linux-kernel@...r.kernel.org <linux-kernel@...r.kernel.org>;linux-trace-kernel@...r.kernel.org <linux-trace-kernel@...r.kernel.org>;杨洋10192021;徐鑫10311587;涂强10171646;蒋昆10222859;
Date: 2025年07月15日 07:46
Subject: Re: [PATCH net-next v4] tcp: extend tcp_retransmit_skb tracepoint with failure reasons
On Thu, 10 Jul 2025 10:01:38 +0800 (CST) fan.yu9@....com.cn wrote:
> Background
> ==========
> When TCP retransmits a packet due to missing ACKs, the
> retransmission may fail for various reasons (e.g., packets
> stuck in driver queues, sequence errors, or routing issues).
>
> The original tcp_retransmit_skb tracepoint:
> 'commit e086101b150a ("tcp: add a tracepoint for tcp retransmission")'
> lacks visibility into these failure causes, making production
> diagnostics difficult.
>
> Solution
> ========
> Adds a "result" field to the tcp_retransmit_skb tracepoint,
> enumerating with explicit failure cases:
> TCP_RETRANS_ERR_DEFAULT (retransmit terminate unexpectedly)
> TCP_RETRANS_IN_HOST_QUEUE (packet still queued in driver)
> TCP_RETRANS_END_SEQ_ERROR (invalid end sequence)
> TCP_RETRANS_NOMEM (retransmit no memory)
> TCP_RETRANS_ROUTE_FAIL (routing failure)
> TCP_RETRANS_RCV_ZERO_WINDOW (closed receiver window)
Have you tried to use this or perform some analysis of which of these
reasons actually make sense to add? I'd venture a guess that
IN_HOST_QUEUE will dominate in datacenter. Maybe RCV_ZERO_WINDOW
can happen. Tracing ENOMEM is a waste of time, so is this:
if (unlikely(before(TCP_SKB_CB(skb)->end_seq, tp->snd_una))) {
>>>>> WARN_ON_ONCE(1); <<<<<<<<
- return -EINVAL;
+ result = TCP_RETRANS_END_SEQ_ERROR;
--
pw-bot: cr
<div class="zcontentRow"><p>> > Solution</p><p>> > ========</p><p>> > Adds a "result" field to the tcp_retransmit_skb tracepoint,</p><p>> > enumerating with explicit failure cases:</p><p>> > TCP_RETRANS_ERR_DEFAULT (retransmit terminate unexpectedly)</p><p>> > TCP_RETRANS_IN_HOST_QUEUE (packet still queued in driver)</p><p>> > TCP_RETRANS_END_SEQ_ERROR (invalid end sequence)</p><p>> > TCP_RETRANS_NOMEM (retransmit no memory)</p><p>> > TCP_RETRANS_ROUTE_FAIL (routing failure)</p><p>> > TCP_RETRANS_RCV_ZERO_WINDOW (closed receiver window)</p><p>> </p><p>> Have you tried to use this or perform some analysis of which of these</p><p>> reasons actually make sense to add? I'd venture a guess that</p><p>> IN_HOST_QUEUE will dominate in datacenter. Maybe RCV_ZERO_WINDOW</p><p>> can happen. Tracing ENOMEM is a waste of time, so is this:</p><p>Hi Jakub,</p><p>Thanks for the feedback. This patch was motivated by a issue</p><p>where TCP retransmissions were silently failing, and we eventually</p><p>traced the problem to packets stuck in the driver queue(IN_HOST_QUEUE).</p><p><br></p><p>However, the current tcp_retransmit_skb tracepoint lacks visibility</p><p>into why retransmissions fail. Our goal is to to make tcp_retransmit_skb</p><p>more useful by not only tracking retransmission attempts but also their</p><p>outcomes. This aligns with the tracepoint’s original purpose but adds</p><p>actionable diagnostics.</p><p><br></p><p>> </p><p>> if (unlikely(before(TCP_SKB_CB(skb)->end_seq, tp->snd_una))) {</p><p>> >>>>> WARN_ON_ONCE(1); <<<<<<<<</p><p>> - return -EINVAL;</p><p>> + result = TCP_RETRANS_END_SEQ_ERROR;</p><p><br></p><p>I agree that some of the result types (e.g., ENOMEM, END_SEQ_ERROR)</p><p>may be redundant or unlikely in practice. If we focus only on the most</p><p>critical cases, would the following subset be more acceptable?</p><p>- TCP_RETRANS_FAIL_QUEUED (packet stuck in host/driver queue)</p><p>- TCP_RETRANS_FAIL_ZERO_WINDOW (receiver window closed)</p><p>- TCP_RETRANS_FAIL_ROUTE (routing issues)</p><p>- TCP_RETRANS_FAIL_DEFAULT (catch-all for unexpected failures)</p><p><br></p><p>Best regards.</p><p style="font-size:14px;font-family:微软雅黑,Microsoft YaHei;"><br></p><div unonameen="Fan Yu10344752" unonamech="范雨10344752" class="zMailSign"></div><div class="zhistoryRow" style="display:block"><div class="zhistoryDes" style="width: 100%; height: 28px; line-height: 28px; background-color: #E0E5E9; color: #1388FF; text-align: center;">Original</div><div id="zwriteHistoryContainer"><div class="control-group zhistoryPanel"><div class="zhistoryHeader" style="padding: 8px; background-color: #F5F6F8;"><div><strong>From: </strong><span class="zreadUserName">JakubKicinski <kuba@...nel.org></span></div><div><strong>To: </strong><span class="zreadUserName" style="display: inline;">范雨10344752;</span></div><div><strong>Cc: </strong><span class="zreadUserName" style="display: inline;">edumazet@...gle.com <edumazet@...gle.com>;</span><span class="zreadUserName" style="display: inline;">kuniyu@...zon.com <kuniyu@...zon.com>;</span><span class="zreadUserName" style="display: inline;">ncardwell@...gle.com <ncardwell@...gle.com>;</span><span class="zreadUserName" style="display: inline;">davem@...emloft.net <davem@...emloft.net>;</span><span class="zreadUserName" style="display: inline;">netdev@...r.kernel.org <netdev@...r.kernel.org>;</span><span class="zreadUserName" style="display: inline;">linux-kernel@...r.kernel.org <linux-kernel@...r.kernel.org>;</span><span class="zreadUserName" style="display: inline;">linux-trace-kernel@...r.kernel.org <linux-trace-kernel@...r.kernel.org>;</span><span class="zreadUserName" style="display: inline;">杨洋10192021;</span><span class="zreadUserName" style="display: inline;">徐鑫10311587;</span><span class="zreadUserName" style="display: inline;">涂强10171646;</span><span class="zreadUserName" style="display: inline;">蒋昆10222859;</span></div><div><strong>Date: </strong><span class="">2025年07月15日 07:46</span></div><div><strong>Subject: </strong><span class="zreadTitle"><strong>Re: [PATCH net-next v4] tcp: extend tcp_retransmit_skb tracepoint with failure reasons</strong></span></div></div><div class="zhistoryContent">On Thu, 10 Jul 2025 10:01:38 +0800 (CST) fan.yu9@....com.cn wrote:<br>> Background<br>> ==========<br>> When TCP retransmits a packet due to missing ACKs, the<br>> retransmission may fail for various reasons (e.g., packets<br>> stuck in driver queues, sequence errors, or routing issues).<br>> <br>> The original tcp_retransmit_skb tracepoint:<br>> 'commit e086101b150a ("tcp: add a tracepoint for tcp retransmission")' <br>> lacks visibility into these failure causes, making production<br>> diagnostics difficult.<br>> <br>> Solution<br>> ========<br>> Adds a "result" field to the tcp_retransmit_skb tracepoint,<br>> enumerating with explicit failure cases:<br>> TCP_RETRANS_ERR_DEFAULT (retransmit terminate unexpectedly)<br>> TCP_RETRANS_IN_HOST_QUEUE (packet still queued in driver)<br>> TCP_RETRANS_END_SEQ_ERROR (invalid end sequence)<br>> TCP_RETRANS_NOMEM (retransmit no memory)<br>> TCP_RETRANS_ROUTE_FAIL (routing failure)<br>> TCP_RETRANS_RCV_ZERO_WINDOW (closed receiver window)<br> <br>Have you tried to use this or perform some analysis of which of these<br>reasons actually make sense to add? I'd venture a guess that<br>IN_HOST_QUEUE will dominate in datacenter. Maybe RCV_ZERO_WINDOW<br>can happen. Tracing ENOMEM is a waste of time, so is this:<br> <br> if (unlikely(before(TCP_SKB_CB(skb)->end_seq, tp->snd_una))) {<br> >>>>> WARN_ON_ONCE(1); <<<<<<<< <br>- return -EINVAL;<br>+ result = TCP_RETRANS_END_SEQ_ERROR;<br>-- <br>pw-bot: cr</div></div></div></div><p><br></p></div>
Powered by blists - more mailing lists