linux-kernel - Re: [PATCH] net: set is_cwnd_limited when the small queue check fails

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <befd947e-8725-4637-8fac-6a364b0b4df0.peng.yu@alibaba-inc.com>
Date: Mon, 20 Oct 2025 07:00:07 +0800
From: "YU, Peng" <peng.yu@...baba-inc.com>
To: "Eric Dumazet" <edumazet@...gle.com>,
  "Peng Yu" <yupeng0921@...il.com>
Cc: "ncardwell" <ncardwell@...gle.com>,
  "kuniyu" <kuniyu@...gle.com>,
  "netdev" <netdev@...r.kernel.org>,
  "linux-kernel" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] net: set is_cwnd_limited when the small queue check fails

I think we know the root cause in the driver. We are using the
virtio_net driver. We found that the issue happens after this driver
commit:

b92f1e6751a6 virtio-net: transmit napi

According to our test, the issue will happen if we apply below change:


 static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
 {
        struct virtio_net_hdr_mrg_rxbuf *hdr;
@@ -1130,6 +1174,7 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
        int err;
        struct netdev_queue *txq = netdev_get_tx_queue(dev, qnum);
        bool kick = !skb->xmit_more;
+       bool use_napi = sq->napi.weight;

        /* Free up any pending old buffers before queueing new ones. */
        free_old_xmit_skbs(sq);
@@ -1152,8 +1197,10 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
        }

        /* Don't wait up for transmitted skbs to be freed. */
-       skb_orphan(skb);
-       nf_reset(skb);
+       if (!use_napi) {
+               skb_orphan(skb);
+               nf_reset(skb);
+       }


Before this change, the driver will invoke skb_orphan immediately when
it receives a skb, then the tcp layer will decrease the wmem_alloc.
Thus the small queue check won't fail. After applying this change, the
virtio_net driver will tell tcp layer to decrease the wmem_alloc when
the skp is really sent out.
If we set use_napi to false, the virtio_net driver will invoke
skb_orphan immediately as before, then the issue won't happen.
But invoking skb_orphan in start_xmit looks like a workaround to me,
I'm not sure if we should rollback this change.  The small queue check
and cwnd window would come into a kind of "dead lock" situation to me,
so I suppose we should fix that "dead lock".  If you believe we
shouldn't change TCP layer for this issue, may I know the correct
direction to resolve this issue? Should we modify the virtio_net
driver, let it always invoke skb_orphan as before?
As a workaround, we set the virtio_net module parameter napi_tx to
false, then the use_napi would be false too. Thus the issue won't
happen. But we indeed want to enable napi_tx, so may I know what's
your suggestion about this issue?


------------------------------------------------------------------
From:Eric Dumazet <edumazet@...gle.com>
Send Time:2025 Oct. 20 (Mon.) 01:43
To:Peng Yu<yupeng0921@...il.com>
CC:ncardwell<ncardwell@...gle.com>; kuniyu<kuniyu@...gle.com>; netdev<netdev@...r.kernel.org>; "linux-kernel"<linux-kernel@...r.kernel.org>; Peng YU<peng.yu@...baba-inc.com>
Subject:Re: [PATCH] net: set is_cwnd_limited when the small queue check fails


On Sun, Oct 19, 2025 at 10:00 AM Peng Yu <yupeng0921@...il.com> wrote:
>
> The limit of the small queue check is calculated from the pacing rate,
> the pacing rate is calculated from the cwnd. If the cwnd is small,
> the small queue check may fail.
> When the samll queue check fails, the tcp layer will send less
> packages, then the tcp_is_cwnd_limited would alreays return false,
> then the cwnd would have no chance to get updated.
> The cwnd has no chance to get updated, it keeps small, then the pacing
> rate keeps small, and the limit of the small queue check keeps small,
> then the small queue check would always fail.
> It is a kind of dead lock, when a tcp flow comes into this situation,
> it's throughput would be very small, obviously less then the correct
> throughput it should have.
> We set is_cwnd_limited to true when the small queue check fails, then
> the cwnd would have a chance to get updated, then we can break this
> deadlock.
>
> Below ss output shows this issue:
>
> skmem:(r0,rb131072,
> t7712, <------------------------------ wmem_alloc = 7712
> tb243712,f2128,w219056,o0,bl0,d0)
> ts sack cubic wscale:7,10 rto:224 rtt:23.364/0.019 ato:40 mss:1448
> pmtu:8500 rcvmss:536 advmss:8448
> cwnd:28 <------------------------------ cwnd=28
> bytes_sent:2166208 bytes_acked:2148832 bytes_received:37
> segs_out:1497 segs_in:751 data_segs_out:1496 data_segs_in:1
> send 13882554bps lastsnd:7 lastrcv:2992 lastack:7
> pacing_rate 27764216bps <--------------------- pacing_rate=27764216bps
> delivery_rate 5786688bps delivered:1485 busy:2991ms unacked:12
> rcv_space:57088 rcv_ssthresh:57088 notsent:188240
> minrtt:23.319 snd_wnd:57088
>
> limit=(27764216 / 8) / 1024 = 3389 < 7712
> So the samll queue check fails. When it happens, the throughput is
> obviously less than the normal situation.
>
> By setting the tcp_is_cwnd_limited to true when the small queue check
> failed, we can avoid this issue, the cwnd could increase to a reasonalbe
> size, in my test environment, it is about 4000. Then the small queue
> check won't fail.


>
> Signed-off-by: Peng Yu <peng.yu@...baba-inc.com>
> ---
>  net/ipv4/tcp_output.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index b94efb3050d2..8c70acf3a060 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2985,8 +2985,10 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
>                     unlikely(tso_fragment(sk, skb, limit, mss_now, gfp)))
>                         break;
>
> -               if (tcp_small_queue_check(sk, skb, 0))
> +               if (tcp_small_queue_check(sk, skb, 0)) {
> +                       is_cwnd_limited = true;
>                         break;
> +               }
>
>                 /* Argh, we hit an empty skb(), presumably a thread
>                  * is sleeping in sendmsg()/sk_stream_wait_memory().
> --
> 2.47.3

Sorry this makes no sense to me.  CWND_LIMITED should not be hijacked.

Something else is preventing your flows to get to nominal speed,
because we have not seen anything like that.

It is probably a driver issue or a receive side issue : Instead of
trying to work around the issue, please root cause it.