netdev - Re: [PATCH] net: set is_cwnd_limited when the small queue check fails

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAL+tcoB3veWSGcsd6Dw=3Z8SsUYyO4Os4fUh7Cy=87JVzeC3sw@mail.gmail.com>
Date: Mon, 20 Oct 2025 12:02:48 +0800
From: Jason Xing <kerneljasonxing@...il.com>
To: "YU, Peng" <peng.yu@...baba-inc.com>
Cc: Eric Dumazet <edumazet@...gle.com>, Peng Yu <yupeng0921@...il.com>, 
	ncardwell <ncardwell@...gle.com>, kuniyu <kuniyu@...gle.com>, 
	netdev <netdev@...r.kernel.org>, linux-kernel <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] net: set is_cwnd_limited when the small queue check fails

On Mon, Oct 20, 2025 at 7:00 AM YU, Peng <peng.yu@...baba-inc.com> wrote:
>
> I think we know the root cause in the driver. We are using the
> virtio_net driver. We found that the issue happens after this driver
> commit:
>
> b92f1e6751a6 virtio-net: transmit napi
>
> According to our test, the issue will happen if we apply below change:
>
>
>  static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
>  {
>         struct virtio_net_hdr_mrg_rxbuf *hdr;
> @@ -1130,6 +1174,7 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
>         int err;
>         struct netdev_queue *txq = netdev_get_tx_queue(dev, qnum);
>         bool kick = !skb->xmit_more;
> +       bool use_napi = sq->napi.weight;
>
>         /* Free up any pending old buffers before queueing new ones. */
>         free_old_xmit_skbs(sq);
> @@ -1152,8 +1197,10 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
>         }
>
>         /* Don't wait up for transmitted skbs to be freed. */
> -       skb_orphan(skb);
> -       nf_reset(skb);
> +       if (!use_napi) {
> +               skb_orphan(skb);
> +               nf_reset(skb);
> +       }
>
>
> Before this change, the driver will invoke skb_orphan immediately when
> it receives a skb, then the tcp layer will decrease the wmem_alloc.
> Thus the small queue check won't fail. After applying this change, the
> virtio_net driver will tell tcp layer to decrease the wmem_alloc when
> the skp is really sent out.
> If we set use_napi to false, the virtio_net driver will invoke
> skb_orphan immediately as before, then the issue won't happen.

Very classic and annoying issues that I believe not few people have
encountered... We eventually resorted to orphan mode just as you did,
FYI.

But as to fixing it thoroughly, I agree with Eric's thoughts.

> But invoking skb_orphan in start_xmit looks like a workaround to me,
> I'm not sure if we should rollback this change.  The small queue check
> and cwnd window would come into a kind of "dead lock" situation to me,
> so I suppose we should fix that "dead lock".  If you believe we
> shouldn't change TCP layer for this issue, may I know the correct
> direction to resolve this issue? Should we modify the virtio_net
> driver, let it always invoke skb_orphan as before?

Have you ever noticed the issue cannot be reliably reproduced? It's
only randomly happening on some VMs without any reason. Once I suspect
the policy/logic from the host side might have some problems, like too
slow frequency of triggering IRQ.

Thanks,
Jason

> As a workaround, we set the virtio_net module parameter napi_tx to
> false, then the use_napi would be false too. Thus the issue won't
> happen. But we indeed want to enable napi_tx, so may I know what's
> your suggestion about this issue?
>
>
> ------------------------------------------------------------------
> From:Eric Dumazet <edumazet@...gle.com>
> Send Time:2025 Oct. 20 (Mon.) 01:43
> To:Peng Yu<yupeng0921@...il.com>
> CC:ncardwell<ncardwell@...gle.com>; kuniyu<kuniyu@...gle.com>; netdev<netdev@...r.kernel.org>; "linux-kernel"<linux-kernel@...r.kernel.org>; Peng YU<peng.yu@...baba-inc.com>
> Subject:Re: [PATCH] net: set is_cwnd_limited when the small queue check fails
>
>
> On Sun, Oct 19, 2025 at 10:00 AM Peng Yu <yupeng0921@...il.com> wrote:
> >
> > The limit of the small queue check is calculated from the pacing rate,
> > the pacing rate is calculated from the cwnd. If the cwnd is small,
> > the small queue check may fail.
> > When the samll queue check fails, the tcp layer will send less
> > packages, then the tcp_is_cwnd_limited would alreays return false,
> > then the cwnd would have no chance to get updated.
> > The cwnd has no chance to get updated, it keeps small, then the pacing
> > rate keeps small, and the limit of the small queue check keeps small,
> > then the small queue check would always fail.
> > It is a kind of dead lock, when a tcp flow comes into this situation,
> > it's throughput would be very small, obviously less then the correct
> > throughput it should have.
> > We set is_cwnd_limited to true when the small queue check fails, then
> > the cwnd would have a chance to get updated, then we can break this
> > deadlock.
> >
> > Below ss output shows this issue:
> >
> > skmem:(r0,rb131072,
> > t7712, <------------------------------ wmem_alloc = 7712
> > tb243712,f2128,w219056,o0,bl0,d0)
> > ts sack cubic wscale:7,10 rto:224 rtt:23.364/0.019 ato:40 mss:1448
> > pmtu:8500 rcvmss:536 advmss:8448
> > cwnd:28 <------------------------------ cwnd=28
> > bytes_sent:2166208 bytes_acked:2148832 bytes_received:37
> > segs_out:1497 segs_in:751 data_segs_out:1496 data_segs_in:1
> > send 13882554bps lastsnd:7 lastrcv:2992 lastack:7
> > pacing_rate 27764216bps <--------------------- pacing_rate=27764216bps
> > delivery_rate 5786688bps delivered:1485 busy:2991ms unacked:12
> > rcv_space:57088 rcv_ssthresh:57088 notsent:188240
> > minrtt:23.319 snd_wnd:57088
> >
> > limit=(27764216 / 8) / 1024 = 3389 < 7712
> > So the samll queue check fails. When it happens, the throughput is
> > obviously less than the normal situation.
> >
> > By setting the tcp_is_cwnd_limited to true when the small queue check
> > failed, we can avoid this issue, the cwnd could increase to a reasonalbe
> > size, in my test environment, it is about 4000. Then the small queue
> > check won't fail.
>
>
> >
> > Signed-off-by: Peng Yu <peng.yu@...baba-inc.com>
> > ---
> >  net/ipv4/tcp_output.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > index b94efb3050d2..8c70acf3a060 100644
> > --- a/net/ipv4/tcp_output.c
> > +++ b/net/ipv4/tcp_output.c
> > @@ -2985,8 +2985,10 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
> >                     unlikely(tso_fragment(sk, skb, limit, mss_now, gfp)))
> >                         break;
> >
> > -               if (tcp_small_queue_check(sk, skb, 0))
> > +               if (tcp_small_queue_check(sk, skb, 0)) {
> > +                       is_cwnd_limited = true;
> >                         break;
> > +               }
> >
> >                 /* Argh, we hit an empty skb(), presumably a thread
> >                  * is sleeping in sendmsg()/sk_stream_wait_memory().
> > --
> > 2.47.3
>
> Sorry this makes no sense to me.  CWND_LIMITED should not be hijacked.
>
> Something else is preventing your flows to get to nominal speed,
> because we have not seen anything like that.
>
> It is probably a driver issue or a receive side issue : Instead of
> trying to work around the issue, please root cause it.