netdev - Re: [PATCH] net: set is_cwnd_limited when the small queue check fails

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CAL+tcoD8akZPp87CFnc2d98eZGvXheNeUEJBZcJEaqHZst4DFQ@mail.gmail.com>
Date: Mon, 20 Oct 2025 12:04:10 +0800
From: Jason Xing <kerneljasonxing@...il.com>
To: Eric Dumazet <edumazet@...gle.com>
Cc: "YU, Peng" <peng.yu@...baba-inc.com>, Peng Yu <yupeng0921@...il.com>, 
	ncardwell <ncardwell@...gle.com>, kuniyu <kuniyu@...gle.com>, 
	netdev <netdev@...r.kernel.org>, linux-kernel <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] net: set is_cwnd_limited when the small queue check fails

On Mon, Oct 20, 2025 at 12:00 PM Eric Dumazet <edumazet@...gle.com> wrote:
>
> On Sun, Oct 19, 2025 at 4:00 PM YU, Peng <peng.yu@...baba-inc.com> wrote:
> >
> > I think we know the root cause in the driver. We are using the
> > virtio_net driver. We found that the issue happens after this driver
> > commit:
> >
> > b92f1e6751a6 virtio-net: transmit napi
> >
> > According to our test, the issue will happen if we apply below change:
> >
> >
> >  static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
> >  {
> >         struct virtio_net_hdr_mrg_rxbuf *hdr;
> > @@ -1130,6 +1174,7 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
> >         int err;
> >         struct netdev_queue *txq = netdev_get_tx_queue(dev, qnum);
> >         bool kick = !skb->xmit_more;
> > +       bool use_napi = sq->napi.weight;
> >
> >         /* Free up any pending old buffers before queueing new ones. */
> >         free_old_xmit_skbs(sq);
> > @@ -1152,8 +1197,10 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
> >         }
> >
> >         /* Don't wait up for transmitted skbs to be freed. */
> > -       skb_orphan(skb);
> > -       nf_reset(skb);
> > +       if (!use_napi) {
> > +               skb_orphan(skb);
> > +               nf_reset(skb);
> > +       }
> >
> >
> > Before this change, the driver will invoke skb_orphan immediately when
> > it receives a skb, then the tcp layer will decrease the wmem_alloc.
> > Thus the small queue check won't fail. After applying this change, the
> > virtio_net driver will tell tcp layer to decrease the wmem_alloc when
> > the skp is really sent out.
> > If we set use_napi to false, the virtio_net driver will invoke
> > skb_orphan immediately as before, then the issue won't happen.
> > But invoking skb_orphan in start_xmit looks like a workaround to me,
> > I'm not sure if we should rollback this change.  The small queue check
> > and cwnd window would come into a kind of "dead lock" situation to me,
> > so I suppose we should fix that "dead lock".  If you believe we
> > shouldn't change TCP layer for this issue, may I know the correct
> > direction to resolve this issue? Should we modify the virtio_net
> > driver, let it always invoke skb_orphan as before?
> > As a workaround, we set the virtio_net module parameter napi_tx to
> > false, then the use_napi would be false too. Thus the issue won't
> > happen. But we indeed want to enable napi_tx, so may I know what's
> > your suggestion about this issue?
> >
>
> I think you should start a conversation with virtio_net experts,
> instead of making TCP
> bufferbloated again.
>
> TX completions dynamics are important, and we are not going to
> penalize all drivers
> just because of one.
>
> You are claiming deadlocks, but the mechanisms in place are proven to
> work damn well.

Oh, we're almost at the same time to reply to the thread :)

Right, I'm totally with you.

Thanks,
Jason

>
> >
> > ------------------------------------------------------------------
> > From:Eric Dumazet <edumazet@...gle.com>
> > Send Time:2025 Oct. 20 (Mon.) 01:43
> > To:Peng Yu<yupeng0921@...il.com>
> > CC:ncardwell<ncardwell@...gle.com>; kuniyu<kuniyu@...gle.com>; netdev<netdev@...r.kernel.org>; "linux-kernel"<linux-kernel@...r.kernel.org>; Peng YU<peng.yu@...baba-inc.com>
> > Subject:Re: [PATCH] net: set is_cwnd_limited when the small queue check fails
> >
> >
> > On Sun, Oct 19, 2025 at 10:00 AM Peng Yu <yupeng0921@...il.com> wrote:
> > >
> > > The limit of the small queue check is calculated from the pacing rate,
> > > the pacing rate is calculated from the cwnd. If the cwnd is small,
> > > the small queue check may fail.
> > > When the samll queue check fails, the tcp layer will send less
> > > packages, then the tcp_is_cwnd_limited would alreays return false,
> > > then the cwnd would have no chance to get updated.
> > > The cwnd has no chance to get updated, it keeps small, then the pacing
> > > rate keeps small, and the limit of the small queue check keeps small,
> > > then the small queue check would always fail.
> > > It is a kind of dead lock, when a tcp flow comes into this situation,
> > > it's throughput would be very small, obviously less then the correct
> > > throughput it should have.
> > > We set is_cwnd_limited to true when the small queue check fails, then
> > > the cwnd would have a chance to get updated, then we can break this
> > > deadlock.
> > >
> > > Below ss output shows this issue:
> > >
> > > skmem:(r0,rb131072,
> > > t7712, <------------------------------ wmem_alloc = 7712
> > > tb243712,f2128,w219056,o0,bl0,d0)
> > > ts sack cubic wscale:7,10 rto:224 rtt:23.364/0.019 ato:40 mss:1448
> > > pmtu:8500 rcvmss:536 advmss:8448
> > > cwnd:28 <------------------------------ cwnd=28
> > > bytes_sent:2166208 bytes_acked:2148832 bytes_received:37
> > > segs_out:1497 segs_in:751 data_segs_out:1496 data_segs_in:1
> > > send 13882554bps lastsnd:7 lastrcv:2992 lastack:7
> > > pacing_rate 27764216bps <--------------------- pacing_rate=27764216bps
> > > delivery_rate 5786688bps delivered:1485 busy:2991ms unacked:12
> > > rcv_space:57088 rcv_ssthresh:57088 notsent:188240
> > > minrtt:23.319 snd_wnd:57088
> > >
> > > limit=(27764216 / 8) / 1024 = 3389 < 7712
> > > So the samll queue check fails. When it happens, the throughput is
> > > obviously less than the normal situation.
> > >
> > > By setting the tcp_is_cwnd_limited to true when the small queue check
> > > failed, we can avoid this issue, the cwnd could increase to a reasonalbe
> > > size, in my test environment, it is about 4000. Then the small queue
> > > check won't fail.
> >
> >
> > >
> > > Signed-off-by: Peng Yu <peng.yu@...baba-inc.com>
> > > ---
> > >  net/ipv4/tcp_output.c | 4 +++-
> > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > > index b94efb3050d2..8c70acf3a060 100644
> > > --- a/net/ipv4/tcp_output.c
> > > +++ b/net/ipv4/tcp_output.c
> > > @@ -2985,8 +2985,10 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
> > >                     unlikely(tso_fragment(sk, skb, limit, mss_now, gfp)))
> > >                         break;
> > >
> > > -               if (tcp_small_queue_check(sk, skb, 0))
> > > +               if (tcp_small_queue_check(sk, skb, 0)) {
> > > +                       is_cwnd_limited = true;
> > >                         break;
> > > +               }
> > >
> > >                 /* Argh, we hit an empty skb(), presumably a thread
> > >                  * is sleeping in sendmsg()/sk_stream_wait_memory().
> > > --
> > > 2.47.3
> >
> > Sorry this makes no sense to me.  CWND_LIMITED should not be hijacked.
> >
> > Something else is preventing your flows to get to nominal speed,
> > because we have not seen anything like that.
> >
> > It is probably a driver issue or a receive side issue : Instead of
> > trying to work around the issue, please root cause it.
>