netdev - Re: [PATCH] net: set is_cwnd_limited when the small queue check fails

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CANn89iJN4V8SeythtQVrSjhztWmCySdAxR8h35i4Ea2ceq9k8w@mail.gmail.com>
Date: Sun, 19 Oct 2025 20:59:54 -0700
From: Eric Dumazet <edumazet@...gle.com>
To: "YU, Peng" <peng.yu@...baba-inc.com>
Cc: Peng Yu <yupeng0921@...il.com>, ncardwell <ncardwell@...gle.com>, 
	kuniyu <kuniyu@...gle.com>, netdev <netdev@...r.kernel.org>, 
	linux-kernel <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] net: set is_cwnd_limited when the small queue check fails

On Sun, Oct 19, 2025 at 4:00 PM YU, Peng <peng.yu@...baba-inc.com> wrote:
>
> I think we know the root cause in the driver. We are using the
> virtio_net driver. We found that the issue happens after this driver
> commit:
>
> b92f1e6751a6 virtio-net: transmit napi
>
> According to our test, the issue will happen if we apply below change:
>
>
>  static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
>  {
>         struct virtio_net_hdr_mrg_rxbuf *hdr;
> @@ -1130,6 +1174,7 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
>         int err;
>         struct netdev_queue *txq = netdev_get_tx_queue(dev, qnum);
>         bool kick = !skb->xmit_more;
> +       bool use_napi = sq->napi.weight;
>
>         /* Free up any pending old buffers before queueing new ones. */
>         free_old_xmit_skbs(sq);
> @@ -1152,8 +1197,10 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
>         }
>
>         /* Don't wait up for transmitted skbs to be freed. */
> -       skb_orphan(skb);
> -       nf_reset(skb);
> +       if (!use_napi) {
> +               skb_orphan(skb);
> +               nf_reset(skb);
> +       }
>
>
> Before this change, the driver will invoke skb_orphan immediately when
> it receives a skb, then the tcp layer will decrease the wmem_alloc.
> Thus the small queue check won't fail. After applying this change, the
> virtio_net driver will tell tcp layer to decrease the wmem_alloc when
> the skp is really sent out.
> If we set use_napi to false, the virtio_net driver will invoke
> skb_orphan immediately as before, then the issue won't happen.
> But invoking skb_orphan in start_xmit looks like a workaround to me,
> I'm not sure if we should rollback this change.  The small queue check
> and cwnd window would come into a kind of "dead lock" situation to me,
> so I suppose we should fix that "dead lock".  If you believe we
> shouldn't change TCP layer for this issue, may I know the correct
> direction to resolve this issue? Should we modify the virtio_net
> driver, let it always invoke skb_orphan as before?
> As a workaround, we set the virtio_net module parameter napi_tx to
> false, then the use_napi would be false too. Thus the issue won't
> happen. But we indeed want to enable napi_tx, so may I know what's
> your suggestion about this issue?
>

I think you should start a conversation with virtio_net experts,
instead of making TCP
bufferbloated again.

TX completions dynamics are important, and we are not going to
penalize all drivers
just because of one.

You are claiming deadlocks, but the mechanisms in place are proven to
work damn well.

>
> ------------------------------------------------------------------
> From:Eric Dumazet <edumazet@...gle.com>
> Send Time:2025 Oct. 20 (Mon.) 01:43
> To:Peng Yu<yupeng0921@...il.com>
> CC:ncardwell<ncardwell@...gle.com>; kuniyu<kuniyu@...gle.com>; netdev<netdev@...r.kernel.org>; "linux-kernel"<linux-kernel@...r.kernel.org>; Peng YU<peng.yu@...baba-inc.com>
> Subject:Re: [PATCH] net: set is_cwnd_limited when the small queue check fails
>
>
> On Sun, Oct 19, 2025 at 10:00 AM Peng Yu <yupeng0921@...il.com> wrote:
> >
> > The limit of the small queue check is calculated from the pacing rate,
> > the pacing rate is calculated from the cwnd. If the cwnd is small,
> > the small queue check may fail.
> > When the samll queue check fails, the tcp layer will send less
> > packages, then the tcp_is_cwnd_limited would alreays return false,
> > then the cwnd would have no chance to get updated.
> > The cwnd has no chance to get updated, it keeps small, then the pacing
> > rate keeps small, and the limit of the small queue check keeps small,
> > then the small queue check would always fail.
> > It is a kind of dead lock, when a tcp flow comes into this situation,
> > it's throughput would be very small, obviously less then the correct
> > throughput it should have.
> > We set is_cwnd_limited to true when the small queue check fails, then
> > the cwnd would have a chance to get updated, then we can break this
> > deadlock.
> >
> > Below ss output shows this issue:
> >
> > skmem:(r0,rb131072,
> > t7712, <------------------------------ wmem_alloc = 7712
> > tb243712,f2128,w219056,o0,bl0,d0)
> > ts sack cubic wscale:7,10 rto:224 rtt:23.364/0.019 ato:40 mss:1448
> > pmtu:8500 rcvmss:536 advmss:8448
> > cwnd:28 <------------------------------ cwnd=28
> > bytes_sent:2166208 bytes_acked:2148832 bytes_received:37
> > segs_out:1497 segs_in:751 data_segs_out:1496 data_segs_in:1
> > send 13882554bps lastsnd:7 lastrcv:2992 lastack:7
> > pacing_rate 27764216bps <--------------------- pacing_rate=27764216bps
> > delivery_rate 5786688bps delivered:1485 busy:2991ms unacked:12
> > rcv_space:57088 rcv_ssthresh:57088 notsent:188240
> > minrtt:23.319 snd_wnd:57088
> >
> > limit=(27764216 / 8) / 1024 = 3389 < 7712
> > So the samll queue check fails. When it happens, the throughput is
> > obviously less than the normal situation.
> >
> > By setting the tcp_is_cwnd_limited to true when the small queue check
> > failed, we can avoid this issue, the cwnd could increase to a reasonalbe
> > size, in my test environment, it is about 4000. Then the small queue
> > check won't fail.
>
>
> >
> > Signed-off-by: Peng Yu <peng.yu@...baba-inc.com>
> > ---
> >  net/ipv4/tcp_output.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > index b94efb3050d2..8c70acf3a060 100644
> > --- a/net/ipv4/tcp_output.c
> > +++ b/net/ipv4/tcp_output.c
> > @@ -2985,8 +2985,10 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
> >                     unlikely(tso_fragment(sk, skb, limit, mss_now, gfp)))
> >                         break;
> >
> > -               if (tcp_small_queue_check(sk, skb, 0))
> > +               if (tcp_small_queue_check(sk, skb, 0)) {
> > +                       is_cwnd_limited = true;
> >                         break;
> > +               }
> >
> >                 /* Argh, we hit an empty skb(), presumably a thread
> >                  * is sleeping in sendmsg()/sk_stream_wait_memory().
> > --
> > 2.47.3
>
> Sorry this makes no sense to me.  CWND_LIMITED should not be hijacked.
>
> Something else is preventing your flows to get to nominal speed,
> because we have not seen anything like that.
>
> It is probably a driver issue or a receive side issue : Instead of
> trying to work around the issue, please root cause it.