Message-ID: <CANn89i+=rqOAi3SJ0yj47x9X=ScDX5-dD2GmAVRsVGNP9XDBEw@mail.gmail.com>
Date: Fri, 31 Oct 2025 02:06:49 -0700
From: Eric Dumazet <edumazet@...gle.com>
To: Christoph Schwarz <cschwarz@...sta.com>
Cc: Neal Cardwell <ncardwell@...gle.com>, netdev@...r.kernel.org
Subject: Re: TCP sender stuck despite receiving ACKs from the peer
On Thu, Oct 23, 2025 at 10:57 PM Eric Dumazet <edumazet@...gle.com> wrote:
>
> On Thu, Oct 23, 2025 at 10:29 PM Eric Dumazet <edumazet@...gle.com> wrote:
> >
> > On Thu, Oct 23, 2025 at 3:52 PM Christoph Schwarz <cschwarz@...sta.com> wrote:
> > >
> > > On 10/3/25 18:24, Neal Cardwell wrote:
> > > [...]
> > > > Thanks for the report!
> > > >
> > > > A few thoughts:
> > > >
> > > [...]
> > > >
> > > > (2) After that, would it be possible to try this test with a newer
> > > > kernel? You mentioned this is with kernel version 5.10.165, but that's
> > > > more than 2.5 years old at this point, and it's possible the bug has
> > > > been fixed since then.  Could you please try this test with the newest
> > > > kernel that is available in your distribution? (If you are forced to
> > > > use 5.10.x on your distribution, note that even with 5.10.x there is
> > > > v5.10.245, which was released yesterday.)
> > > >
> > > > (3) If this bug is still reproducible with a recent kernel, would it
> > > > be possible to gather .pcap traces from both client and server,
> > > > including SYN and SYN/ACK? Sometimes it can be helpful to see the
> > > > perspective of both ends, especially if there are middleboxes
> > > > manipulating the packets in some way.
> > > >
> > > > Thanks!
> > > >
> > > > Best regards,
> > > > neal
> > >
> > > Hi,
> > >
> > > I want to give an update as we made some progress.
> > >
> > > We tried with the 6.12.40 kernel, but it was much harder to reproduce
> > > and we were not able to do a successful packet capture and reproduction
> > > at the same time. So we went back to 5.10.165, added more tracing and
> > > eventually figured out how the TCP connection got into the bad state.
> > >
> > > This is a backtrace from the TCP stack calling down to the device driver:
> > >   => fdev_tx    // ndo_start_xmit hook of a proprietary device driver
> > >   => dev_hard_start_xmit
> > >   => sch_direct_xmit
> > >   => __qdisc_run
> > >   => __dev_queue_xmit
> > >   => vlan_dev_hard_start_xmit
> > >   => dev_hard_start_xmit
> > >   => __dev_queue_xmit
> > >   => ip_finish_output2
> > >   => __ip_queue_xmit
> > >   => __tcp_transmit_skb
> > >   => tcp_write_xmit
> > >
> > > tcp_write_xmit sends segments of 65160 bytes. Due to an MSS of 1448,
> > > they get broken down into 45 packets of 1448 bytes each.
> >
> > So the driver does not support TSO ? Quite odd in 2025...
> >
> > One thing you want to make sure of is that your vlan device (the one
> > without a Qdisc on it)
> > advertises TSO support.
> >
> > ethtool -k vlan0
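For example (assuming the device is really named vlan0; these are standard ethtool flag names, shown here only as a hedged illustration):

```shell
# Check whether the vlan device advertises TSO
ethtool -k vlan0 | grep tcp-segmentation-offload

# If the feature is supported but off, enable it; whether it sticks
# depends on the parent device's feature set
ethtool -K vlan0 tso on
```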
> >
> >
> > > These 45
> > > packets eventually reach dev_hard_start_xmit, which is a simple loop
> > > forwarding packets one by one. When the problem occurs, we see that
> > > dev_hard_start_xmit transmits the initial N packets successfully, but
> > > the remaining 45-N ones fail with error code 1. The loop runs to
> > > completion and does not break.
> > >
> > > The error code 1 from dev_hard_start_xmit gets returned through the call
> > > stack up to tcp_write_xmit, which treats this as error and breaks its
> > > own loop without advancing snd_nxt:
> > >
> > >                 if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp)))
> > >                         break; // <<< breaks here
> > >
> > > repair:
> > >                 /* Advance the send_head.  This one is sent out.
> > >                  * This call will increment packets_out.
> > >                  */
> > >                 tcp_event_new_data_sent(sk, skb);
> > >
> > > From packet captures we can prove that the 45 packets show up on the
> > > kernel device on the sender. In addition, the first N of those 45
> > > packets show up on the kernel device on the peer. The connection is now
> > > in the problem state where the peer is N packets ahead of the sender and
> > > the sender thinks that it never sent those packets, leading to the problem as
> > > described in my initial mail.
> > >
> > > Furthermore, we noticed that the 45-N missing packets show up as drops
> > > on the sender's kernel device:
> > >
> > > vlan0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
> > >          inet 127.2.0.1  netmask 255.255.255.0  broadcast 0.0.0.0
> > >          [...]
> > >          TX errors 0  dropped 36 overruns 0  carrier 0  collisions 0
> > >
> > > This device is a vlan device stacked on another device like this:
> > >
> > > 49: vlan0@...ent: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc
> > > noqueue state UP mode DEFAULT group default qlen 1000
> > >      link/ether 02:1c:a7:00:00:01 brd ff:ff:ff:ff:ff:ff
> > > 3: parent: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 10000 qdisc prio state
> > > UNKNOWN mode DEFAULT group default qlen 1000
> > >      link/ether xx:xx:xx:xx:xx:xx brd ff:ff:ff:ff:ff:ff
> > >
> > > Eventually packets need to go through the device driver, which has only
> > > a limited number of TX buffers. The driver implements flow control: when
> > > it is about to exhaust its buffers, it stops TX by calling
> > > netif_stop_queue. Once more buffers become available again, it resumes
> > > TX by calling netif_wake_queue. From packet counters we can tell that
> > > this is happening frequently.
> > >
> > > At this point we suspected "qdisc noqueue" to be a factor, and indeed,
> > > after adding a queue to vlan0 the problem no longer happened, although
> > > there are still TX drops on the vlan0 device.
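One way to attach such a queue (a hedged example; the choice of fq_codel here is arbitrary) is to replace the implicit noqueue discipline with a real qdisc, so packets the driver cannot take immediately are queued and requeued rather than dropped:

```shell
# Replace the implicit noqueue discipline with a real queue on vlan0
tc qdisc replace dev vlan0 root fq_codel

# Confirm it took effect
tc qdisc show dev vlan0
```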
> > >
> > > Missing queue or not, we think there is a disconnect between the device
> > > driver API and the TCP stack. The device driver API only allows
> > > transmitting packets one by one (ndo_start_xmit). The TCP stack operates
> > > on larger segments that it breaks down into smaller pieces
> > > (tcp_write_xmit / __tcp_transmit_skb). This can lead to a classic "short
> > > write" condition which the network stack doesn't seem to handle well in
> > > all cases.
> > >
> > > Appreciate your comments,
> >
> > Very nice analysis, very much appreciated.
> >
> > I think the issue here is that __tcp_transmit_skb() trusts the return
> > of icsk->icsk_af_ops->queue_xmit()
> >
> > An error means : the packet was _not_ sent at all.
> >
> > Here, it seems that the GSO layer returns an error, even if some
> > segments were sent.
> > This needs to be confirmed and fixed, but in the meantime, make sure
> > vlan0 has TSO support.
> > It will also be more efficient to segment (if your ethernet device has
> > no TSO capability) at the last moment,
> > because all the segments will then be sent in the described scenario
> > thanks to qdisc requeues.
>
> Could you try the following patch ?
>
> Thanks again !
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 378c2d010faf251ffd874ebf0cc3dd6968eee447..8efda845611129920a9ae21d5e9dd05ffab36103 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -4796,6 +4796,8 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
>                  * to -1 or to their cpu id, but not to our id.
>                  */
>                 if (READ_ONCE(txq->xmit_lock_owner) != cpu) {
> +                       struct sk_buff *orig;
> +
>                         if (dev_xmit_recursion())
>                                 goto recursion_alert;
>
> @@ -4805,6 +4807,7 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
>
>                         HARD_TX_LOCK(dev, txq, cpu);
>
> +                       orig = skb;
>                         if (!netif_xmit_stopped(txq)) {
>                                 dev_xmit_recursion_inc();
>                                 skb = dev_hard_start_xmit(skb, dev, txq, &rc);
> @@ -4817,6 +4820,11 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
>                         HARD_TX_UNLOCK(dev, txq);
>                         net_crit_ratelimited("Virtual device %s asks to queue packet!\n",
>                                              dev->name);
> +                       if (skb != orig) {
> +                               /* If at least one packet was sent, we must return NETDEV_TX_OK */
> +                               rc = NETDEV_TX_OK;
> +                               goto unlock;
> +                       }
>                 } else {
>                         /* Recursion is detected! It is possible,
>                          * unfortunately
> @@ -4828,6 +4836,7 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
>         }
>
>         rc = -ENETDOWN;
> +unlock:
>         rcu_read_unlock_bh();
>
>         dev_core_stats_tx_dropped_inc(dev);
Hi Christoph
Any progress on your side ?
Thanks.