Message-ID: <CANn89iJcLepEin7EtBETrZ36bjoD9LrR=k4cfwWh046GB+4f9A@mail.gmail.com>
Date: Thu, 23 Oct 2025 22:57:36 -0700
From: Eric Dumazet <edumazet@...gle.com>
To: Christoph Schwarz <cschwarz@...sta.com>
Cc: Neal Cardwell <ncardwell@...gle.com>, netdev@...r.kernel.org
Subject: Re: TCP sender stuck despite receiving ACKs from the peer

On Thu, Oct 23, 2025 at 10:29 PM Eric Dumazet <edumazet@...gle.com> wrote:
>
> On Thu, Oct 23, 2025 at 3:52 PM Christoph Schwarz <cschwarz@...sta.com> wrote:
> >
> > On 10/3/25 18:24, Neal Cardwell wrote:
> > [...]
> > > Thanks for the report!
> > >
> > > A few thoughts:
> > >
> > [...]
> > >
> > > (2) After that, would it be possible to try this test with a newer
> > > kernel? You mentioned this is with kernel version 5.10.165, but that's
> > > more than 2.5 years old at this point, and it's possible the bug has
> > > been fixed since then.  Could you please try this test with the newest
> > > kernel that is available in your distribution? (If you are forced to
> > > use 5.10.x on your distribution, note that even with 5.10.x there is
> > > v5.10.245, which was released yesterday.)
> > >
> > > (3) If this bug is still reproducible with a recent kernel, would it
> > > be possible to gather .pcap traces from both client and server,
> > > including SYN and SYN/ACK? Sometimes it can be helpful to see the
> > > perspective of both ends, especially if there are middleboxes
> > > manipulating the packets in some way.
> > >
> > > Thanks!
> > >
> > > Best regards,
> > > neal
> >
> > Hi,
> >
> > I want to give an update as we made some progress.
> >
> > We tried with the 6.12.40 kernel, but the problem was much harder to
> > reproduce, and we never managed to get a packet capture and a
> > reproduction at the same time. So we went back to 5.10.165, added more
> > tracing, and eventually figured out how the TCP connection got into
> > the bad state.
> >
> > This is a backtrace from the TCP stack calling down to the device driver:
> >   => fdev_tx    // ndo_start_xmit hook of a proprietary device driver
> >   => dev_hard_start_xmit
> >   => sch_direct_xmit
> >   => __qdisc_run
> >   => __dev_queue_xmit
> >   => vlan_dev_hard_start_xmit
> >   => dev_hard_start_xmit
> >   => __dev_queue_xmit
> >   => ip_finish_output2
> >   => __ip_queue_xmit
> >   => __tcp_transmit_skb
> >   => tcp_write_xmit
> >
> > tcp_write_xmit sends segments of 65160 bytes. Due to an MSS of 1448,
> > they get broken down into 45 packets of 1448 bytes each.
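> >
> > (For reference, the arithmetic, assuming the usual 64 KiB GSO ceiling;
> > the 64 KiB figure is my assumption here:
> >
> >   65536 / 1448 = 45   (integer division)
> >   45 * 1448 = 65160   bytes per GSO segment
> >
> > i.e. 65160 is the largest multiple of the MSS that fits.)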
>
> So the driver does not support TSO? Quite odd in 2025...
>
> One thing you want is to make sure your vlan device (the one without a
> Qdisc on it) advertises TSO support:
>
> ethtool -k vlan0
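>
> For example (illustrative output, not from this system):
>
> $ ethtool -k vlan0 | grep tcp-segmentation-offload
> tcp-segmentation-offload: on
>
> and to turn it on if it is off:
>
> $ ethtool -K vlan0 tso on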
>
>
> > These 45
> > packets eventually reach dev_hard_start_xmit, which is a simple loop
> > forwarding packets one by one. When the problem occurs, we see that
> > dev_hard_start_xmit transmits the initial N packets successfully, but
> > the remaining 45-N ones fail with error code 1. The loop runs to
> > completion and does not break.
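> >
> > (Error code 1 presumably corresponds to NET_XMIT_DROP from
> > include/linux/netdevice.h:
> >
> >   #define NET_XMIT_SUCCESS        0x00
> >   #define NET_XMIT_DROP           0x01    /* skb dropped */
> >   #define NET_XMIT_CN             0x02    /* congestion notification */
> >
> > but we have not verified where exactly it originates.)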
> >
> > The error code 1 from dev_hard_start_xmit gets returned through the call
> > stack up to tcp_write_xmit, which treats this as an error and breaks its
> > own loop without advancing snd_nxt:
> >
> >                 if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp)))
> >                         break; // <<< breaks here
> >
> > repair:
> >                 /* Advance the send_head.  This one is sent out.
> >                  * This call will increment packets_out.
> >                  */
> >                 tcp_event_new_data_sent(sk, skb);
> >
> > From packet captures we can prove that the 45 packets show up on the
> > kernel device on the sender. In addition, the first N of those 45
> > packets show up on the kernel device on the peer. The connection is now
> > in the problem state where the peer is N packets ahead of the sender and
> > the sender thinks that it never sent those packets, leading to the
> > problem described in my initial mail.
> >
> > Furthermore, we noticed that the 45-N missing packets show up as drops
> > on the sender's kernel device:
> >
> > vlan0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
> >          inet 127.2.0.1  netmask 255.255.255.0  broadcast 0.0.0.0
> >          [...]
> >          TX errors 0  dropped 36 overruns 0  carrier 0  collisions 0
> >
> > This device is a vlan device stacked on another device like this:
> >
> > 49: vlan0@...ent: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc
> > noqueue state UP mode DEFAULT group default qlen 1000
> >      link/ether 02:1c:a7:00:00:01 brd ff:ff:ff:ff:ff:ff
> > 3: parent: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 10000 qdisc prio state
> > UNKNOWN mode DEFAULT group default qlen 1000
> >      link/ether xx:xx:xx:xx:xx:xx brd ff:ff:ff:ff:ff:ff
> >
> > Eventually packets need to go through the device driver, which has only
> > a limited number of TX buffers. The driver implements flow control: when
> > it is about to exhaust its buffers, it stops TX by calling
> > netif_stop_queue. Once more buffers become available again, it resumes
> > TX by calling netif_wake_queue. From packet counters we can tell that
> > this is happening frequently.
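> >
> > Schematically, the driver's transmit path looks like this (a
> > hypothetical sketch, not the actual proprietary code; the fdev_*
> > helper names are made up):
> >
> > static netdev_tx_t fdev_tx(struct sk_buff *skb, struct net_device *dev)
> > {
> >         struct fdev_priv *priv = netdev_priv(dev);
> >
> >         if (fdev_tx_ring_full(priv)) {
> >                 /* No TX buffer left: tell the stack to stop sending. */
> >                 netif_stop_queue(dev);
> >                 return NETDEV_TX_BUSY;  /* skb is not consumed */
> >         }
> >
> >         fdev_post_buffer(priv, skb);    /* hand the skb to the hardware */
> >
> >         if (fdev_tx_ring_almost_full(priv))
> >                 netif_stop_queue(dev);  /* stop before the next one fails */
> >
> >         return NETDEV_TX_OK;
> > }
> >
> > /* and from the TX completion handler, once buffers are free again: */
> >         netif_wake_queue(dev);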
> >
> > At this point we suspected "qdisc noqueue" to be a factor, and indeed,
> > after adding a queue to vlan0 the problem no longer happened, although
> > there are still TX drops on the vlan0 device.
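> >
> > (A queue can be added with something along these lines; illustrative
> > command, any real qdisc should do:
> >
> >   tc qdisc add dev vlan0 root pfifo_fast
> > )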
> >
> > Missing queue or not, we think there is a disconnect between the device
> > driver API and the TCP stack. The device driver API only allows
> > transmitting packets one by one (ndo_start_xmit). The TCP stack operates
> > on larger segments that it breaks down into smaller pieces
> > (tcp_write_xmit / __tcp_transmit_skb). This can lead to a classic "short
> > write" condition, which the network stack doesn't seem to handle well in
> > all cases. An analogy follows below.
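> >
> > As an analogy (not actual stack code): a syscall can report partial
> > progress, but the xmit contract cannot:
> >
> >   ssize_t n = write(fd, buf, len);        /* may report a short write */
> >   netdev_tx_t rc = ops->ndo_start_xmit(skb, dev);  /* all or nothing */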
> >
> > Appreciate your comments,
>
> Very nice analysis, very much appreciated.
>
> I think the issue here is that __tcp_transmit_skb() trusts the return
> value of icsk->icsk_af_ops->queue_xmit().
>
> An error means: the packet was _not_ sent at all.
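>
> For context, this is the shape of the tail of __tcp_transmit_skb()
> (paraphrased from net/ipv4/tcp_output.c, not verbatim):
>
>         err = icsk->icsk_af_ops->queue_xmit(sk, skb, &inet->cork.fl);
>
>         if (unlikely(err > 0)) {
>                 tcp_enter_cwr(sk);
>                 err = net_xmit_eval(err); /* NET_XMIT_CN -> 0, others kept */
>         }
>         ...
>         return err;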
>
> Here, it seems that the GSO layer returns an error, even if some
> segments were sent.
> This needs to be confirmed and fixed, but in the meantime, make sure
> vlan0 has TSO support.
> It will also be more efficient to segment at the last moment (if your
> ethernet device has no TSO capability), because in the described
> scenario all the segments will be sent, thanks to qdisc requeues.
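>
> For reference, the requeue lives in sch_direct_xmit(), whose shape is
> roughly (paraphrased from net/sched/sch_generic.c):
>
>         skb = dev_hard_start_xmit(skb, dev, txq, &ret);
>         ...
>         if (!dev_xmit_complete(ret)) {
>                 /* Driver returned NETDEV_TX_BUSY - requeue skb */
>                 dev_requeue_skb(skb, q);
>         }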

Could you try the following patch?

Thanks again!

diff --git a/net/core/dev.c b/net/core/dev.c
index 378c2d010faf251ffd874ebf0cc3dd6968eee447..8efda845611129920a9ae21d5e9dd05ffab36103 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4796,6 +4796,8 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
                 * to -1 or to their cpu id, but not to our id.
                 */
                if (READ_ONCE(txq->xmit_lock_owner) != cpu) {
+                       struct sk_buff *orig;
+
                        if (dev_xmit_recursion())
                                goto recursion_alert;

@@ -4805,6 +4807,7 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)

                        HARD_TX_LOCK(dev, txq, cpu);

+                       orig = skb;
                        if (!netif_xmit_stopped(txq)) {
                                dev_xmit_recursion_inc();
                                skb = dev_hard_start_xmit(skb, dev, txq, &rc);
@@ -4817,6 +4820,11 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
                        HARD_TX_UNLOCK(dev, txq);
                        net_crit_ratelimited("Virtual device %s asks to queue packet!\n",
                                             dev->name);
+                       if (skb != orig) {
+                               /* If at least one packet was sent, we must return NETDEV_TX_OK */
+                               rc = NETDEV_TX_OK;
+                               goto unlock;
+                       }
                } else {
                        /* Recursion is detected! It is possible,
                         * unfortunately
@@ -4828,6 +4836,7 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
        }

        rc = -ENETDOWN;
+unlock:
        rcu_read_unlock_bh();

        dev_core_stats_tx_dropped_inc(dev);
