Message-ID: <CANn89iKpZ5aLNpv66B9M4R1d_Pn5ZX=8-XaiyCLgKRy3marUtQ@mail.gmail.com>
Date: Thu, 5 Jun 2025 15:11:43 -0700
From: Eric Dumazet <edumazet@...gle.com>
To: Toke Høiland-Jørgensen <toke@...hat.com>
Cc: Marcus Wichelmann <marcus.wichelmann@...zner-cloud.de>, 
	Jesper Dangaard Brouer <hawk@...nel.org>, bpf@...r.kernel.org, netdev@...r.kernel.org, 
	Alexei Starovoitov <ast@...nel.org>, Daniel Borkmann <daniel@...earbox.net>, 
	John Fastabend <john.fastabend@...il.com>, Andrew Lunn <andrew+netdev@...n.ch>, 
	"David S. Miller" <davem@...emloft.net>, Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>, 
	Jamal Hadi Salim <jhs@...atatu.com>, Cong Wang <xiyou.wangcong@...il.com>, 
	Jiri Pirko <jiri@...nulli.us>, linux-kernel@...r.kernel.org
Subject: Re: [BUG] veth: TX drops with NAPI enabled and crash in combination with qdisc

On Thu, Jun 5, 2025 at 9:46 AM Eric Dumazet <edumazet@...gle.com> wrote:
>
> On Thu, Jun 5, 2025 at 9:15 AM Toke Høiland-Jørgensen <toke@...hat.com> wrote:
> >
> > Marcus Wichelmann <marcus.wichelmann@...zner-cloud.de> writes:
> >
> > > Hi,
> > >
> > > while experimenting with XDP_REDIRECT from a veth-pair to another interface, I
> > > noticed that the veth-pair loses lots of packets when multiple TCP streams go
> > > through it, resulting in stalled TCP connections and noticeable instability.
> > >
> > > This doesn't seem to be an issue with just XDP but rather occurs whenever the
> > > NAPI mode of the veth driver is active.
> > > I managed to reproduce the same behavior just by bringing the veth-pair into
> > > NAPI mode (see commit d3256efd8e8b ("veth: allow enabling NAPI even without
> > > XDP")) and running multiple TCP streams through it using a network namespace.
> > >
> > > Here is how I reproduced it:
> > >
> > >   ip netns add lb
> > >   ip link add dev to-lb type veth peer name in-lb netns lb
> > >
> > >   # Enable NAPI
> > >   ethtool -K to-lb gro on
> > >   ethtool -K to-lb tso off
> > >   ip netns exec lb ethtool -K in-lb gro on
> > >   ip netns exec lb ethtool -K in-lb tso off
> > >
> > >   ip link set dev to-lb up
> > >   ip -netns lb link set dev in-lb up
> > >
> > > Then run an HTTP server inside the "lb" namespace that serves a large file:
> > >
> > >   fallocate -l 10G testfiles/10GB.bin
> > >   caddy file-server --root testfiles/
> > >
> > > Download this file from within the root namespace multiple times in parallel:
> > >
> > >   curl http://[fe80::...%to-lb]/10GB.bin -o /dev/null
> > >
> > > In my tests, I ran four curls in parallel, and after just a few seconds,
> > > three of them stalled while the remaining one "won" the full bandwidth and
> > > completed the download.
> > >
> > > This is probably a result of the veth's ptr_ring running full, causing many
> > > packet drops on TX, and the TCP congestion control reacting to that.
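> > >
> > > (For reference, the TX drops are easy to watch while the curls run; the
> > > per-queue ethtool counter below is a guess and may not be exposed on every
> > > kernel, but the standard per-link counter is enough to see the effect:)
> > >
> > >   # the "dropped" field in the TX section grows during the stalls
> > >   ip -s link show dev to-lb
> > >
> > >   # optionally, per-queue veth stats on the peer side, if available
> > >   ip netns exec lb ethtool -S in-lb | grep -i drop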
> > >
> > > In this context, I also noticed Jesper's patch, which describes a very
> > > similar issue and should help to resolve this:
> > >   commit dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to
> > >   reduce TX drops")
> > >
> > > But when repeating the above test with the latest mainline, which includes this
> > > patch, and enabling a qdisc via
> > >   tc qdisc add dev in-lb root sfq perturb 10
> > > the kernel crashed just after starting the second TCP stream (see output below).
> > >
> > > So I have two questions:
> > > - Is my understanding of the described issue correct and is Jesper's patch
> > >   sufficient to solve this?
> >
> > Hmm, yeah, this does sound likely.
> >
> > > - Is my qdisc configuration correct for making use of this patch, and is the
> > >   kernel crash likely a bug?
> > >
> > > ------------[ cut here ]------------
> > > UBSAN: array-index-out-of-bounds in net/sched/sch_sfq.c:203:12
> > > index 65535 is out of range for type 'sfq_head [128]'
> >
> > This (the 'index 65535') kinda screams "integer underflow". So certainly
> > looks like a kernel bug, yeah. Don't see any obvious reason why Jesper's
> > patch would trigger this; maybe Eric has an idea?
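> >
> > (Purely illustrative sketch, not the actual sch_sfq code path: 65535 is
> > 0xffff, i.e. what a u16 wraps to when it drops below zero, and also the
> > sort of value typically used as an "empty slot" sentinel. Either way,
> > indexing a 128-entry array with it lands far out of bounds. The sentinel
> > name below is made up for the example.)
> >
> >   #include <stdio.h>
> >
> >   #define EMPTY_SLOT 0xffff          /* hypothetical sentinel value */
> >
> >   int main(void)
> >   {
> >           unsigned short idx = 0;
> >
> >           idx--;                      /* 0 - 1 wraps to 65535 on a u16 */
> >           printf("underflow: %u\n", idx);
> >
> >           idx = EMPTY_SLOT;           /* or: a sentinel leaks into indexing */
> >           printf("sentinel:  %u\n", idx);
> >           return 0;
> >   }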
> >
> > Does this happen with other qdiscs as well, or is it specific to sfq?
>
> This seems like a bug in sfq; we already had recent fixes in it, and
> other fixes in net/sched around qdisc_tree_reduce_backlog().
>
> It is possible qdisc_pkt_len() could be wrong in this use case (TSO off?)

This seems to be a very old bug, indeed caused by sch->gso_skb's
contribution to sch->q.qlen.

diff --git a/net/sched/sch_sfq.c b/net/sched/sch_sfq.c
index b912ad99aa15d95b297fb28d0fd0baa9c21ab5cd..77fa02f2bfcd56a36815199aa2e7987943ea226f 100644
--- a/net/sched/sch_sfq.c
+++ b/net/sched/sch_sfq.c
@@ -310,7 +310,10 @@ static unsigned int sfq_drop(struct Qdisc *sch, struct sk_buff **to_free)
                /* It is difficult to believe, but ALL THE SLOTS HAVE LENGTH 1. */
                x = q->tail->next;
                slot = &q->slots[x];
-               q->tail->next = slot->next;
+               if (slot->next == x)
+                       q->tail = NULL; /* no more active slots */
+               else
+                       q->tail->next = slot->next;
                q->ht[slot->hash] = SFQ_EMPTY_SLOT;
                goto drop;
        }
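
For what it's worth, here is a minimal user-space sketch of the invariant that
hunk restores (pointer-based, unlike sfq's index-based slots, and purely
illustrative): when the victim is the only active slot, the tail must be
cleared, otherwise it keeps pointing at a slot that is no longer active.

#include <stdio.h>
#include <stddef.h>

struct slot {
	int id;
	struct slot *next;              /* next slot in the circular active list */
};

/* Drop the oldest active slot; return the new tail (NULL if the list empties). */
static struct slot *drop_oldest(struct slot *tail)
{
	struct slot *victim = tail->next;

	if (victim->next == victim)     /* victim was the only active slot */
		return NULL;            /* analogous to "q->tail = NULL" above */

	tail->next = victim->next;      /* unlink victim, tail stays valid */
	return tail;
}

int main(void)
{
	struct slot only = { .id = 1 };

	only.next = &only;              /* single-element circular list */
	printf("tail after drop: %s\n",
	       drop_oldest(&only) ? "stale (the old behaviour)" : "NULL (as in the fix)");
	return 0;
}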
