Message-ID: <CANn89iKFm1ER904sUUh5v_e29QvkFAZXe4yOJfeoo9VLx616iA@mail.gmail.com>
Date: Tue, 10 Jun 2025 07:49:24 -0700
From: Eric Dumazet <edumazet@...gle.com>
To: Marcus Wichelmann <marcus.wichelmann@...zner-cloud.de>
Cc: Toke Høiland-Jørgensen <toke@...hat.com>,
Jesper Dangaard Brouer <hawk@...nel.org>, bpf@...r.kernel.org, netdev@...r.kernel.org,
Alexei Starovoitov <ast@...nel.org>, Daniel Borkmann <daniel@...earbox.net>,
John Fastabend <john.fastabend@...il.com>, Andrew Lunn <andrew+netdev@...n.ch>,
"David S. Miller" <davem@...emloft.net>, Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
Jamal Hadi Salim <jhs@...atatu.com>, Cong Wang <xiyou.wangcong@...il.com>,
Jiri Pirko <jiri@...nulli.us>, linux-kernel@...r.kernel.org
Subject: Re: [BUG] veth: TX drops with NAPI enabled and crash in combination with qdisc

On Tue, Jun 10, 2025 at 7:41 AM Marcus Wichelmann
<marcus.wichelmann@...zner-cloud.de> wrote:
>
> Am 06.06.25 um 11:06 schrieb Eric Dumazet:
> > On Thu, Jun 5, 2025 at 3:17 PM Marcus Wichelmann
> > <marcus.wichelmann@...zner-cloud.de> wrote:
> >>
> >> Am 06.06.25 um 00:11 schrieb Eric Dumazet:
> >>> On Thu, Jun 5, 2025 at 9:46 AM Eric Dumazet <edumazet@...gle.com> wrote:
> >>>>
> >>>> On Thu, Jun 5, 2025 at 9:15 AM Toke Høiland-Jørgensen <toke@...hat.com> wrote:
> >>>>>
> >>>>> Marcus Wichelmann <marcus.wichelmann@...zner-cloud.de> writes:
> >>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> while experimenting with XDP_REDIRECT from a veth-pair to another interface, I
> >>>>>> noticed that the veth-pair loses lots of packets when multiple TCP streams go
> >>>>>> through it, resulting in stalling TCP connections and noticeable instabilities.
> >>>>>>
> >>>>>> This doesn't seem to be an issue with just XDP but rather occurs whenever the
> >>>>>> NAPI mode of the veth driver is active.
> >>>>>> I managed to reproduce the same behavior just by bringing the veth-pair into
> >>>>>> NAPI mode (see commit d3256efd8e8b ("veth: allow enabling NAPI even without
> >>>>>> XDP")) and running multiple TCP streams through it using a network namespace.
> >>>>>>
> >>>>>> Here is how I reproduced it:
> >>>>>>
> >>>>>> ip netns add lb
> >>>>>> ip link add dev to-lb type veth peer name in-lb netns lb
> >>>>>>
> >>>>>> # Enable NAPI
> >>>>>> ethtool -K to-lb gro on
> >>>>>> ethtool -K to-lb tso off
> >>>>>> ip netns exec lb ethtool -K in-lb gro on
> >>>>>> ip netns exec lb ethtool -K in-lb tso off
> >>>>>>
> >>>>>> ip link set dev to-lb up
> >>>>>> ip -netns lb link set dev in-lb up
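> >>>>>>
> >>>>>> (Optionally, the offload toggles can be double-checked with ethtool, e.g.:)
> >>>>>>
> >>>>>> ethtool -k to-lb | grep -E 'generic-receive-offload|tcp-segmentation-offload'
> >>>>>> ip netns exec lb ethtool -k in-lb | grep -E 'generic-receive-offload|tcp-segmentation-offload'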
> >>>>>>
> >>>>>> Then run an HTTP server inside the "lb" namespace that serves a large file:
> >>>>>>
> >>>>>> fallocate -l 10G testfiles/10GB.bin
> >>>>>> caddy file-server --root testfiles/
> >>>>>>
> >>>>>> Download this file from within the root namespace multiple times in parallel:
> >>>>>>
> >>>>>> curl http://[fe80::...%to-lb]/10GB.bin -o /dev/null
> >>>>>>
> >>>>>> In my tests, I ran four parallel curls at the same time, and after just a few
> >>>>>> seconds three of them stalled while the remaining one "won" the full bandwidth
> >>>>>> and completed the download.
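> >>>>>>
> >>>>>> The parallel downloads can be started with something like this (with the "..."
> >>>>>> replaced by the actual link-local address of in-lb, as above):
> >>>>>>
> >>>>>> for i in 1 2 3 4; do
> >>>>>>     curl "http://[fe80::...%to-lb]/10GB.bin" -o /dev/null &
> >>>>>> done
> >>>>>> wait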
> >>>>>>
> >>>>>> This is probably a result of the veth's ptr_ring running full, causing many
> >>>>>> packet drops on TX, and the TCP congestion control reacting to that.
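> >>>>>>
> >>>>>> The drops should be visible in the TX "dropped" counters of the veth devices
> >>>>>> while the test is running, e.g.:
> >>>>>>
> >>>>>> ip -s link show dev to-lb
> >>>>>> ip netns exec lb ip -s link show dev in-lb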
> >>>>>>
> >>>>>> In this context, I also took notice of Jesper's patch which describes a very
> >>>>>> similar issue and should help to resolve this:
> >>>>>> commit dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to
> >>>>>> reduce TX drops")
> >>>>>>
> >>>>>> But when repeating the above test with the latest mainline, which includes this
> >>>>>> patch, and enabling a qdisc via
> >>>>>> tc qdisc add dev in-lb root sfq perturb 10
> >>>>>> the kernel crashed just after starting the second TCP stream (see output below).
> >>>>>>
> >>>>>> So I have two questions:
> >>>>>> - Is my understanding of the described issue correct and is Jesper's patch
> >>>>>> sufficient to solve this?
> >>>>>
> >>>>> Hmm, yeah, this does sound likely.
> >>>>>
> >>>>>> - Is my qdisc configuration to make use of this patch correct, and is the kernel
> >>>>>> crash likely a bug?
> >>>>>>
> >>>>>> ------------[ cut here ]------------
> >>>>>> UBSAN: array-index-out-of-bounds in net/sched/sch_sfq.c:203:12
> >>>>>> index 65535 is out of range for type 'sfq_head [128]'
> >>>>>
> >>>>> This (the 'index 65535') kinda screams "integer underflow". So certainly
> >>>>> looks like a kernel bug, yeah. Don't see any obvious reason why Jesper's
> >>>>> patch would trigger this; maybe Eric has an idea?
> >>>>>
> >>>>> Does this happen with other qdiscs as well, or is it specific to sfq?
> >>>>
> >>>> This seems like a bug in sfq; we already had recent fixes in it, and
> >>>> other fixes in net/sched vs qdisc_tree_reduce_backlog().
> >>>>
> >>>> It is possible qdisc_pkt_len() could be wrong in this use case (TSO off?)
> >>>
> >>> This seems to be a very old bug, indeed caused by sch->gso_skb
> >>> contribution to sch->q.qlen
> >>>
> >>> diff --git a/net/sched/sch_sfq.c b/net/sched/sch_sfq.c
> >>> index b912ad99aa15d95b297fb28d0fd0baa9c21ab5cd..77fa02f2bfcd56a36815199aa2e7987943ea226f 100644
> >>> --- a/net/sched/sch_sfq.c
> >>> +++ b/net/sched/sch_sfq.c
> >>> @@ -310,7 +310,10 @@ static unsigned int sfq_drop(struct Qdisc *sch, struct sk_buff **to_free)
> >>>                 /* It is difficult to believe, but ALL THE SLOTS HAVE LENGTH 1. */
> >>>                 x = q->tail->next;
> >>>                 slot = &q->slots[x];
> >>> -               q->tail->next = slot->next;
> >>> +               if (slot->next == x)
> >>> +                       q->tail = NULL; /* no more active slots */
> >>> +               else
> >>> +                       q->tail->next = slot->next;
> >>>                 q->ht[slot->hash] = SFQ_EMPTY_SLOT;
> >>>                 goto drop;
> >>>         }
> >>>
> >>
> >> Hi,
> >>
> >> thank you for looking into it.
> >> I'll give your patch a try and will also test other qdiscs when I'm back
> >> in the office.
> >>
> >
> > I have been using this repro:
> >
> > [...]
>
> Hi,
>
> I can confirm that the sfq qdisc is now stable in this setup, thanks to your fix.
>
> I also experimented with other qdiscs and fq_codel works as well.
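>
> For reference, the fq_codel variant of the setup is just something like
> (run inside the lb namespace, or prefixed with "ip netns exec lb"):
>
> tc qdisc replace dev in-lb root fq_codel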
>
> The sfq/fq_codel qdiscs now work hand-in-hand with Jesper's patch to resolve the original
> issue. Multiple TCP connections run very stably, even when NAPI/XDP is active on the veth
> device, and I can see that the packets are being requeued instead of being dropped in the
> veth driver.
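>
> The requeues show up in the qdisc statistics ("requeues" counter), while the
> TX "dropped" counters of the veth devices should stay flat, e.g.:
>
> ip netns exec lb tc -s qdisc show dev in-lb
> ip netns exec lb ip -s link show dev in-lb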
>
> Thank you for your help!

Well, thanks to you for providing a very clean report, including repro
instructions!