Message-ID: <5f19b555-b0fa-472a-a5f3-6673c0b69c5c@hetzner-cloud.de>
Date: Fri, 6 Jun 2025 00:17:37 +0200
From: Marcus Wichelmann <marcus.wichelmann@...zner-cloud.de>
To: Eric Dumazet <edumazet@...gle.com>,
Toke Høiland-Jørgensen <toke@...hat.com>
Cc: Jesper Dangaard Brouer <hawk@...nel.org>, bpf@...r.kernel.org,
netdev@...r.kernel.org, Alexei Starovoitov <ast@...nel.org>,
Daniel Borkmann <daniel@...earbox.net>,
John Fastabend <john.fastabend@...il.com>,
Andrew Lunn <andrew+netdev@...n.ch>, "David S. Miller"
<davem@...emloft.net>, Jakub Kicinski <kuba@...nel.org>,
Paolo Abeni <pabeni@...hat.com>, Jamal Hadi Salim <jhs@...atatu.com>,
Cong Wang <xiyou.wangcong@...il.com>, Jiri Pirko <jiri@...nulli.us>,
linux-kernel@...r.kernel.org
Subject: Re: [BUG] veth: TX drops with NAPI enabled and crash in combination
with qdisc
On 06.06.25 at 00:11, Eric Dumazet wrote:
> On Thu, Jun 5, 2025 at 9:46 AM Eric Dumazet <edumazet@...gle.com> wrote:
>>
>> On Thu, Jun 5, 2025 at 9:15 AM Toke Høiland-Jørgensen <toke@...hat.com> wrote:
>>>
>>> Marcus Wichelmann <marcus.wichelmann@...zner-cloud.de> writes:
>>>
>>>> Hi,
>>>>
>>>> while experimenting with XDP_REDIRECT from a veth-pair to another interface, I
>>>> noticed that the veth-pair loses lots of packets when multiple TCP streams go
>>>> through it, resulting in stalling TCP connections and noticeable instabilities.
>>>>
>>>> This doesn't seem to be an issue with just XDP but rather occurs whenever the
>>>> NAPI mode of the veth driver is active.
>>>> I managed to reproduce the same behavior just by bringing the veth-pair into
>>>> NAPI mode (see commit d3256efd8e8b ("veth: allow enabling NAPI even without
>>>> XDP")) and running multiple TCP streams through it using a network namespace.
>>>>
>>>> Here is how I reproduced it:
>>>>
>>>> ip netns add lb
>>>> ip link add dev to-lb type veth peer name in-lb netns lb
>>>>
>>>> # Enable NAPI
>>>> ethtool -K to-lb gro on
>>>> ethtool -K to-lb tso off
>>>> ip netns exec lb ethtool -K in-lb gro on
>>>> ip netns exec lb ethtool -K in-lb tso off
>>>>
>>>> ip link set dev to-lb up
>>>> ip -netns lb link set dev in-lb up
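
Not part of the original steps, but a quick sanity check that the offload
toggles took effect on both peers (assuming the interface names above):

ethtool -k to-lb | grep -E 'generic-receive-offload|tcp-segmentation-offload'
ip netns exec lb ethtool -k in-lb | grep -E 'generic-receive-offload|tcp-segmentation-offload'
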
>>>>
>>>> Then run an HTTP server inside the "lb" namespace that serves a large file:
>>>>
>>>> fallocate -l 10G testfiles/10GB.bin
>>>> caddy file-server --root testfiles/
>>>>
>>>> Download this file from within the root namespace multiple times in parallel:
>>>>
>>>> curl http://[fe80::...%to-lb]/10GB.bin -o /dev/null
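
The link-local address is elided above; if needed, in-lb's address can be
read out with something like this (assuming the namespace setup from the
beginning):

ip -netns lb -6 addr show dev in-lb scope link
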
>>>>
>>>> In my tests, I ran four parallel curls at the same time; after just a few
>>>> seconds, three of them stalled while the remaining one "won" the full
>>>> bandwidth and completed its download.
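
A rough sketch of how such a parallel run can be scripted; the elided
link-local address still has to be filled in by hand:

URL='http://[fe80::...%to-lb]/10GB.bin'
for i in 1 2 3 4; do curl "$URL" -o /dev/null & done
wait
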
>>>>
>>>> This is probably a result of the veth's ptr_ring running full, causing many
>>>> packet drops on TX, and the TCP congestion control reacting to that.
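
Not in the original report, but the drops are easy to watch while the
downloads run; with the server inside "lb", the bulk data is transmitted by
in-lb, so its TX "dropped" counter is the interesting one:

ip netns exec lb ip -s link show dev in-lb
ip -s link show dev to-lb
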
>>>>
>>>> In this context, I also took notice of Jesper's patch which describes a very
>>>> similar issue and should help to resolve this:
>>>> commit dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to
>>>> reduce TX drops")
>>>>
>>>> But when repeating the above test with the latest mainline, which includes this
>>>> patch, and enabling a qdisc via
>>>> tc qdisc add dev in-lb root sfq perturb 10
>>>> the kernel crashed just after starting the second TCP stream (see output below).
>>>>
>>>> So I have two questions:
>>>> - Is my understanding of the described issue correct, and is Jesper's patch
>>>> sufficient to solve this?
>>>
>>> Hmm, yeah, this does sound likely.
>>>
>>>> - Is my qdisc configuration to make use of this patch correct, and is the
>>>> kernel crash likely a bug?
>>>>
>>>> ------------[ cut here ]------------
>>>> UBSAN: array-index-out-of-bounds in net/sched/sch_sfq.c:203:12
>>>> index 65535 is out of range for type 'sfq_head [128]'
>>>
>>> This (the 'index 65535') kinda screams "integer underflow". So certainly
>>> looks like a kernel bug, yeah. Don't see any obvious reason why Jesper's
>>> patch would trigger this; maybe Eric has an idea?
>>>
>>> Does this happen with other qdiscs as well, or is it specific to sfq?
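
For that follow-up test, swapping the qdisc should be enough, e.g. (assuming
the same namespace layout as in the reproducer):

ip netns exec lb tc qdisc replace dev in-lb root fq_codel
ip netns exec lb tc qdisc replace dev in-lb root fq
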
>>
>> This seems like a bug in sfq; we already had recent fixes in it, and
>> other fixes in net/sched vs qdisc_tree_reduce_backlog().
>>
>> It is possible qdisc_pkt_len() could be wrong in this use case (TSO off ?)
>
> This seems to be a very old bug, indeed caused by sch->gso_skb
> contribution to sch->q.qlen
>
> diff --git a/net/sched/sch_sfq.c b/net/sched/sch_sfq.c
> index b912ad99aa15d95b297fb28d0fd0baa9c21ab5cd..77fa02f2bfcd56a36815199aa2e7987943ea226f 100644
> --- a/net/sched/sch_sfq.c
> +++ b/net/sched/sch_sfq.c
> @@ -310,7 +310,10 @@ static unsigned int sfq_drop(struct Qdisc *sch, struct sk_buff **to_free)
>  		/* It is difficult to believe, but ALL THE SLOTS HAVE LENGTH 1. */
>  		x = q->tail->next;
>  		slot = &q->slots[x];
> -		q->tail->next = slot->next;
> +		if (slot->next == x)
> +			q->tail = NULL; /* no more active slots */
> +		else
> +			q->tail->next = slot->next;
>  		q->ht[slot->hash] = SFQ_EMPTY_SLOT;
>  		goto drop;
>  	}
>
Hi,
Thank you for looking into it.
I'll give your patch a try and will also test with other qdiscs when I'm back
in the office.
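
Reading the diff, it seems to cover the case where the dropped slot is the
only remaining active one: q->tail is then reset to NULL instead of being left
pointing at the now-empty slot. A rough re-test plan, assuming a kernel
rebuilt with the patch applied:

ip netns exec lb tc qdisc replace dev in-lb root sfq perturb 10
# then repeat the parallel downloads from above and watch for the UBSAN splat
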
Marcus