Message-ID: <9da42688-bfaa-4364-8797-e9271f3bdaef@hetzner-cloud.de>
Date: Wed, 4 Jun 2025 17:33:36 +0200
From: Marcus Wichelmann <marcus.wichelmann@...zner-cloud.de>
To: Jesper Dangaard Brouer <hawk@...nel.org>, bpf@...r.kernel.org,
 netdev@...r.kernel.org
Cc: Alexei Starovoitov <ast@...nel.org>,
 Daniel Borkmann <daniel@...earbox.net>,
 John Fastabend <john.fastabend@...il.com>,
 Andrew Lunn <andrew+netdev@...n.ch>, "David S. Miller"
 <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>,
 Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
 Jamal Hadi Salim <jhs@...atatu.com>, Cong Wang <xiyou.wangcong@...il.com>,
 Jiri Pirko <jiri@...nulli.us>, linux-kernel@...r.kernel.org
Subject: [BUG] veth: TX drops with NAPI enabled and crash in combination with
 qdisc

Hi,

while experimenting with XDP_REDIRECT from a veth-pair to another interface, I
noticed that the veth-pair loses lots of packets when multiple TCP streams go
through it, resulting in stalled TCP connections and noticeable instability.

This doesn't seem to be an issue with XDP itself, but rather one that occurs
whenever the NAPI mode of the veth driver is active.
I managed to reproduce the same behavior just by bringing the veth-pair into
NAPI mode (see commit d3256efd8e8b ("veth: allow enabling NAPI even without
XDP")) and running multiple TCP streams through it using a network namespace.

Here is how I reproduced it:

  ip netns add lb
  ip link add dev to-lb type veth peer name in-lb netns lb

  # Enable NAPI
  ethtool -K to-lb gro on
  ethtool -K to-lb tso off
  ip netns exec lb ethtool -K in-lb gro on
  ip netns exec lb ethtool -K in-lb tso off

  ip link set dev to-lb up
  ip -netns lb link set dev in-lb up
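
As a sanity check (not required for the reproduction), the offload settings
can be verified like this; the grep pattern is just illustrative and matches
the feature names ethtool -k prints here:

  ethtool -k to-lb | grep -E 'generic-receive|tcp-segmentation'
  ip netns exec lb ethtool -k in-lb | grep -E 'generic-receive|tcp-segmentation'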

Then run an HTTP server inside the "lb" namespace that serves a large file:

  fallocate -l 10G testfiles/10GB.bin
  caddy file-server --root testfiles/

Download this file from within the root namespace multiple times in parallel:

  curl http://[fe80::...%to-lb]/10GB.bin -o /dev/null

In my tests, I ran four curls in parallel; after just a few seconds, three of
them stalled while the remaining one "won" the full bandwidth and completed
the download.
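
For completeness, the parallel downloads can be started along these lines
(using the same elided link-local address as in the curl command above):

  for i in 1 2 3 4; do
    curl "http://[fe80::...%to-lb]/10GB.bin" -o /dev/null &
  done
  wait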

This is probably a result of the veth's ptr_ring running full, causing many
packet drops on TX, and the TCP congestion control reacting to that.
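
If that understanding is right, the drops should show up in the TX counters
of the transmitting peer inside the namespace, e.g.:

  ip -netns lb -s link show dev in-lb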

In this context, I also came across Jesper's patch, which describes a very
similar issue and should help to resolve this:
  commit dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to
  reduce TX drops")

But when repeating the above test with the latest mainline, which includes
this patch, and enabling a qdisc via
  tc qdisc add dev in-lb root sfq perturb 10
the kernel crashed just after starting the second TCP stream (see output
below).

So I have two questions:
- Is my understanding of the described issue correct, and is Jesper's patch
  sufficient to solve it?
- Is my qdisc configuration correct for making use of this patch, and is the
  kernel crash therefore likely a bug?

------------[ cut here ]------------
UBSAN: array-index-out-of-bounds in net/sched/sch_sfq.c:203:12
index 65535 is out of range for type 'sfq_head [128]'
CPU: 1 UID: 0 PID: 24 Comm: ksoftirqd/1 Not tainted 6.15.0+ #1 PREEMPT(voluntary) 
Hardware name: GIGABYTE MP32-AR1-SW-HZ-001/MP32-AR1-00, BIOS F31n (SCP: 2.10.20220810) 09/30/2022
Call trace:
 show_stack+0x24/0x50 (C)
 dump_stack_lvl+0x80/0x140
 dump_stack+0x1c/0x38
 __ubsan_handle_out_of_bounds+0xd0/0x128
 sfq_dequeue+0x37c/0x3e0 [sch_sfq]
 __qdisc_run+0x90/0x760
 net_tx_action+0x1b8/0x3b0
 handle_softirqs+0x13c/0x418
 run_ksoftirqd+0x9c/0xe8
 smpboot_thread_fn+0x1c0/0x2e0
 kthread+0x150/0x230
 ret_from_fork+0x10/0x20
---[ end trace ]---
------------[ cut here ]------------
UBSAN: array-index-out-of-bounds in net/sched/sch_sfq.c:208:8
index 65535 is out of range for type 'sfq_head [128]'
CPU: 1 UID: 0 PID: 24 Comm: ksoftirqd/1 Not tainted 6.15.0+ #1 PREEMPT(voluntary) 
Hardware name: GIGABYTE MP32-AR1-SW-HZ-001/MP32-AR1-00, BIOS F31n (SCP: 2.10.20220810) 09/30/2022
Call trace:
 show_stack+0x24/0x50 (C)
 dump_stack_lvl+0x80/0x140
 dump_stack+0x1c/0x38
 __ubsan_handle_out_of_bounds+0xd0/0x128
 sfq_dequeue+0x394/0x3e0 [sch_sfq]
 __qdisc_run+0x90/0x760
 net_tx_action+0x1b8/0x3b0
 handle_softirqs+0x13c/0x418
 run_ksoftirqd+0x9c/0xe8
 smpboot_thread_fn+0x1c0/0x2e0
 kthread+0x150/0x230
 ret_from_fork+0x10/0x20
---[ end trace ]---
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000005
Mem abort info:
  ESR = 0x0000000096000004
  EC = 0x25: DABT (current EL), IL = 32 bits
  SET = 0, FnV = 0
  EA = 0, S1PTW = 0
  FSC = 0x04: level 0 translation fault
Data abort info:
  ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
  CM = 0, WnR = 0, TnD = 0, TagAccess = 0
  GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
user pgtable: 4k pages, 48-bit VAs, pgdp=000008002ad67000
[0000000000000005] pgd=0000000000000000, p4d=0000000000000000
Internal error: Oops: 0000000096000004 [#1]  SMP

CPU: Ampere(R) Altra(R) Processor Q80-30 CPU @ 3.0GHz

# tc qdisc
qdisc sfq 8001: dev in-lb root refcnt 81 limit 127p quantum 1514b depth 127 divisor 1024 perturb 10sec 

Thanks,
Marcus
