Message-ID: <CAAVpQUCp3W6J29LV0xkvYBvUV-Uys_RVt843D2TU6jqiF5f3zg@mail.gmail.com>
Date: Tue, 14 Oct 2025 23:20:21 -0700
From: Kuniyuki Iwashima <kuniyu@...gle.com>
To: Eric Dumazet <edumazet@...gle.com>
Cc: "David S . Miller" <davem@...emloft.net>, Jakub Kicinski <kuba@...nel.org>,
Paolo Abeni <pabeni@...hat.com>, Simon Horman <horms@...nel.org>,
Jamal Hadi Salim <jhs@...atatu.com>, Cong Wang <xiyou.wangcong@...il.com>,
Jiri Pirko <jiri@...nulli.us>, Willem de Bruijn <willemb@...gle.com>, netdev@...r.kernel.org,
eric.dumazet@...il.com, Toke Høiland-Jørgensen <toke@...hat.com>
Subject: Re: [PATCH v2 net-next 6/6] net: dev_queue_xmit() llist adoption
On Tue, Oct 14, 2025 at 10:19 AM Eric Dumazet <edumazet@...gle.com> wrote:
>
> Remove the busylock spinlock and use a lockless list (llist)
> to reduce spinlock contention to the minimum.
>
> The idea is that only one cpu might spin on the qdisc spinlock,
> while the others simply add their skb to the llist.
>
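Just to make the hand-off concrete for the archives: below is a rough
user-space sketch of the pattern described above, using C11 atomics and
a pthread mutex in place of the kernel llist helpers and the qdisc
spinlock. All names here (fake_skb, fake_qdisc, queue_xmit_sketch, ...)
are made up for illustration and are not from the patch itself.

	#include <pthread.h>
	#include <stdatomic.h>
	#include <stdbool.h>
	#include <stddef.h>
	#include <stdio.h>

	struct fake_skb {
		int id;
		struct fake_skb *next;          /* llist-style link */
	};

	struct fake_qdisc {
		_Atomic(struct fake_skb *) defer_list;  /* lockless list head */
		pthread_mutex_t lock;                   /* stands in for the qdisc spinlock */
	};

	/* Push one node; return true if the list was empty (caller becomes owner). */
	static bool llist_add_sketch(struct fake_skb *skb, struct fake_qdisc *q)
	{
		struct fake_skb *first = atomic_load(&q->defer_list);

		do {
			skb->next = first;
		} while (!atomic_compare_exchange_weak(&q->defer_list, &first, skb));

		return first == NULL;
	}

	static void xmit_one(struct fake_skb *skb)
	{
		printf("enqueue skb %d\n", skb->id);
	}

	/* Rough analogue of the contended slow path. */
	static void queue_xmit_sketch(struct fake_skb *skb, struct fake_qdisc *q)
	{
		struct fake_skb *batch, *next;

		if (!llist_add_sketch(skb, q))
			return;         /* another cpu owns the list and will flush our skb */

		pthread_mutex_lock(&q->lock);
		/* Grab everything queued so far (llist_del_all() equivalent). */
		batch = atomic_exchange(&q->defer_list, (struct fake_skb *)NULL);
		/* llist pushes at the head (LIFO); the real code has to restore FIFO order. */
		while (batch) {
			next = batch->next;
			xmit_one(batch);
			batch = next;
		}
		pthread_mutex_unlock(&q->lock);
	}

	static struct fake_qdisc q = { .lock = PTHREAD_MUTEX_INITIALIZER };

	int main(void)
	{
		struct fake_skb a = { .id = 1 }, b = { .id = 2 };

		queue_xmit_sketch(&a, &q);
		queue_xmit_sketch(&b, &q);
		return 0;
	}

The real __dev_queue_xmit() path obviously handles much more (qdisc
state, requeues, ordering), but the ownership hand-off above is the part
that replaces the busylock.
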
> After this patch, we get a 300 % improvement on heavy TX workloads:
> - sending twice the number of packets per second,
> - while consuming 50 % fewer cycles.
>
> Note that this also opens the door to submitting batches
> to the various qdisc->enqueue() methods in the future.
>
> Tested:
>
> - Dual Intel(R) Xeon(R) 6985P-C (480 hyper threads).
> - 100Gbit NIC, 30 TX queues with FQ packet scheduler.
> - echo 64 >/sys/kernel/slab/skbuff_small_head/cpu_partial (avoid contention in mm)
> - 240 concurrent "netperf -t UDP_STREAM -- -m 120 -n"
>
> Before:
>
> 16 Mpps (41 Mpps if each thread is pinned to a different cpu)
>
> vmstat 2 5
> procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
> r b swpd free buff cache si so bi bo in cs us sy id wa st
> 243 0 0 2368988672 51036 1100852 0 0 146 1 242 60 0 9 91 0 0
> 244 0 0 2368988672 51036 1100852 0 0 536 10 487745 14718 0 52 48 0 0
> 244 0 0 2368988672 51036 1100852 0 0 512 0 503067 46033 0 52 48 0 0
> 244 0 0 2368988672 51036 1100852 0 0 512 0 494807 12107 0 52 48 0 0
> 244 0 0 2368988672 51036 1100852 0 0 702 26 492845 10110 0 52 48 0 0
>
> Lock contention (1 second sample taken on 8 cores)
> perf lock record -C0-7 sleep 1; perf lock contention
> contended total wait max wait avg wait type caller
>
> 442111 6.79 s 162.47 ms 15.35 us spinlock dev_hard_start_xmit+0xcd
> 5961 9.57 ms 8.12 us 1.60 us spinlock __dev_queue_xmit+0x3a0
> 244 560.63 us 7.63 us 2.30 us spinlock do_softirq+0x5b
> 13 25.09 us 3.21 us 1.93 us spinlock net_tx_action+0xf8
>
> If netperf threads are pinned, spinlock stress is very high.
> perf lock record -C0-7 sleep 1; perf lock contention
> contended total wait max wait avg wait type caller
>
> 964508 7.10 s 147.25 ms 7.36 us spinlock dev_hard_start_xmit+0xcd
> 201 268.05 us 4.65 us 1.33 us spinlock __dev_queue_xmit+0x3a0
> 12 26.05 us 3.84 us 2.17 us spinlock do_softirq+0x5b
>
> @__dev_queue_xmit_ns:
> [256, 512) 21 | |
> [512, 1K) 631 | |
> [1K, 2K) 27328 |@ |
> [2K, 4K) 265392 |@@@@@@@@@@@@@@@@ |
> [4K, 8K) 417543 |@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> [8K, 16K) 826292 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [16K, 32K) 733822 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> [32K, 64K) 19055 |@ |
> [64K, 128K) 17240 |@ |
> [128K, 256K) 25633 |@ |
> [256K, 512K) 4 | |
>
> After:
>
> 29 Mpps (57 Mpps if each thread is pinned to a different cpu)
>
> vmstat 2 5
> procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
> r b swpd free buff cache si so bi bo in cs us sy id wa st
> 78 0 0 2369573632 32896 1350988 0 0 22 0 331 254 0 8 92 0 0
> 75 0 0 2369573632 32896 1350988 0 0 22 50 425713 280199 0 23 76 0 0
> 104 0 0 2369573632 32896 1350988 0 0 290 0 430238 298247 0 23 76 0 0
> 86 0 0 2369573632 32896 1350988 0 0 132 0 428019 291865 0 24 76 0 0
> 90 0 0 2369573632 32896 1350988 0 0 502 0 422498 278672 0 23 76 0 0
>
> perf lock record -C0-7 sleep 1; perf lock contention
> contended total wait max wait avg wait type caller
>
> 2524 116.15 ms 486.61 us 46.02 us spinlock __dev_queue_xmit+0x55b
> 5821 107.18 ms 371.67 us 18.41 us spinlock dev_hard_start_xmit+0xcd
> 2377 9.73 ms 35.86 us 4.09 us spinlock ___slab_alloc+0x4e0
> 923 5.74 ms 20.91 us 6.22 us spinlock ___slab_alloc+0x5c9
> 121 3.42 ms 193.05 us 28.24 us spinlock net_tx_action+0xf8
> 6 564.33 us 167.60 us 94.05 us spinlock do_softirq+0x5b
>
> If netperf threads are pinned (~54 Mpps)
> perf lock record -C0-7 sleep 1; perf lock contention
> 32907 316.98 ms 195.98 us 9.63 us spinlock dev_hard_start_xmit+0xcd
> 4507 61.83 ms 212.73 us 13.72 us spinlock __dev_queue_xmit+0x554
> 2781 23.53 ms 40.03 us 8.46 us spinlock ___slab_alloc+0x5c9
> 3554 18.94 ms 34.69 us 5.33 us spinlock ___slab_alloc+0x4e0
> 233 9.09 ms 215.70 us 38.99 us spinlock do_softirq+0x5b
> 153 930.66 us 48.67 us 6.08 us spinlock net_tx_action+0xfd
> 84 331.10 us 14.22 us 3.94 us spinlock ___slab_alloc+0x5c9
> 140 323.71 us 9.94 us 2.31 us spinlock ___slab_alloc+0x4e0
>
> @__dev_queue_xmit_ns:
> [128, 256) 1539830 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> [256, 512) 2299558 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [512, 1K) 483936 |@@@@@@@@@@ |
> [1K, 2K) 265345 |@@@@@@ |
> [2K, 4K) 145463 |@@@ |
> [4K, 8K) 54571 |@ |
> [8K, 16K) 10270 | |
> [16K, 32K) 9385 | |
> [32K, 64K) 7749 | |
> [64K, 128K) 26799 | |
> [128K, 256K) 2665 | |
> [256K, 512K) 665 | |
>
> Signed-off-by: Eric Dumazet <edumazet@...gle.com>
> Reviewed-by: Toke Høiland-Jørgensen <toke@...hat.com>
Thanks for the big improvement!
Reviewed-by: Kuniyuki Iwashima <kuniyu@...gle.com>