Message-ID: <CAAVpQUCp3W6J29LV0xkvYBvUV-Uys_RVt843D2TU6jqiF5f3zg@mail.gmail.com>
Date: Tue, 14 Oct 2025 23:20:21 -0700
From: Kuniyuki Iwashima <kuniyu@...gle.com>
To: Eric Dumazet <edumazet@...gle.com>
Cc: "David S . Miller" <davem@...emloft.net>, Jakub Kicinski <kuba@...nel.org>,
Paolo Abeni <pabeni@...hat.com>, Simon Horman <horms@...nel.org>,
Jamal Hadi Salim <jhs@...atatu.com>, Cong Wang <xiyou.wangcong@...il.com>,
Jiri Pirko <jiri@...nulli.us>, Willem de Bruijn <willemb@...gle.com>, netdev@...r.kernel.org,
eric.dumazet@...il.com, Toke Høiland-Jørgensen <toke@...hat.com>
Subject: Re: [PATCH v2 net-next 6/6] net: dev_queue_xmit() llist adoption
On Tue, Oct 14, 2025 at 10:19 AM Eric Dumazet <edumazet@...gle.com> wrote:
>
> Remove the busylock spinlock and use a lockless list (llist)
> to reduce spinlock contention to the minimum.
>
> The idea is that only one cpu might spin on the qdisc spinlock,
> while the others simply add their skb to the llist.
>
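Just to make the hand-off concrete for the archives: below is a rough
user-space sketch of the pattern described above, using C11 atomics and
a pthread mutex in place of the kernel llist helpers and the qdisc
spinlock. All names here (fake_skb, fake_qdisc, queue_xmit_sketch, ...)
are made up for illustration and are not from the patch itself.

	#include <pthread.h>
	#include <stdatomic.h>
	#include <stdbool.h>
	#include <stddef.h>
	#include <stdio.h>

	struct fake_skb {
		int id;
		struct fake_skb *next;          /* llist-style link */
	};

	struct fake_qdisc {
		_Atomic(struct fake_skb *) defer_list;  /* lockless list head */
		pthread_mutex_t lock;                   /* stands in for the qdisc spinlock */
	};

	/* Push one node; return true if the list was empty (caller becomes owner). */
	static bool llist_add_sketch(struct fake_skb *skb, struct fake_qdisc *q)
	{
		struct fake_skb *first = atomic_load(&q->defer_list);

		do {
			skb->next = first;
		} while (!atomic_compare_exchange_weak(&q->defer_list, &first, skb));

		return first == NULL;
	}

	static void xmit_one(struct fake_skb *skb)
	{
		printf("enqueue skb %d\n", skb->id);
	}

	/* Rough analogue of the contended slow path. */
	static void queue_xmit_sketch(struct fake_skb *skb, struct fake_qdisc *q)
	{
		struct fake_skb *batch, *next;

		if (!llist_add_sketch(skb, q))
			return;         /* another cpu owns the list and will flush our skb */

		pthread_mutex_lock(&q->lock);
		/* Grab everything queued so far (llist_del_all() equivalent). */
		batch = atomic_exchange(&q->defer_list, (struct fake_skb *)NULL);
		/* llist pushes at the head (LIFO); the real code has to restore FIFO order. */
		while (batch) {
			next = batch->next;
			xmit_one(batch);
			batch = next;
		}
		pthread_mutex_unlock(&q->lock);
	}

	static struct fake_qdisc q = { .lock = PTHREAD_MUTEX_INITIALIZER };

	int main(void)
	{
		struct fake_skb a = { .id = 1 }, b = { .id = 2 };

		queue_xmit_sketch(&a, &q);
		queue_xmit_sketch(&b, &q);
		return 0;
	}

The real __dev_queue_xmit() path obviously handles much more (qdisc
state, requeues, ordering), but the ownership hand-off above is the part
that replaces the busylock.
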
> After this patch, we get a 300 % improvement on heavy TX workloads:
> - sending twice the number of packets per second,
> - while consuming 50 % fewer cycles.
>
> Note that this also opens the door to submitting batches
> to the various qdisc->enqueue() methods in the future.
>
> Tested:
>
> - Dual Intel(R) Xeon(R) 6985P-C (480 hyper threads).
> - 100Gbit NIC, 30 TX queues with FQ packet scheduler.
> - echo 64 >/sys/kernel/slab/skbuff_small_head/cpu_partial (avoid contention in mm)
> - 240 concurrent "netperf -t UDP_STREAM -- -m 120 -n"
>
> Before:
>
> 16 Mpps (41 Mpps if each thread is pinned to a different cpu)
>
> vmstat 2 5
> procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
> r b swpd free buff cache si so bi bo in cs us sy id wa st
> 243 0 0 2368988672 51036 1100852 0 0 146 1 242 60 0 9 91 0 0
> 244 0 0 2368988672 51036 1100852 0 0 536 10 487745 14718 0 52 48 0 0
> 244 0 0 2368988672 51036 1100852 0 0 512 0 503067 46033 0 52 48 0 0
> 244 0 0 2368988672 51036 1100852 0 0 512 0 494807 12107 0 52 48 0 0
> 244 0 0 2368988672 51036 1100852 0 0 702 26 492845 10110 0 52 48 0 0
>
> Lock contention (1 second sample taken on 8 cores)
> perf lock record -C0-7 sleep 1; perf lock contention
> contended total wait max wait avg wait type caller
>
> 442111 6.79 s 162.47 ms 15.35 us spinlock dev_hard_start_xmit+0xcd
> 5961 9.57 ms 8.12 us 1.60 us spinlock __dev_queue_xmit+0x3a0
> 244 560.63 us 7.63 us 2.30 us spinlock do_softirq+0x5b
> 13 25.09 us 3.21 us 1.93 us spinlock net_tx_action+0xf8
>
> If netperf threads are pinned, spinlock stress is very high.
> perf lock record -C0-7 sleep 1; perf lock contention
> contended total wait max wait avg wait type caller
>
> 964508 7.10 s 147.25 ms 7.36 us spinlock dev_hard_start_xmit+0xcd
> 201 268.05 us 4.65 us 1.33 us spinlock __dev_queue_xmit+0x3a0
> 12 26.05 us 3.84 us 2.17 us spinlock do_softirq+0x5b
>
> @__dev_queue_xmit_ns:
> [256, 512) 21 | |
> [512, 1K) 631 | |
> [1K, 2K) 27328 |@ |
> [2K, 4K) 265392 |@@@@@@@@@@@@@@@@ |
> [4K, 8K) 417543 |@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> [8K, 16K) 826292 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [16K, 32K) 733822 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> [32K, 64K) 19055 |@ |
> [64K, 128K) 17240 |@ |
> [128K, 256K) 25633 |@ |
> [256K, 512K) 4 | |
>
> After:
>
> 29 Mpps (57 Mpps if each thread is pinned to a different cpu)
>
> vmstat 2 5
> procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
> r b swpd free buff cache si so bi bo in cs us sy id wa st
> 78 0 0 2369573632 32896 1350988 0 0 22 0 331 254 0 8 92 0 0
> 75 0 0 2369573632 32896 1350988 0 0 22 50 425713 280199 0 23 76 0 0
> 104 0 0 2369573632 32896 1350988 0 0 290 0 430238 298247 0 23 76 0 0
> 86 0 0 2369573632 32896 1350988 0 0 132 0 428019 291865 0 24 76 0 0
> 90 0 0 2369573632 32896 1350988 0 0 502 0 422498 278672 0 23 76 0 0
>
> perf lock record -C0-7 sleep 1; perf lock contention
> contended total wait max wait avg wait type caller
>
> 2524 116.15 ms 486.61 us 46.02 us spinlock __dev_queue_xmit+0x55b
> 5821 107.18 ms 371.67 us 18.41 us spinlock dev_hard_start_xmit+0xcd
> 2377 9.73 ms 35.86 us 4.09 us spinlock ___slab_alloc+0x4e0
> 923 5.74 ms 20.91 us 6.22 us spinlock ___slab_alloc+0x5c9
> 121 3.42 ms 193.05 us 28.24 us spinlock net_tx_action+0xf8
> 6 564.33 us 167.60 us 94.05 us spinlock do_softirq+0x5b
>
> If netperf threads are pinned (~54 Mpps)
> perf lock record -C0-7 sleep 1; perf lock contention
> 32907 316.98 ms 195.98 us 9.63 us spinlock dev_hard_start_xmit+0xcd
> 4507 61.83 ms 212.73 us 13.72 us spinlock __dev_queue_xmit+0x554
> 2781 23.53 ms 40.03 us 8.46 us spinlock ___slab_alloc+0x5c9
> 3554 18.94 ms 34.69 us 5.33 us spinlock ___slab_alloc+0x4e0
> 233 9.09 ms 215.70 us 38.99 us spinlock do_softirq+0x5b
> 153 930.66 us 48.67 us 6.08 us spinlock net_tx_action+0xfd
> 84 331.10 us 14.22 us 3.94 us spinlock ___slab_alloc+0x5c9
> 140 323.71 us 9.94 us 2.31 us spinlock ___slab_alloc+0x4e0
>
> @__dev_queue_xmit_ns:
> [128, 256) 1539830 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> [256, 512) 2299558 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [512, 1K) 483936 |@@@@@@@@@@ |
> [1K, 2K) 265345 |@@@@@@ |
> [2K, 4K) 145463 |@@@ |
> [4K, 8K) 54571 |@ |
> [8K, 16K) 10270 | |
> [16K, 32K) 9385 | |
> [32K, 64K) 7749 | |
> [64K, 128K) 26799 | |
> [128K, 256K) 2665 | |
> [256K, 512K) 665 | |
>
> Signed-off-by: Eric Dumazet <edumazet@...gle.com>
> Reviewed-by: Toke Høiland-Jørgensen <toke@...hat.com>
Thanks for the big improvement!
Reviewed-by: Kuniyuki Iwashima <kuniyu@...gle.com>