Message-ID: <87ms61k2y9.fsf@toke.dk>
Date: Wed, 08 Oct 2025 14:11:58 +0200
From: Toke Høiland-Jørgensen <toke@...hat.com>
To: Eric Dumazet <edumazet@...gle.com>
Cc: "David S . Miller" <davem@...emloft.net>, Jakub Kicinski
<kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>, Simon Horman
<horms@...nel.org>, Jamal Hadi Salim <jhs@...atatu.com>, Cong Wang
<xiyou.wangcong@...il.com>, Jiri Pirko <jiri@...nulli.us>, Kuniyuki
Iwashima <kuniyu@...gle.com>, Willem de Bruijn <willemb@...gle.com>,
netdev@...r.kernel.org, eric.dumazet@...il.com
Subject: Re: [PATCH RFC net-next 5/5] net: dev_queue_xmit() llist adoption

Eric Dumazet <edumazet@...gle.com> writes:

> On Wed, Oct 8, 2025 at 3:05 AM Toke Høiland-Jørgensen <toke@...hat.com> wrote:
>>
>> Eric Dumazet <edumazet@...gle.com> writes:
>>
>> > On Wed, Oct 8, 2025 at 1:48 AM Toke Høiland-Jørgensen <toke@...hat.com> wrote:
>> >>
>> >> Eric Dumazet <edumazet@...gle.com> writes:
>> >>
>> >> > Remove busylock spinlock and use a lockless list (llist)
>> >> > to reduce spinlock contention to the minimum.
>> >> >
>> >> > The idea is that only one cpu might spin on the qdisc spinlock,
>> >> > while others simply add their skb to the llist.
>> >> >
>> >> > After this patch, we get a 300% improvement on heavy TX workloads:
>> >> > - Sending twice the number of packets per second.
>> >> > - While consuming 50% fewer cycles.
>> >> >
>> >> > Tested:
>> >> >
>> >> > - Dual Intel(R) Xeon(R) 6985P-C (480 hyper threads).
>> >> > - 100Gbit NIC, 30 TX queues with FQ packet scheduler.
>> >> > - echo 64 >/sys/kernel/slab/skbuff_small_head/cpu_partial (avoid contention in mm)
>> >> > - 240 concurrent "netperf -t UDP_STREAM -- -m 120 -n"
>> >> >
>> >> > Before:
>> >> >
>> >> > 16 Mpps (41 Mpps if each thread is pinned to a different cpu)
>> >> >
>> >> > vmstat 2 5
>> >> > procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
>> >> > r b swpd free buff cache si so bi bo in cs us sy id wa st
>> >> > 243 0 0 2368988672 51036 1100852 0 0 146 1 242 60 0 9 91 0 0
>> >> > 244 0 0 2368988672 51036 1100852 0 0 536 10 487745 14718 0 52 48 0 0
>> >> > 244 0 0 2368988672 51036 1100852 0 0 512 0 503067 46033 0 52 48 0 0
>> >> > 244 0 0 2368988672 51036 1100852 0 0 512 0 494807 12107 0 52 48 0 0
>> >> > 244 0 0 2368988672 51036 1100852 0 0 702 26 492845 10110 0 52 48 0 0
>> >> >
>> >> > Lock contention (1 second sample taken on 8 cores)
>> >> > perf lock record -C0-7 sleep 1; perf lock contention
>> >> > contended total wait max wait avg wait type caller
>> >> >
>> >> > 442111 6.79 s 162.47 ms 15.35 us spinlock dev_hard_start_xmit+0xcd
>> >> > 5961 9.57 ms 8.12 us 1.60 us spinlock __dev_queue_xmit+0x3a0
>> >> > 244 560.63 us 7.63 us 2.30 us spinlock do_softirq+0x5b
>> >> > 13 25.09 us 3.21 us 1.93 us spinlock net_tx_action+0xf8
>> >> >
>> >> > If netperf threads are pinned, spinlock stress is very high.
>> >> > perf lock record -C0-7 sleep 1; perf lock contention
>> >> > contended total wait max wait avg wait type caller
>> >> >
>> >> > 964508 7.10 s 147.25 ms 7.36 us spinlock dev_hard_start_xmit+0xcd
>> >> > 201 268.05 us 4.65 us 1.33 us spinlock __dev_queue_xmit+0x3a0
>> >> > 12 26.05 us 3.84 us 2.17 us spinlock do_softirq+0x5b
>> >> >
>> >> > @__dev_queue_xmit_ns:
>> >> > [256, 512) 21 | |
>> >> > [512, 1K) 631 | |
>> >> > [1K, 2K) 27328 |@ |
>> >> > [2K, 4K) 265392 |@@@@@@@@@@@@@@@@ |
>> >> > [4K, 8K) 417543 |@@@@@@@@@@@@@@@@@@@@@@@@@@ |
>> >> > [8K, 16K) 826292 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>> >> > [16K, 32K) 733822 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
>> >> > [32K, 64K) 19055 |@ |
>> >> > [64K, 128K) 17240 |@ |
>> >> > [128K, 256K) 25633 |@ |
>> >> > [256K, 512K) 4 | |
>> >> >
>> >> > After:
>> >> >
>> >> > 29 Mpps (57 Mpps if each thread is pinned to a different cpu)
>> >> >
>> >> > vmstat 2 5
>> >> > procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
>> >> > r b swpd free buff cache si so bi bo in cs us sy id wa st
>> >> > 78 0 0 2369573632 32896 1350988 0 0 22 0 331 254 0 8 92 0 0
>> >> > 75 0 0 2369573632 32896 1350988 0 0 22 50 425713 280199 0 23 76 0 0
>> >> > 104 0 0 2369573632 32896 1350988 0 0 290 0 430238 298247 0 23 76 0 0
>> >> > 86 0 0 2369573632 32896 1350988 0 0 132 0 428019 291865 0 24 76 0 0
>> >> > 90 0 0 2369573632 32896 1350988 0 0 502 0 422498 278672 0 23 76 0 0
>> >> >
>> >> > perf lock record -C0-7 sleep 1; perf lock contention
>> >> > contended total wait max wait avg wait type caller
>> >> >
>> >> > 2524 116.15 ms 486.61 us 46.02 us spinlock __dev_queue_xmit+0x55b
>> >> > 5821 107.18 ms 371.67 us 18.41 us spinlock dev_hard_start_xmit+0xcd
>> >> > 2377 9.73 ms 35.86 us 4.09 us spinlock ___slab_alloc+0x4e0
>> >> > 923 5.74 ms 20.91 us 6.22 us spinlock ___slab_alloc+0x5c9
>> >> > 121 3.42 ms 193.05 us 28.24 us spinlock net_tx_action+0xf8
>> >> > 6 564.33 us 167.60 us 94.05 us spinlock do_softirq+0x5b
>> >> >
>> >> > If netperf threads are pinned (~54 Mpps)
>> >> > perf lock record -C0-7 sleep 1; perf lock contention
>> >> > 32907 316.98 ms 195.98 us 9.63 us spinlock dev_hard_start_xmit+0xcd
>> >> > 4507 61.83 ms 212.73 us 13.72 us spinlock __dev_queue_xmit+0x554
>> >> > 2781 23.53 ms 40.03 us 8.46 us spinlock ___slab_alloc+0x5c9
>> >> > 3554 18.94 ms 34.69 us 5.33 us spinlock ___slab_alloc+0x4e0
>> >> > 233 9.09 ms 215.70 us 38.99 us spinlock do_softirq+0x5b
>> >> > 153 930.66 us 48.67 us 6.08 us spinlock net_tx_action+0xfd
>> >> > 84 331.10 us 14.22 us 3.94 us spinlock ___slab_alloc+0x5c9
>> >> > 140 323.71 us 9.94 us 2.31 us spinlock ___slab_alloc+0x4e0
>> >> >
>> >> > @__dev_queue_xmit_ns:
>> >> > [128, 256) 1539830 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
>> >> > [256, 512) 2299558 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>> >> > [512, 1K) 483936 |@@@@@@@@@@ |
>> >> > [1K, 2K) 265345 |@@@@@@ |
>> >> > [2K, 4K) 145463 |@@@ |
>> >> > [4K, 8K) 54571 |@ |
>> >> > [8K, 16K) 10270 | |
>> >> > [16K, 32K) 9385 | |
>> >> > [32K, 64K) 7749 | |
>> >> > [64K, 128K) 26799 | |
>> >> > [128K, 256K) 2665 | |
>> >> > [256K, 512K) 665 | |
>> >> >
>> >> > Signed-off-by: Eric Dumazet <edumazet@...gle.com>
>> >>
>> >> This is very cool! One question below, just to make sure I understand
>> >> this correctly:
>> >>
>> >> > ---
>> >> > include/net/sch_generic.h | 4 +-
>> >> > net/core/dev.c | 85 ++++++++++++++++++++++-----------------
>> >> > net/sched/sch_generic.c | 5 ---
>> >> > 3 files changed, 52 insertions(+), 42 deletions(-)
>> >> >
>> >> > diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
>> >> > index 31561291bc92fd70d4d3ca8f5f7dbc4c94c895a0..94966692ccdf51db085c236319705aecba8c30cf 100644
>> >> > --- a/include/net/sch_generic.h
>> >> > +++ b/include/net/sch_generic.h
>> >> > @@ -115,7 +115,9 @@ struct Qdisc {
>> >> > struct Qdisc *next_sched;
>> >> > struct sk_buff_head skb_bad_txq;
>> >> >
>> >> > - spinlock_t busylock ____cacheline_aligned_in_smp;
>> >> > + atomic_long_t defer_count ____cacheline_aligned_in_smp;
>> >> > + struct llist_head defer_list;
>> >> > +
>> >> > spinlock_t seqlock;
>> >> >
>> >> > struct rcu_head rcu;
>> >> > diff --git a/net/core/dev.c b/net/core/dev.c
>> >> > index 0ff178399b2d28ca2754b3f06d69a97f5d6dcf71..6094768bf3c028f0ad1e52b9b12b7258fa0ecff6 100644
>> >> > --- a/net/core/dev.c
>> >> > +++ b/net/core/dev.c
>> >> > @@ -4125,9 +4125,10 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>> >> > struct net_device *dev,
>> >> > struct netdev_queue *txq)
>> >> > {
>> >> > + struct sk_buff *next, *to_free = NULL;
>> >> > spinlock_t *root_lock = qdisc_lock(q);
>> >> > - struct sk_buff *to_free = NULL;
>> >> > - bool contended;
>> >> > + struct llist_node *ll_list, *first_n;
>> >> > + unsigned long defer_count = 0;
>> >> > int rc;
>> >> >
>> >> > qdisc_calculate_pkt_len(skb, q);
>> >> > @@ -4167,61 +4168,73 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>> >> > return rc;
>> >> > }
>> >> >
>> >> > - /*
>> >> > - * Heuristic to force contended enqueues to serialize on a
>> >> > - * separate lock before trying to get qdisc main lock.
>> >> > - * This permits qdisc->running owner to get the lock more
>> >> > - * often and dequeue packets faster.
>> >> > - * On PREEMPT_RT it is possible to preempt the qdisc owner during xmit
>> >> > - * and then other tasks will only enqueue packets. The packets will be
>> >> > - * sent after the qdisc owner is scheduled again. To prevent this
>> >> > - * scenario the task always serialize on the lock.
>> >> > + /* Open code llist_add(&skb->ll_node, &q->defer_list) + queue limit.
>> >> > + * In the try_cmpxchg() loop, we want to increment q->defer_count
>> >> > + * at most once to limit the number of skbs in defer_list.
>> >> > + * We perform the defer_count increment only if the list is not empty,
>> >> > + * because some arches have slow atomic_long_inc_return().
>> >> > + */
>> >> > + first_n = READ_ONCE(q->defer_list.first);
>> >> > + do {
>> >> > + if (first_n && !defer_count) {
>> >> > + defer_count = atomic_long_inc_return(&q->defer_count);
>> >> > + if (unlikely(defer_count > q->limit)) {
>> >> > + kfree_skb_reason(skb, SKB_DROP_REASON_QDISC_DROP);
>> >> > + return NET_XMIT_DROP;
>> >> > + }
>> >> > + }
>> >> > + skb->ll_node.next = first_n;
>> >> > + } while (!try_cmpxchg(&q->defer_list.first, &first_n, &skb->ll_node));
>> >> > +
>> >> > + /* If defer_list was not empty, we know the cpu which queued
>> >> > + * the first skb will process the whole list for us.
>> >> > */
>> >> > - contended = qdisc_is_running(q) || IS_ENABLED(CONFIG_PREEMPT_RT);
>> >> > - if (unlikely(contended))
>> >> > - spin_lock(&q->busylock);
>> >> > + if (first_n)
>> >> > + return NET_XMIT_SUCCESS;
>> >> >
>> >> > spin_lock(root_lock);
>> >> > +
>> >> > + ll_list = llist_del_all(&q->defer_list);
>> >> > + /* There is a small race because we clear defer_count not atomically
>> >> > + * with the prior llist_del_all(). This means defer_list could grow
>> >> > + * over q->limit.
>> >> > + */
>> >> > + atomic_long_set(&q->defer_count, 0);
>> >> > +
>> >> > + ll_list = llist_reverse_order(ll_list);
>> >> > +
>> >> > if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
>> >> > - __qdisc_drop(skb, &to_free);
>> >> > + llist_for_each_entry_safe(skb, next, ll_list, ll_node)
>> >> > + __qdisc_drop(skb, &to_free);
>> >> > rc = NET_XMIT_DROP;
>> >> > - } else if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
>> >> > - qdisc_run_begin(q)) {
>> >> > + goto unlock;
>> >> > + }
>> >> > + rc = NET_XMIT_SUCCESS;
>> >> > + if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
>> >> > + !llist_next(ll_list) && qdisc_run_begin(q)) {
>> >> > /*
>> >> > * This is a work-conserving queue; there are no old skbs
>> >> > * waiting to be sent out; and the qdisc is not running -
>> >> > * xmit the skb directly.
>> >> > */
>> >> >
>> >> > + skb = llist_entry(ll_list, struct sk_buff, ll_node);
>> >>
>> >> I was puzzled by this^. But AFAICT, the idea is that in the
>> >> non-contended path we're in here (no other CPU enqueueing packets), we
>> >> will still have added the skb to the llist before taking the root_lock,
>> >> and here we pull it back off the list, right?
>> >>
>> >> Even though this is technically not needed (we'll always get the same
>> >> skb back, I think?), so this is mostly for consistency(?)
>> >
>> > Exactly. I guess we could instead add an assert like
>> >
>> > diff --git a/net/core/dev.c b/net/core/dev.c
>> > index 1a0baedc4f39e17efd21b0e48a7373a394bcbfa6..4f0e448558a6d2c070d93c474698d904d0b864f6 100644
>> > --- a/net/core/dev.c
>> > +++ b/net/core/dev.c
>> > @@ -4135,7 +4135,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>> > * xmit the skb directly.
>> > */
>> >
>> > - skb = llist_entry(ll_list, struct sk_buff, ll_node);
>> > + DEBUG_NET_WARN_ON_ONCE(skb != llist_entry(ll_list, struct sk_buff, ll_node));
>> > qdisc_bstats_update(q, skb);
>> > if (sch_direct_xmit(skb, q, dev, txq, root_lock, true))
>> > __qdisc_run(q);
>>
>> OK, so thinking about this some more, isn't there a race between the
>>
>> if (first_n)
>> return NET_XMIT_SUCCESS;
>>
>> and taking the lock? I.e., two different CPUs can pass that check in
>> which case one of them will end up spinning on the lock, and by the time
>> it acquires it, there is no longer any guarantee that the skb on the
>> llist will be the same one that we started with? Or indeed that there
>> will be any skbs on the list at all?
>
> Only one cpu can observe the list was empty.
>
> This (spinlock)less list still has atomic guarantees, thanks to cmpxchg().

Ah, right, missed the bit where try_cmpxchg would update first_n if it
doesn't succeed...
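
Just to spell out my own mental model of the enqueue side, the pattern
boils down to something like the sketch below. This is a userspace
approximation with C11 atomics rather than the kernel's llist/try_cmpxchg
helpers, and it leaves out the defer_count limit from the patch:

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct node { struct node *next; };
struct head { _Atomic(struct node *) first; };

/* Push one node onto the lockless list. Returns true if the list was
 * empty beforehand, i.e. we are the one cpu that goes on to take the
 * qdisc root lock and drain the list. */
static bool push_node(struct head *h, struct node *n)
{
        struct node *first = atomic_load_explicit(&h->first,
                                                  memory_order_relaxed);

        do {
                n->next = first;
                /* On failure, 'first' is reloaded with the current head,
                 * so the next iteration links against the new value. */
        } while (!atomic_compare_exchange_weak(&h->first, &first, n));

        return first == NULL;
}
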
> So after this single cpu passes the spin_lock() barrier, the list has
> at least one skb in it.
>
> If there is a single skb in the list after the xchg() got it
> atomically, then it must be its own skb.

Gotcha! In that case I agree the DEBUG_NET_WARN_ON_ONCE() is clearer
than reassigning the skb pointer.
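
For completeness, the drain side of the sketch above would be roughly the
following (in the patch this is llist_del_all() + llist_reverse_order(),
run under the qdisc root lock by the one cpu that saw the list empty):

/* Grab the whole list atomically and reverse it, so the nodes come
 * back in the order they were pushed. If a single node comes back,
 * it is necessarily the one we pushed ourselves. */
static struct node *pop_all_fifo(struct head *h)
{
        struct node *n = atomic_exchange_explicit(&h->first, NULL,
                                                  memory_order_acquire);
        struct node *rev = NULL;

        while (n) {
                struct node *next = n->next;

                n->next = rev;
                rev = n;
                n = next;
        }
        return rev;
}
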
-Toke