Message-ID: <CANn89iKoMHvYsvmqHB2HmCPp3H8Mmz1FwDD1XmqM6oDJS9+vTA@mail.gmail.com>
Date: Wed, 8 Oct 2025 02:32:09 -0700
From: Eric Dumazet <edumazet@...gle.com>
To: Toke Høiland-Jørgensen <toke@...hat.com>
Cc: "David S . Miller" <davem@...emloft.net>, Jakub Kicinski <kuba@...nel.org>,
Paolo Abeni <pabeni@...hat.com>, Simon Horman <horms@...nel.org>,
Jamal Hadi Salim <jhs@...atatu.com>, Cong Wang <xiyou.wangcong@...il.com>,
Jiri Pirko <jiri@...nulli.us>, Kuniyuki Iwashima <kuniyu@...gle.com>,
Willem de Bruijn <willemb@...gle.com>, netdev@...r.kernel.org, eric.dumazet@...il.com
Subject: Re: [PATCH RFC net-next 5/5] net: dev_queue_xmit() llist adoption
On Wed, Oct 8, 2025 at 1:48 AM Toke Høiland-Jørgensen <toke@...hat.com> wrote:
>
> Eric Dumazet <edumazet@...gle.com> writes:
>
> > Remove the busylock spinlock and use a lockless list (llist)
> > to reduce spinlock contention to a minimum.
> >
> > The idea is that at most one CPU spins on the qdisc spinlock,
> > while the others simply add their skb to the llist.
> >
> > After this patch, we get a 300 % improvement on heavy TX workloads.
> > - Sending twice the number of packets per second.
> > - While consuming 50 % fewer cycles.
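
A minimal user-space sketch of the scheme described above, using C11 atomics rather than the kernel's llist API; struct node, struct queue, push() and drain_all() are made-up names standing in for sk_buff, Qdisc, llist_add() and llist_del_all():

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct node {
	struct node *next;
};

struct queue {
	_Atomic(struct node *) first;	/* head of the lockless list */
};

/* Push one node with a CAS loop; return true if the list was empty,
 * i.e. the caller just took on the "single draining cpu" role.
 */
static bool push(struct queue *q, struct node *n)
{
	struct node *first = atomic_load(&q->first);

	do {
		n->next = first;
	} while (!atomic_compare_exchange_weak(&q->first, &first, n));

	return first == NULL;
}

/* Detach the whole list in one shot, like llist_del_all(). */
static struct node *drain_all(struct queue *q)
{
	return atomic_exchange(&q->first, NULL);
}
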
> >
> > Tested:
> >
> > - Dual Intel(R) Xeon(R) 6985P-C (480 hyper threads).
> > - 100Gbit NIC, 30 TX queues with FQ packet scheduler.
> > - echo 64 >/sys/kernel/slab/skbuff_small_head/cpu_partial (avoid contention in mm)
> > - 240 concurrent "netperf -t UDP_STREAM -- -m 120 -n"
> >
> > Before:
> >
> > 16 Mpps (41 Mpps if each thread is pinned to a different cpu)
> >
> > vmstat 2 5
> > procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
> > r b swpd free buff cache si so bi bo in cs us sy id wa st
> > 243 0 0 2368988672 51036 1100852 0 0 146 1 242 60 0 9 91 0 0
> > 244 0 0 2368988672 51036 1100852 0 0 536 10 487745 14718 0 52 48 0 0
> > 244 0 0 2368988672 51036 1100852 0 0 512 0 503067 46033 0 52 48 0 0
> > 244 0 0 2368988672 51036 1100852 0 0 512 0 494807 12107 0 52 48 0 0
> > 244 0 0 2368988672 51036 1100852 0 0 702 26 492845 10110 0 52 48 0 0
> >
> > Lock contention (1 second sample taken on 8 cores)
> > perf lock record -C0-7 sleep 1; perf lock contention
> > contended total wait max wait avg wait type caller
> >
> > 442111 6.79 s 162.47 ms 15.35 us spinlock dev_hard_start_xmit+0xcd
> > 5961 9.57 ms 8.12 us 1.60 us spinlock __dev_queue_xmit+0x3a0
> > 244 560.63 us 7.63 us 2.30 us spinlock do_softirq+0x5b
> > 13 25.09 us 3.21 us 1.93 us spinlock net_tx_action+0xf8
> >
> > If netperf threads are pinned, spinlock stress is very high.
> > perf lock record -C0-7 sleep 1; perf lock contention
> > contended total wait max wait avg wait type caller
> >
> > 964508 7.10 s 147.25 ms 7.36 us spinlock dev_hard_start_xmit+0xcd
> > 201 268.05 us 4.65 us 1.33 us spinlock __dev_queue_xmit+0x3a0
> > 12 26.05 us 3.84 us 2.17 us spinlock do_softirq+0x5b
> >
> > @__dev_queue_xmit_ns:
> > [256, 512) 21 | |
> > [512, 1K) 631 | |
> > [1K, 2K) 27328 |@ |
> > [2K, 4K) 265392 |@@@@@@@@@@@@@@@@ |
> > [4K, 8K) 417543 |@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> > [8K, 16K) 826292 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > [16K, 32K) 733822 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> > [32K, 64K) 19055 |@ |
> > [64K, 128K) 17240 |@ |
> > [128K, 256K) 25633 |@ |
> > [256K, 512K) 4 | |
> >
> > After:
> >
> > 29 Mpps (57 Mpps if each thread is pinned to a different cpu)
> >
> > vmstat 2 5
> > procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
> > r b swpd free buff cache si so bi bo in cs us sy id wa st
> > 78 0 0 2369573632 32896 1350988 0 0 22 0 331 254 0 8 92 0 0
> > 75 0 0 2369573632 32896 1350988 0 0 22 50 425713 280199 0 23 76 0 0
> > 104 0 0 2369573632 32896 1350988 0 0 290 0 430238 298247 0 23 76 0 0
> > 86 0 0 2369573632 32896 1350988 0 0 132 0 428019 291865 0 24 76 0 0
> > 90 0 0 2369573632 32896 1350988 0 0 502 0 422498 278672 0 23 76 0 0
> >
> > perf lock record -C0-7 sleep 1; perf lock contention
> > contended total wait max wait avg wait type caller
> >
> > 2524 116.15 ms 486.61 us 46.02 us spinlock __dev_queue_xmit+0x55b
> > 5821 107.18 ms 371.67 us 18.41 us spinlock dev_hard_start_xmit+0xcd
> > 2377 9.73 ms 35.86 us 4.09 us spinlock ___slab_alloc+0x4e0
> > 923 5.74 ms 20.91 us 6.22 us spinlock ___slab_alloc+0x5c9
> > 121 3.42 ms 193.05 us 28.24 us spinlock net_tx_action+0xf8
> > 6 564.33 us 167.60 us 94.05 us spinlock do_softirq+0x5b
> >
> > If netperf threads are pinned (~54 Mpps)
> > perf lock record -C0-7 sleep 1; perf lock contention
> > 32907 316.98 ms 195.98 us 9.63 us spinlock dev_hard_start_xmit+0xcd
> > 4507 61.83 ms 212.73 us 13.72 us spinlock __dev_queue_xmit+0x554
> > 2781 23.53 ms 40.03 us 8.46 us spinlock ___slab_alloc+0x5c9
> > 3554 18.94 ms 34.69 us 5.33 us spinlock ___slab_alloc+0x4e0
> > 233 9.09 ms 215.70 us 38.99 us spinlock do_softirq+0x5b
> > 153 930.66 us 48.67 us 6.08 us spinlock net_tx_action+0xfd
> > 84 331.10 us 14.22 us 3.94 us spinlock ___slab_alloc+0x5c9
> > 140 323.71 us 9.94 us 2.31 us spinlock ___slab_alloc+0x4e0
> >
> > @__dev_queue_xmit_ns:
> > [128, 256) 1539830 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> > [256, 512) 2299558 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > [512, 1K) 483936 |@@@@@@@@@@ |
> > [1K, 2K) 265345 |@@@@@@ |
> > [2K, 4K) 145463 |@@@ |
> > [4K, 8K) 54571 |@ |
> > [8K, 16K) 10270 | |
> > [16K, 32K) 9385 | |
> > [32K, 64K) 7749 | |
> > [64K, 128K) 26799 | |
> > [128K, 256K) 2665 | |
> > [256K, 512K) 665 | |
> >
> > Signed-off-by: Eric Dumazet <edumazet@...gle.com>
>
> This is very cool! One question below, just to make sure I understand
> this correctly:
>
> > ---
> > include/net/sch_generic.h | 4 +-
> > net/core/dev.c | 85 ++++++++++++++++++++++-----------------
> > net/sched/sch_generic.c | 5 ---
> > 3 files changed, 52 insertions(+), 42 deletions(-)
> >
> > diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> > index 31561291bc92fd70d4d3ca8f5f7dbc4c94c895a0..94966692ccdf51db085c236319705aecba8c30cf 100644
> > --- a/include/net/sch_generic.h
> > +++ b/include/net/sch_generic.h
> > @@ -115,7 +115,9 @@ struct Qdisc {
> > struct Qdisc *next_sched;
> > struct sk_buff_head skb_bad_txq;
> >
> > - spinlock_t busylock ____cacheline_aligned_in_smp;
> > + atomic_long_t defer_count ____cacheline_aligned_in_smp;
> > + struct llist_head defer_list;
> > +
> > spinlock_t seqlock;
> >
> > struct rcu_head rcu;
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index 0ff178399b2d28ca2754b3f06d69a97f5d6dcf71..6094768bf3c028f0ad1e52b9b12b7258fa0ecff6 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -4125,9 +4125,10 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
> > struct net_device *dev,
> > struct netdev_queue *txq)
> > {
> > + struct sk_buff *next, *to_free = NULL;
> > spinlock_t *root_lock = qdisc_lock(q);
> > - struct sk_buff *to_free = NULL;
> > - bool contended;
> > + struct llist_node *ll_list, *first_n;
> > + unsigned long defer_count = 0;
> > int rc;
> >
> > qdisc_calculate_pkt_len(skb, q);
> > @@ -4167,61 +4168,73 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
> > return rc;
> > }
> >
> > - /*
> > - * Heuristic to force contended enqueues to serialize on a
> > - * separate lock before trying to get qdisc main lock.
> > - * This permits qdisc->running owner to get the lock more
> > - * often and dequeue packets faster.
> > - * On PREEMPT_RT it is possible to preempt the qdisc owner during xmit
> > - * and then other tasks will only enqueue packets. The packets will be
> > - * sent after the qdisc owner is scheduled again. To prevent this
> > - * scenario the task always serialize on the lock.
> > + /* Open code llist_add(&skb->ll_node, &q->defer_list) + queue limit.
> > + * In the try_cmpxchg() loop, we want to increment q->defer_count
> > + * at most once to limit the number of skbs in defer_list.
> > + * We perform the defer_count increment only if the list is not empty,
> > + * because some arches have slow atomic_long_inc_return().
> > + */
> > + first_n = READ_ONCE(q->defer_list.first);
> > + do {
> > + if (first_n && !defer_count) {
> > + defer_count = atomic_long_inc_return(&q->defer_count);
> > + if (unlikely(defer_count > q->limit)) {
> > + kfree_skb_reason(skb, SKB_DROP_REASON_QDISC_DROP);
> > + return NET_XMIT_DROP;
> > + }
> > + }
> > + skb->ll_node.next = first_n;
> > + } while (!try_cmpxchg(&q->defer_list.first, &first_n, &skb->ll_node));
> > +
> > + /* If defer_list was not empty, we know the cpu which queued
> > + * the first skb will process the whole list for us.
> > */
> > - contended = qdisc_is_running(q) || IS_ENABLED(CONFIG_PREEMPT_RT);
> > - if (unlikely(contended))
> > - spin_lock(&q->busylock);
> > + if (first_n)
> > + return NET_XMIT_SUCCESS;
> >
> > spin_lock(root_lock);
> > +
> > + ll_list = llist_del_all(&q->defer_list);
> > +       /* There is a small race because defer_count is not cleared
> > +        * atomically with the prior llist_del_all(). This means
> > +        * defer_list could grow beyond q->limit.
> > + */
> > + atomic_long_set(&q->defer_count, 0);
> > +
> > + ll_list = llist_reverse_order(ll_list);
> > +
> > if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
> > - __qdisc_drop(skb, &to_free);
> > + llist_for_each_entry_safe(skb, next, ll_list, ll_node)
> > + __qdisc_drop(skb, &to_free);
> > rc = NET_XMIT_DROP;
> > - } else if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
> > - qdisc_run_begin(q)) {
> > + goto unlock;
> > + }
> > + rc = NET_XMIT_SUCCESS;
> > + if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
> > + !llist_next(ll_list) && qdisc_run_begin(q)) {
> > /*
> > * This is a work-conserving queue; there are no old skbs
> > * waiting to be sent out; and the qdisc is not running -
> > * xmit the skb directly.
> > */
> >
> > + skb = llist_entry(ll_list, struct sk_buff, ll_node);
>
> I was puzzled by this^. But AFAICT, the idea is that in the
> non-contended path we're in here (no other CPU enqueueing packets), we
> will still have added the skb to the llist before taking the root_lock,
> and here we pull it back off the list, right?
>
> Even though this is technically not needed (we'll always get the same
> skb back, I think?), so this is mostly for consistency(?)
Exactly. I guess we could instead add an assert like:
diff --git a/net/core/dev.c b/net/core/dev.c
index 1a0baedc4f39e17efd21b0e48a7373a394bcbfa6..4f0e448558a6d2c070d93c474698d904d0b864f6 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4135,7 +4135,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 		 * xmit the skb directly.
 		 */
 
-		skb = llist_entry(ll_list, struct sk_buff, ll_node);
+		DEBUG_NET_WARN_ON_ONCE(skb != llist_entry(ll_list, struct sk_buff, ll_node));
 		qdisc_bstats_update(q, skb);
 		if (sch_direct_xmit(skb, q, dev, txq, root_lock, true))
 			__qdisc_run(q);
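
A sketch of the consumer side of the same toy model (it reuses the made-up struct node, struct queue and drain_all() helpers from the sketch earlier in this message): the cpu that saw the empty-to-non-empty transition takes the lock, detaches the whole list, and reverses it so entries are processed in the order they were pushed, mirroring llist_del_all() + llist_reverse_order() + llist_for_each_entry_safe() in the patch. In the uncontended case the detached list contains exactly the node that cpu just pushed, which is why the skb can be read back with llist_entry(), or asserted as in the diff above.

/* Reverse the LIFO list into FIFO order, like llist_reverse_order(). */
static struct node *reverse(struct node *head)
{
	struct node *prev = NULL;

	while (head) {
		struct node *next = head->next;

		head->next = prev;
		prev = head;
		head = next;
	}
	return prev;
}

/* Run by the single draining cpu (under its lock in the patch's model):
 * detach everything, restore FIFO order, and hand each node to process(),
 * saving ->next first because process() may free the node.
 */
static void drain_and_process(struct queue *q,
			      void (*process)(struct node *))
{
	struct node *n = reverse(drain_all(q));

	while (n) {
		struct node *next = n->next;

		process(n);
		n = next;
	}
}
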