Message-ID: <CANn89iKoMHvYsvmqHB2HmCPp3H8Mmz1FwDD1XmqM6oDJS9+vTA@mail.gmail.com>
Date: Wed, 8 Oct 2025 02:32:09 -0700
From: Eric Dumazet <edumazet@...gle.com>
To: Toke Høiland-Jørgensen <toke@...hat.com>
Cc: "David S . Miller" <davem@...emloft.net>, Jakub Kicinski <kuba@...nel.org>,
Paolo Abeni <pabeni@...hat.com>, Simon Horman <horms@...nel.org>,
Jamal Hadi Salim <jhs@...atatu.com>, Cong Wang <xiyou.wangcong@...il.com>,
Jiri Pirko <jiri@...nulli.us>, Kuniyuki Iwashima <kuniyu@...gle.com>,
Willem de Bruijn <willemb@...gle.com>, netdev@...r.kernel.org, eric.dumazet@...il.com
Subject: Re: [PATCH RFC net-next 5/5] net: dev_queue_xmit() llist adoption
On Wed, Oct 8, 2025 at 1:48 AM Toke Høiland-Jørgensen <toke@...hat.com> wrote:
>
> Eric Dumazet <edumazet@...gle.com> writes:
>
> > Remove the busylock spinlock and use a lockless list (llist)
> > to reduce spinlock contention to a minimum.
> >
> > The idea is that at most one CPU spins on the qdisc spinlock,
> > while the others simply add their skb to the llist.
> >
> > After this patch, we get a 300 % improvement on heavy TX workloads.
> > - Sending twice the number of packets per second.
> > - While consuming 50 % fewer cycles.
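
A minimal user-space sketch of the scheme described above, using C11 atomics rather than the kernel's llist API; struct node, struct queue, push() and drain_all() are made-up names standing in for sk_buff, Qdisc, llist_add() and llist_del_all():

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct node {
	struct node *next;
};

struct queue {
	_Atomic(struct node *) first;	/* head of the lockless list */
};

/* Push one node with a CAS loop; return true if the list was empty,
 * i.e. the caller just took on the "single draining cpu" role.
 */
static bool push(struct queue *q, struct node *n)
{
	struct node *first = atomic_load(&q->first);

	do {
		n->next = first;
	} while (!atomic_compare_exchange_weak(&q->first, &first, n));

	return first == NULL;
}

/* Detach the whole list in one shot, like llist_del_all(). */
static struct node *drain_all(struct queue *q)
{
	return atomic_exchange(&q->first, NULL);
}
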
> >
> > Tested:
> >
> > - Dual Intel(R) Xeon(R) 6985P-C (480 hyper threads).
> > - 100Gbit NIC, 30 TX queues with FQ packet scheduler.
> > - echo 64 >/sys/kernel/slab/skbuff_small_head/cpu_partial (avoid contention in mm)
> > - 240 concurrent "netperf -t UDP_STREAM -- -m 120 -n"
> >
> > Before:
> >
> > 16 Mpps (41 Mpps if each thread is pinned to a different cpu)
> >
> > vmstat 2 5
> > procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
> > r b swpd free buff cache si so bi bo in cs us sy id wa st
> > 243 0 0 2368988672 51036 1100852 0 0 146 1 242 60 0 9 91 0 0
> > 244 0 0 2368988672 51036 1100852 0 0 536 10 487745 14718 0 52 48 0 0
> > 244 0 0 2368988672 51036 1100852 0 0 512 0 503067 46033 0 52 48 0 0
> > 244 0 0 2368988672 51036 1100852 0 0 512 0 494807 12107 0 52 48 0 0
> > 244 0 0 2368988672 51036 1100852 0 0 702 26 492845 10110 0 52 48 0 0
> >
> > Lock contention (1 second sample taken on 8 cores)
> > perf lock record -C0-7 sleep 1; perf lock contention
> > contended total wait max wait avg wait type caller
> >
> > 442111 6.79 s 162.47 ms 15.35 us spinlock dev_hard_start_xmit+0xcd
> > 5961 9.57 ms 8.12 us 1.60 us spinlock __dev_queue_xmit+0x3a0
> > 244 560.63 us 7.63 us 2.30 us spinlock do_softirq+0x5b
> > 13 25.09 us 3.21 us 1.93 us spinlock net_tx_action+0xf8
> >
> > If netperf threads are pinned, spinlock stress is very high.
> > perf lock record -C0-7 sleep 1; perf lock contention
> > contended total wait max wait avg wait type caller
> >
> > 964508 7.10 s 147.25 ms 7.36 us spinlock dev_hard_start_xmit+0xcd
> > 201 268.05 us 4.65 us 1.33 us spinlock __dev_queue_xmit+0x3a0
> > 12 26.05 us 3.84 us 2.17 us spinlock do_softirq+0x5b
> >
> > @__dev_queue_xmit_ns:
> > [256, 512) 21 | |
> > [512, 1K) 631 | |
> > [1K, 2K) 27328 |@ |
> > [2K, 4K) 265392 |@@@@@@@@@@@@@@@@ |
> > [4K, 8K) 417543 |@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> > [8K, 16K) 826292 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > [16K, 32K) 733822 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> > [32K, 64K) 19055 |@ |
> > [64K, 128K) 17240 |@ |
> > [128K, 256K) 25633 |@ |
> > [256K, 512K) 4 | |
> >
> > After:
> >
> > 29 Mpps (57 Mpps if each thread is pinned to a different cpu)
> >
> > vmstat 2 5
> > procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
> > r b swpd free buff cache si so bi bo in cs us sy id wa st
> > 78 0 0 2369573632 32896 1350988 0 0 22 0 331 254 0 8 92 0 0
> > 75 0 0 2369573632 32896 1350988 0 0 22 50 425713 280199 0 23 76 0 0
> > 104 0 0 2369573632 32896 1350988 0 0 290 0 430238 298247 0 23 76 0 0
> > 86 0 0 2369573632 32896 1350988 0 0 132 0 428019 291865 0 24 76 0 0
> > 90 0 0 2369573632 32896 1350988 0 0 502 0 422498 278672 0 23 76 0 0
> >
> > perf lock record -C0-7 sleep 1; perf lock contention
> > contended total wait max wait avg wait type caller
> >
> > 2524 116.15 ms 486.61 us 46.02 us spinlock __dev_queue_xmit+0x55b
> > 5821 107.18 ms 371.67 us 18.41 us spinlock dev_hard_start_xmit+0xcd
> > 2377 9.73 ms 35.86 us 4.09 us spinlock ___slab_alloc+0x4e0
> > 923 5.74 ms 20.91 us 6.22 us spinlock ___slab_alloc+0x5c9
> > 121 3.42 ms 193.05 us 28.24 us spinlock net_tx_action+0xf8
> > 6 564.33 us 167.60 us 94.05 us spinlock do_softirq+0x5b
> >
> > If netperf threads are pinned (~54 Mpps)
> > perf lock record -C0-7 sleep 1; perf lock contention
> > 32907 316.98 ms 195.98 us 9.63 us spinlock dev_hard_start_xmit+0xcd
> > 4507 61.83 ms 212.73 us 13.72 us spinlock __dev_queue_xmit+0x554
> > 2781 23.53 ms 40.03 us 8.46 us spinlock ___slab_alloc+0x5c9
> > 3554 18.94 ms 34.69 us 5.33 us spinlock ___slab_alloc+0x4e0
> > 233 9.09 ms 215.70 us 38.99 us spinlock do_softirq+0x5b
> > 153 930.66 us 48.67 us 6.08 us spinlock net_tx_action+0xfd
> > 84 331.10 us 14.22 us 3.94 us spinlock ___slab_alloc+0x5c9
> > 140 323.71 us 9.94 us 2.31 us spinlock ___slab_alloc+0x4e0
> >
> > @__dev_queue_xmit_ns:
> > [128, 256) 1539830 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> > [256, 512) 2299558 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > [512, 1K) 483936 |@@@@@@@@@@ |
> > [1K, 2K) 265345 |@@@@@@ |
> > [2K, 4K) 145463 |@@@ |
> > [4K, 8K) 54571 |@ |
> > [8K, 16K) 10270 | |
> > [16K, 32K) 9385 | |
> > [32K, 64K) 7749 | |
> > [64K, 128K) 26799 | |
> > [128K, 256K) 2665 | |
> > [256K, 512K) 665 | |
> >
> > Signed-off-by: Eric Dumazet <edumazet@...gle.com>
>
> This is very cool! One question below, just to make sure I understand
> this correctly:
>
> > ---
> > include/net/sch_generic.h | 4 +-
> > net/core/dev.c | 85 ++++++++++++++++++++++-----------------
> > net/sched/sch_generic.c | 5 ---
> > 3 files changed, 52 insertions(+), 42 deletions(-)
> >
> > diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> > index 31561291bc92fd70d4d3ca8f5f7dbc4c94c895a0..94966692ccdf51db085c236319705aecba8c30cf 100644
> > --- a/include/net/sch_generic.h
> > +++ b/include/net/sch_generic.h
> > @@ -115,7 +115,9 @@ struct Qdisc {
> > struct Qdisc *next_sched;
> > struct sk_buff_head skb_bad_txq;
> >
> > - spinlock_t busylock ____cacheline_aligned_in_smp;
> > + atomic_long_t defer_count ____cacheline_aligned_in_smp;
> > + struct llist_head defer_list;
> > +
> > spinlock_t seqlock;
> >
> > struct rcu_head rcu;
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index 0ff178399b2d28ca2754b3f06d69a97f5d6dcf71..6094768bf3c028f0ad1e52b9b12b7258fa0ecff6 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -4125,9 +4125,10 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
> > struct net_device *dev,
> > struct netdev_queue *txq)
> > {
> > + struct sk_buff *next, *to_free = NULL;
> > spinlock_t *root_lock = qdisc_lock(q);
> > - struct sk_buff *to_free = NULL;
> > - bool contended;
> > + struct llist_node *ll_list, *first_n;
> > + unsigned long defer_count = 0;
> > int rc;
> >
> > qdisc_calculate_pkt_len(skb, q);
> > @@ -4167,61 +4168,73 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
> > return rc;
> > }
> >
> > - /*
> > - * Heuristic to force contended enqueues to serialize on a
> > - * separate lock before trying to get qdisc main lock.
> > - * This permits qdisc->running owner to get the lock more
> > - * often and dequeue packets faster.
> > - * On PREEMPT_RT it is possible to preempt the qdisc owner during xmit
> > - * and then other tasks will only enqueue packets. The packets will be
> > - * sent after the qdisc owner is scheduled again. To prevent this
> > - * scenario the task always serialize on the lock.
> > + /* Open code llist_add(&skb->ll_node, &q->defer_list) + queue limit.
> > + * In the try_cmpxchg() loop, we want to increment q->defer_count
> > + * at most once to limit the number of skbs in defer_list.
> > + * We perform the defer_count increment only if the list is not empty,
> > + * because some arches have slow atomic_long_inc_return().
> > + */
> > + first_n = READ_ONCE(q->defer_list.first);
> > + do {
> > + if (first_n && !defer_count) {
> > + defer_count = atomic_long_inc_return(&q->defer_count);
> > + if (unlikely(defer_count > q->limit)) {
> > + kfree_skb_reason(skb, SKB_DROP_REASON_QDISC_DROP);
> > + return NET_XMIT_DROP;
> > + }
> > + }
> > + skb->ll_node.next = first_n;
> > + } while (!try_cmpxchg(&q->defer_list.first, &first_n, &skb->ll_node));
> > +
> > + /* If defer_list was not empty, we know the cpu which queued
> > + * the first skb will process the whole list for us.
> > */
> > - contended = qdisc_is_running(q) || IS_ENABLED(CONFIG_PREEMPT_RT);
> > - if (unlikely(contended))
> > - spin_lock(&q->busylock);
> > + if (first_n)
> > + return NET_XMIT_SUCCESS;
> >
> > spin_lock(root_lock);
> > +
> > + ll_list = llist_del_all(&q->defer_list);
> > +       /* There is a small race because defer_count is not cleared
> > +        * atomically with the prior llist_del_all(). This means
> > +        * defer_list could grow beyond q->limit.
> > + */
> > + atomic_long_set(&q->defer_count, 0);
> > +
> > + ll_list = llist_reverse_order(ll_list);
> > +
> > if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
> > - __qdisc_drop(skb, &to_free);
> > + llist_for_each_entry_safe(skb, next, ll_list, ll_node)
> > + __qdisc_drop(skb, &to_free);
> > rc = NET_XMIT_DROP;
> > - } else if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
> > - qdisc_run_begin(q)) {
> > + goto unlock;
> > + }
> > + rc = NET_XMIT_SUCCESS;
> > + if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
> > + !llist_next(ll_list) && qdisc_run_begin(q)) {
> > /*
> > * This is a work-conserving queue; there are no old skbs
> > * waiting to be sent out; and the qdisc is not running -
> > * xmit the skb directly.
> > */
> >
> > + skb = llist_entry(ll_list, struct sk_buff, ll_node);
>
> I was puzzled by this^. But AFAICT, the idea is that in the
> non-contended path we're in here (no other CPU enqueueing packets), we
> will still have added the skb to the llist before taking the root_lock,
> and here we pull it back off the list, right?
>
> Even though this is technically not needed (we'll always get the same
> skb back, I think?), so this is mostly for consistency(?)
Exactly. I guess we could instead add an assert like:
diff --git a/net/core/dev.c b/net/core/dev.c
index 1a0baedc4f39e17efd21b0e48a7373a394bcbfa6..4f0e448558a6d2c070d93c474698d904d0b864f6 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4135,7 +4135,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 		 * xmit the skb directly.
 		 */
 
-		skb = llist_entry(ll_list, struct sk_buff, ll_node);
+		DEBUG_NET_WARN_ON_ONCE(skb != llist_entry(ll_list, struct sk_buff, ll_node));
 		qdisc_bstats_update(q, skb);
 		if (sch_direct_xmit(skb, q, dev, txq, root_lock, true))
 			__qdisc_run(q);
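
A sketch of the consumer side of the same toy model (it reuses the made-up struct node, struct queue and drain_all() helpers from the sketch earlier in this message): the cpu that saw the empty-to-non-empty transition takes the lock, detaches the whole list, and reverses it so entries are processed in the order they were pushed, mirroring llist_del_all() + llist_reverse_order() + llist_for_each_entry_safe() in the patch. In the uncontended case the detached list contains exactly the node that cpu just pushed, which is why the skb can be read back with llist_entry(), or asserted as in the diff above.

/* Reverse the LIFO list into FIFO order, like llist_reverse_order(). */
static struct node *reverse(struct node *head)
{
	struct node *prev = NULL;

	while (head) {
		struct node *next = head->next;

		head->next = prev;
		prev = head;
		head = next;
	}
	return prev;
}

/* Run by the single draining cpu (under its lock in the patch's model):
 * detach everything, restore FIFO order, and hand each node to process(),
 * saving ->next first because process() may free the node.
 */
static void drain_and_process(struct queue *q,
			      void (*process)(struct node *))
{
	struct node *n = reverse(drain_all(q));

	while (n) {
		struct node *next = n->next;

		process(n);
		n = next;
	}
}
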