Message-ID: <CANn89iLLSLYNDe8OnsjDuYHBee66p9F7uFuZznri53V8ZYkbQA@mail.gmail.com>
Date: Tue, 7 Oct 2025 04:37:43 -0700
From: Eric Dumazet <edumazet@...gle.com>
To: "David S . Miller" <davem@...emloft.net>, Jakub Kicinski <kuba@...nel.org>,
Paolo Abeni <pabeni@...hat.com>
Cc: Simon Horman <horms@...nel.org>, Jamal Hadi Salim <jhs@...atatu.com>,
Cong Wang <xiyou.wangcong@...il.com>, Jiri Pirko <jiri@...nulli.us>,
Kuniyuki Iwashima <kuniyu@...gle.com>, Willem de Bruijn <willemb@...gle.com>, netdev@...r.kernel.org,
eric.dumazet@...il.com
Subject: Re: [PATCH RFC net-next 5/5] net: dev_queue_xmit() llist adoption
On Mon, Oct 6, 2025 at 12:31 PM Eric Dumazet <edumazet@...gle.com> wrote:
>
> Remove the busylock spinlock and use a lockless list (llist)
> to reduce spinlock contention to a minimum.
>
> The idea is that only one cpu might spin on the qdisc spinlock,
> while the others simply add their skb to the llist.
>
> After this patch, we get a 300% improvement on heavy TX workloads:
> - Sending twice the number of packets per second.
> - While consuming 50% fewer cycles.
>
> Tested:
>
> - Dual Intel(R) Xeon(R) 6985P-C (480 hyper threads).
> - 100Gbit NIC, 30 TX queues with FQ packet scheduler.
> - echo 64 >/sys/kernel/slab/skbuff_small_head/cpu_partial (avoid contention in mm)
> - 240 concurrent "netperf -t UDP_STREAM -- -m 120 -n"
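
To make the idea above more concrete, here is a deliberately simplified
sketch of the flow, written as if it lived next to __dev_xmit_skb() in
net/core/dev.c (dev_qdisc_enqueue() is the existing helper there). It only
shows the producer/single-consumer split; the real code in the diff below
also handles TCQ_F_CAN_BYPASS, qdisc deactivation and the defer_count
limit, so treat it as an illustration rather than the actual
implementation:

/* Sketch only: every cpu publishes its skb on the lockless list;
 * llist_add() returns true for the one cpu that found the list empty,
 * and only that cpu takes the qdisc spinlock and flushes the batch.
 */
static void sketch_xmit(struct sk_buff *skb, struct Qdisc *q,
                        struct netdev_queue *txq, spinlock_t *root_lock)
{
        struct sk_buff *next, *to_free = NULL;
        struct llist_node *ll_list;

        if (!llist_add(&skb->ll_node, &q->defer_list)) {
                /* Another cpu already owns the flush: account the deferred
                 * skb (the real code compares this with q->limit) and
                 * return without ever touching root_lock.
                 */
                atomic_long_inc(&q->defer_count);
                return;
        }

        /* Single consumer: only this cpu contends on the qdisc spinlock. */
        spin_lock(root_lock);
        ll_list = llist_del_all(&q->defer_list);
        atomic_long_set(&q->defer_count, 0);
        /* llist_add() builds a LIFO chain; restore FIFO order. */
        ll_list = llist_reverse_order(ll_list);
        llist_for_each_entry_safe(skb, next, ll_list, ll_node)
                dev_qdisc_enqueue(skb, q, &to_free, txq);
        qdisc_run(q);
        spin_unlock(root_lock);

        if (unlikely(to_free))
                kfree_skb_list(to_free);
}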
>
>
> +
> + ll_list = llist_del_all(&q->defer_list);
> + /* There is a small race because we clear defer_count not atomically
> + * with the prior llist_del_all(). This means defer_list could grow
> + * over q->limit.
> + */
> + atomic_long_set(&q->defer_count, 0);
> +
> + ll_list = llist_reverse_order(ll_list);
> +
> if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
> - __qdisc_drop(skb, &to_free);
> + llist_for_each_entry_safe(skb, next, ll_list, ll_node)
> + __qdisc_drop(skb, &to_free);
> rc = NET_XMIT_DROP;
> - } else if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
> - qdisc_run_begin(q)) {
> + goto unlock;
> + }
> + rc = NET_XMIT_SUCCESS;
> + if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
> + !llist_next(ll_list) && qdisc_run_begin(q)) {
> /*
> * This is a work-conserving queue; there are no old skbs
> * waiting to be sent out; and the qdisc is not running -
> * xmit the skb directly.
> */
>
> + skb = llist_entry(ll_list, struct sk_buff, ll_node);
> qdisc_bstats_update(q, skb);
> -
> - if (sch_direct_xmit(skb, q, dev, txq, root_lock, true)) {
> - if (unlikely(contended)) {
> - spin_unlock(&q->busylock);
> - contended = false;
> - }
> + if (sch_direct_xmit(skb, q, dev, txq, root_lock, true))
> __qdisc_run(q);
> - }
> -
> qdisc_run_end(q);
> - rc = NET_XMIT_SUCCESS;
> } else {
> - rc = dev_qdisc_enqueue(skb, q, &to_free, txq);
> - if (qdisc_run_begin(q)) {
> - if (unlikely(contended)) {
> - spin_unlock(&q->busylock);
> - contended = false;
> - }
> - __qdisc_run(q);
> - qdisc_run_end(q);
> + llist_for_each_entry_safe(skb, next, ll_list, ll_node) {
> + prefetch(next);
> + dev_qdisc_enqueue(skb, q, &to_free, txq);
Now would be a good time to add batch support to some qdisc->enqueue()
implementations where this would help.
For instance, fq_enqueue() could take a single ktime_get() sample for the
whole batch.
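
Something along these lines, as a rough sketch only: neither
fq_enqueue_batch() nor fq_enqueue_prepared() exist today, the latter
standing for fq_enqueue() with its own clock sample hoisted out to the
caller.

static int fq_enqueue_batch(struct llist_node *ll_list, struct Qdisc *sch,
                            struct sk_buff **to_free)
{
        u64 now = ktime_get_ns();  /* one clock read for the whole batch */
        struct sk_buff *skb, *next;
        int rc = NET_XMIT_SUCCESS;

        llist_for_each_entry_safe(skb, next, ll_list, ll_node) {
                prefetch(next);
                /* Hypothetical helper: today's fq_enqueue() minus its own
                 * clock sample, with 'now' passed in instead.
                 */
                rc = fq_enqueue_prepared(skb, sch, to_free, now);
        }
        return rc;      /* error aggregation left aside in this sketch */
}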