Message-ID: <CANn89iLLSLYNDe8OnsjDuYHBee66p9F7uFuZznri53V8ZYkbQA@mail.gmail.com>
Date: Tue, 7 Oct 2025 04:37:43 -0700
From: Eric Dumazet <edumazet@...gle.com>
To: "David S . Miller" <davem@...emloft.net>, Jakub Kicinski <kuba@...nel.org>,
Paolo Abeni <pabeni@...hat.com>
Cc: Simon Horman <horms@...nel.org>, Jamal Hadi Salim <jhs@...atatu.com>,
Cong Wang <xiyou.wangcong@...il.com>, Jiri Pirko <jiri@...nulli.us>,
Kuniyuki Iwashima <kuniyu@...gle.com>, Willem de Bruijn <willemb@...gle.com>, netdev@...r.kernel.org,
eric.dumazet@...il.com
Subject: Re: [PATCH RFC net-next 5/5] net: dev_queue_xmit() llist adoption
On Mon, Oct 6, 2025 at 12:31 PM Eric Dumazet <edumazet@...gle.com> wrote:
>
> Remove the busylock spinlock and use a lockless list (llist)
> to reduce spinlock contention to a minimum.
>
> The idea is that only one cpu might spin on the qdisc spinlock,
> while the others simply add their skb to the llist.
>
> After this patch, we get a 300% improvement on heavy TX workloads:
> - Sending twice the number of packets per second.
> - While consuming 50% fewer cycles.
>
> Tested:
>
> - Dual Intel(R) Xeon(R) 6985P-C (480 hyper threads).
> - 100Gbit NIC, 30 TX queues with FQ packet scheduler.
> - echo 64 >/sys/kernel/slab/skbuff_small_head/cpu_partial (avoid contention in mm)
> - 240 concurrent "netperf -t UDP_STREAM -- -m 120 -n"
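
To make the idea above more concrete, here is a deliberately simplified
sketch of the flow, written as if it lived next to __dev_xmit_skb() in
net/core/dev.c (dev_qdisc_enqueue() is the existing helper there). It only
shows the producer/single-consumer split; the real code in the diff below
also handles TCQ_F_CAN_BYPASS, qdisc deactivation and the defer_count
limit, so treat it as an illustration rather than the actual
implementation:

/* Sketch only: every cpu publishes its skb on the lockless list;
 * llist_add() returns true for the one cpu that found the list empty,
 * and only that cpu takes the qdisc spinlock and flushes the batch.
 */
static void sketch_xmit(struct sk_buff *skb, struct Qdisc *q,
                        struct netdev_queue *txq, spinlock_t *root_lock)
{
        struct sk_buff *next, *to_free = NULL;
        struct llist_node *ll_list;

        if (!llist_add(&skb->ll_node, &q->defer_list)) {
                /* Another cpu already owns the flush: account the deferred
                 * skb (the real code compares this with q->limit) and
                 * return without ever touching root_lock.
                 */
                atomic_long_inc(&q->defer_count);
                return;
        }

        /* Single consumer: only this cpu contends on the qdisc spinlock. */
        spin_lock(root_lock);
        ll_list = llist_del_all(&q->defer_list);
        atomic_long_set(&q->defer_count, 0);
        /* llist_add() builds a LIFO chain; restore FIFO order. */
        ll_list = llist_reverse_order(ll_list);
        llist_for_each_entry_safe(skb, next, ll_list, ll_node)
                dev_qdisc_enqueue(skb, q, &to_free, txq);
        qdisc_run(q);
        spin_unlock(root_lock);

        if (unlikely(to_free))
                kfree_skb_list(to_free);
}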
>
>
> +
> + ll_list = llist_del_all(&q->defer_list);
> + /* There is a small race because we clear defer_count not atomically
> + * with the prior llist_del_all(). This means defer_list could grow
> + * over q->limit.
> + */
> + atomic_long_set(&q->defer_count, 0);
> +
> + ll_list = llist_reverse_order(ll_list);
> +
> if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
> - __qdisc_drop(skb, &to_free);
> + llist_for_each_entry_safe(skb, next, ll_list, ll_node)
> + __qdisc_drop(skb, &to_free);
> rc = NET_XMIT_DROP;
> - } else if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
> - qdisc_run_begin(q)) {
> + goto unlock;
> + }
> + rc = NET_XMIT_SUCCESS;
> + if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
> + !llist_next(ll_list) && qdisc_run_begin(q)) {
> /*
> * This is a work-conserving queue; there are no old skbs
> * waiting to be sent out; and the qdisc is not running -
> * xmit the skb directly.
> */
>
> + skb = llist_entry(ll_list, struct sk_buff, ll_node);
> qdisc_bstats_update(q, skb);
> -
> - if (sch_direct_xmit(skb, q, dev, txq, root_lock, true)) {
> - if (unlikely(contended)) {
> - spin_unlock(&q->busylock);
> - contended = false;
> - }
> + if (sch_direct_xmit(skb, q, dev, txq, root_lock, true))
> __qdisc_run(q);
> - }
> -
> qdisc_run_end(q);
> - rc = NET_XMIT_SUCCESS;
> } else {
> - rc = dev_qdisc_enqueue(skb, q, &to_free, txq);
> - if (qdisc_run_begin(q)) {
> - if (unlikely(contended)) {
> - spin_unlock(&q->busylock);
> - contended = false;
> - }
> - __qdisc_run(q);
> - qdisc_run_end(q);
> + llist_for_each_entry_safe(skb, next, ll_list, ll_node) {
> + prefetch(next);
> + dev_qdisc_enqueue(skb, q, &to_free, txq);
Now would be a good time to add batch support to some qdisc->enqueue()
implementations where this would help.
For instance, fq_enqueue() could take a single ktime_get() sample for the
whole batch.
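
Something along these lines, as a rough sketch only: neither
fq_enqueue_batch() nor fq_enqueue_prepared() exist today, the latter
standing for fq_enqueue() with its own clock sample hoisted out to the
caller.

static int fq_enqueue_batch(struct llist_node *ll_list, struct Qdisc *sch,
                            struct sk_buff **to_free)
{
        u64 now = ktime_get_ns();  /* one clock read for the whole batch */
        struct sk_buff *skb, *next;
        int rc = NET_XMIT_SUCCESS;

        llist_for_each_entry_safe(skb, next, ll_list, ll_node) {
                prefetch(next);
                /* Hypothetical helper: today's fq_enqueue() minus its own
                 * clock sample, with 'now' passed in instead.
                 */
                rc = fq_enqueue_prepared(skb, sch, to_free, now);
        }
        return rc;      /* error aggregation left aside in this sketch */
}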