Message-ID: <CANn89iKb1a9-dS7g6zauBAechM6Ji7S1F0WAXtSAjYP34+QSqg@mail.gmail.com>
Date: Wed, 15 Oct 2025 15:11:21 -0700
From: Eric Dumazet <edumazet@...gle.com>
To: Jamal Hadi Salim <jhs@...atatu.com>
Cc: "David S . Miller" <davem@...emloft.net>, Jakub Kicinski <kuba@...nel.org>,
Paolo Abeni <pabeni@...hat.com>, Simon Horman <horms@...nel.org>,
Cong Wang <xiyou.wangcong@...il.com>, Jiri Pirko <jiri@...nulli.us>,
Kuniyuki Iwashima <kuniyu@...gle.com>, Willem de Bruijn <willemb@...gle.com>, netdev@...r.kernel.org,
eric.dumazet@...il.com
Subject: Re: [PATCH v2 net-next 0/6] net: optimize TX throughput and efficiency
On Wed, Oct 15, 2025 at 3:00 PM Jamal Hadi Salim <jhs@...atatu.com> wrote:
>
> On Tue, Oct 14, 2025 at 1:19 PM Eric Dumazet <edumazet@...gle.com> wrote:
> >
> > In this series, I replace the busylock spinlock we have in
> > __dev_queue_xmit() and use lockless list (llist) to reduce
> > spinlock contention to the minimum.
> >
> > Idea is that only one cpu might spin on the qdisc spinlock,
> > while others simply add their skb in the llist.
> >
> > After this series, we get a 300% (4x) improvement on heavy TX workloads,
> > sending twice the number of packets per second, for half the cpu cycles.
> >
>
> Not important but I am curious: you didn't mention which NIC this was in
> the commit messages ;->
I have used two NICs: IDPF (200Gbit) and GQ (100Gbit Google NIC),
usually with 32 TX queues,
and a variety of platforms, up to 512 cores sharing these 32 TX queues.
>
> For the patchset, I have done testing with the existing tdc tests and saw no
> regressions.
> It does inspire new things when time becomes available... so I will be
> doing more testing and likely small extensions etc.
> So:
> Tested-by: Jamal Hadi Salim <jhs@...atatu.com>
> Acked-by: Jamal Hadi Salim <jhs@...atatu.com>
Thanks Jamal !
I also have this idea:
CPUs serving NIC interrupts, and specifically TX completions, are often
also trapped into restarting a busy qdisc (stopped earlier by BQL
or by the driver's own flow control).
We can do better:
1) In the TX completion loop, collect the skbs and do not free them immediately.
2) Store them on a private list, summing their skb->len while doing so.
3) Then call netdev_tx_completed_queue() and netif_tx_wake_queue().
If the queue was stopped, this might add the qdisc to our per-cpu
private list (sd->output_queue), raising NET_TX_SOFTIRQ (no immediate
action, because napi poll runs while BHs are disabled).
4) Then take care of all the dev_consume_skb_any() calls.
Quite often freeing these skbs takes a lot of time, because of mm
contention, other false sharing, or an expensive skb->destructor
(as with TCP).
5) By the time net_tx_action() finally runs, perhaps another cpu saw
the queue in XON state and was able to push more packets to it.
This means net_tx_action() might have nothing to do, saving precious cycles.
We should add extra logic in net_tx_action() to avoid grabbing the
qdisc_lock() spinlock at all in that case.
My thinking is to add back a sequence counter (replacing the q->running
boolean), and to store a snapshot of this sequence every time we restart
a queue.
-> net_tx_action() can then compare the current sequence against the
last snapshot.
> (For the tc bits, since the majority of the code touches tc related stuff)
>
> cheers,
> jamal
>
>
> > v2: deflake tcp_user_timeout_user-timeout-probe.pkt.
> > Ability to return a different code than NET_XMIT_SUCCESS
> > when __dev_xmit_skb() has a single skb to send.
> >
> > Eric Dumazet (6):
> > selftests/net: packetdrill: unflake
> > tcp_user_timeout_user-timeout-probe.pkt
> > net: add indirect call wrapper in skb_release_head_state()
> > net/sched: act_mirred: add loop detection
> > Revert "net/sched: Fix mirred deadlock on device recursion"
> > net: sched: claim one cache line in Qdisc
> > net: dev_queue_xmit() llist adoption
> >
> > include/linux/netdevice_xmit.h | 9 +-
> > include/net/sch_generic.h | 23 ++---
> > net/core/dev.c | 97 +++++++++++--------
> > net/core/skbuff.c | 11 ++-
> > net/sched/act_mirred.c | 62 +++++-------
> > net/sched/sch_generic.c | 7 --
> > .../tcp_user_timeout_user-timeout-probe.pkt | 6 +-
> > 7 files changed, 111 insertions(+), 104 deletions(-)
> >
> > --
> > 2.51.0.788.g6d19910ace-goog
> >