Message-ID: <CANn89iKb1a9-dS7g6zauBAechM6Ji7S1F0WAXtSAjYP34+QSqg@mail.gmail.com>
Date: Wed, 15 Oct 2025 15:11:21 -0700
From: Eric Dumazet <edumazet@...gle.com>
To: Jamal Hadi Salim <jhs@...atatu.com>
Cc: "David S . Miller" <davem@...emloft.net>, Jakub Kicinski <kuba@...nel.org>, 
	Paolo Abeni <pabeni@...hat.com>, Simon Horman <horms@...nel.org>, 
	Cong Wang <xiyou.wangcong@...il.com>, Jiri Pirko <jiri@...nulli.us>, 
	Kuniyuki Iwashima <kuniyu@...gle.com>, Willem de Bruijn <willemb@...gle.com>, netdev@...r.kernel.org, 
	eric.dumazet@...il.com
Subject: Re: [PATCH v2 net-next 0/6] net: optimize TX throughput and efficiency

On Wed, Oct 15, 2025 at 3:00 PM Jamal Hadi Salim <jhs@...atatu.com> wrote:
>
> On Tue, Oct 14, 2025 at 1:19 PM Eric Dumazet <edumazet@...gle.com> wrote:
> >
> > In this series, I replace the busylock spinlock we have in
> > __dev_queue_xmit() and use lockless list (llist) to reduce
> > spinlock contention to the minimum.
> >
> > Idea is that only one cpu might spin on the qdisc spinlock,
> > while others simply add their skb in the llist.
> >
> > After this series, we get a 300% (4x) efficiency improvement on heavy
> > TX workloads, sending twice the number of packets per second for half
> > the cpu cycles.
> >
>
> Not important, but I am curious: you didn't mention which NIC this was
> in the commit messages ;->

I have used two NICs: IDPF (200Gbit) and GQ (100Gbit Google NIC),
usually with 32 TX queues.

And a variety of platforms, up to 512 cores sharing these 32 TX queues.
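
Conceptually, the enqueue path becomes something like this (a rough
sketch only; "defer_list" is an invented name here, and the actual
patches handle ordering, return codes and qdisc state more carefully):

#include <linux/llist.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/spinlock.h>

static int sketch_xmit(struct sk_buff *skb, struct Qdisc *q,
		       struct llist_head *defer_list)
{
	/* Any cpu can publish its skb without taking a spinlock. */
	llist_add(&skb->ll_node, defer_list);

	/* At most one cpu wins the qdisc lock and drains the llist;
	 * the others return immediately instead of spinning on the
	 * old busylock.
	 */
	if (spin_trylock(&q->seqlock)) {
		struct llist_node *batch = llist_del_all(defer_list);

		/* llist is LIFO: restore FIFO order before enqueueing
		 * into the qdisc and running it as usual.
		 */
		batch = llist_reverse_order(batch);
		/* ... enqueue each skb, then __qdisc_run(q) ... */
		spin_unlock(&q->seqlock);
	}
	return NET_XMIT_SUCCESS;
}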

>
> For the patchset, I have done testing with the existing tdc tests and
> found no regressions.
> It does inspire new ideas for when time becomes available... so I will
> be doing more testing and likely small extensions, etc.
> So:
> Tested-by: Jamal Hadi Salim <jhs@...atatu.com>
> Acked-by: Jamal Hadi Salim <jhs@...atatu.com>

Thanks, Jamal!

I also have this idea:

CPUs serving NIC interrupts, and specifically TX completions, are often
also trapped restarting a busy qdisc (because it was stopped by BQL or
by the driver's own flow control).

We can do better:

1) In the TX completion loop, collect the skbs and do not free them immediately.

2) Store them in a private list and sum their skb->len while doing so.

3) Then call netdev_tx_completed_queue() and netif_tx_wake_queue().

If the queue was stopped, this might add the qdisc to our per-cpu
private list (sd->output_queue), raising NET_TX_SOFTIRQ (no immediate
action, because the napi poll runs while BHs are blocked).

4) Then, take care of all the dev_consume_skb_any() calls (see the
sketch after step 5).

Quite often, freeing these skbs can take a lot of time because of mm
contention, other false sharing, or an expensive skb->destructor (as
with TCP).

5) By the time net_tx_action() finally runs, perhaps another cpu saw
the queue in XON state and was able to push more packets to it.

This means net_tx_action() might have nothing to do, saving precious cycles.
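
A rough sketch of such a completion loop, following steps 1-4 above
(the foo_* names and FOO_WAKE_THRESHOLD are invented driver details;
only the netdev_tx_completed_queue() / netif_tx_wake_queue() /
dev_consume_skb_any() calls are the real APIs):

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Hypothetical driver bits, invented for this sketch. */
struct foo_tx_ring {
	struct netdev_queue *txq;
	/* ... ring state ... */
};

static struct sk_buff *foo_next_completed_skb(struct foo_tx_ring *ring);
static unsigned int foo_ring_space(const struct foo_tx_ring *ring);
#define FOO_WAKE_THRESHOLD 32

static void foo_clean_tx_irq(struct foo_tx_ring *ring)
{
	struct sk_buff *skb, *next, *head = NULL, **tail = &head;
	unsigned int pkts = 0, bytes = 0;

	/* 1) + 2): collect completed skbs on a private list and sum
	 * their lengths, without freeing anything yet.
	 */
	while ((skb = foo_next_completed_skb(ring)) != NULL) {
		bytes += skb->len;
		pkts++;
		*tail = skb;
		tail = &skb->next;
	}
	*tail = NULL;

	/* 3): update BQL and wake the queue first, so another cpu can
	 * start feeding it while we are still busy below.
	 */
	netdev_tx_completed_queue(ring->txq, pkts, bytes);
	if (foo_ring_space(ring) >= FOO_WAKE_THRESHOLD &&
	    netif_tx_queue_stopped(ring->txq))
		netif_tx_wake_queue(ring->txq);

	/* 4): only now pay the potentially expensive freeing cost
	 * (mm contention, skb->destructor such as TCP's, ...).
	 */
	for (skb = head; skb; skb = next) {
		next = skb->next;
		dev_consume_skb_any(skb);
	}
}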

We should add extra logic in net_tx_action() to avoid grabbing the
qdisc_lock() spinlock at all.

My thinking is to add back a sequence counter (to replace the
q->running boolean) and store a snapshot of this sequence every time we
restart a queue. net_tx_action() can then compare the current sequence
against the last snapshot.
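
Something like this, with all field and function names below invented
for illustration:

#include <linux/compiler.h>

struct qdisc_seq_sketch {
	unsigned long run_seq;		/* bumped at qdisc_run_end() time */
	unsigned long restart_seq;	/* snapshot taken when the queue
					 * is restarted (XOFF -> XON)
					 */
};

/* In net_tx_action(), before touching qdisc_lock(): */
static bool qdisc_needs_run(const struct qdisc_seq_sketch *q)
{
	/* If the sequence moved past our snapshot, another cpu already
	 * ran the qdisc after the restart, so we can skip the lock
	 * entirely.
	 */
	return READ_ONCE(q->run_seq) == READ_ONCE(q->restart_seq);
}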

> (For the tc bits, since the majority of the code touches tc-related stuff)
>
> cheers,
> jamal
>
>
> > v2: deflake tcp_user_timeout_user-timeout-probe.pkt.
> >     Ability to return a different code than NET_XMIT_SUCCESS
> >     when __dev_xmit_skb() has a single skb to send.
> >
> > Eric Dumazet (6):
> >   selftests/net: packetdrill: unflake
> >     tcp_user_timeout_user-timeout-probe.pkt
> >   net: add indirect call wrapper in skb_release_head_state()
> >   net/sched: act_mirred: add loop detection
> >   Revert "net/sched: Fix mirred deadlock on device recursion"
> >   net: sched: claim one cache line in Qdisc
> >   net: dev_queue_xmit() llist adoption
> >
> >  include/linux/netdevice_xmit.h                |  9 +-
> >  include/net/sch_generic.h                     | 23 ++---
> >  net/core/dev.c                                | 97 +++++++++++--------
> >  net/core/skbuff.c                             | 11 ++-
> >  net/sched/act_mirred.c                        | 62 +++++-------
> >  net/sched/sch_generic.c                       |  7 --
> >  .../tcp_user_timeout_user-timeout-probe.pkt   |  6 +-
> >  7 files changed, 111 insertions(+), 104 deletions(-)
> >
> > --
> > 2.51.0.788.g6d19910ace-goog
> >
