Message-ID: <CANn89iKv7QcSqUjSVDSuZgn+tobBdaH8tszirY8nYm2C0Mk4UQ@mail.gmail.com>
Date: Fri, 14 Nov 2025 08:35:59 -0800
From: Eric Dumazet <edumazet@...gle.com>
To: Jamal Hadi Salim <jhs@...atatu.com>
Cc: kuba@...nel.org, davem@...emloft.net, pabeni@...hat.com, horms@...nel.org,
xiyou.wangcong@...il.com, jiri@...nulli.us, kuniyu@...gle.com,
willemb@...gle.com, netdev@...r.kernel.org, eric.dumazet@...il.com,
hawk@...nel.org, patchwork-bot+netdevbpf@...nel.org, toke@...hat.com
Subject: Re: [PATCH net] net_sched: limit try_bulk_dequeue_skb() batches
On Fri, Nov 14, 2025 at 8:28 AM Jamal Hadi Salim <jhs@...atatu.com> wrote:
>
> On Thu, Nov 13, 2025 at 1:55 PM Jamal Hadi Salim <jhs@...atatu.com> wrote:
> >
> > On Thu, Nov 13, 2025 at 1:36 PM Eric Dumazet <edumazet@...gle.com> wrote:
> > >
> > > On Thu, Nov 13, 2025 at 10:30 AM Jamal Hadi Salim <jhs@...atatu.com> wrote:
> > > >
> > > > On Thu, Nov 13, 2025 at 1:08 PM Eric Dumazet <edumazet@...gle.com> wrote:
> > > > >
> > > > > On Thu, Nov 13, 2025 at 9:53 AM Jamal Hadi Salim <jhs@...atatu.com> wrote:
> > > > > >
> > > > > > [..]
> > > > > > Eric,
> > > > > >
> > > > > > So you are correct that requeues exist even before your changes to
> > > > > > speed up the tx path - two machines, one with a 6.5 and another with
> > > > > > a 6.8 variant, exhibit this phenomenon with very low traffic... which
> > > > > > got me a little curious.
> > > > > > My initial thought was perhaps it was related to the mq/fq_codel combo
> > > > > > but a short run shows requeues occur on a couple of other qdiscs (e.g.,
> > > > > > prio) and mq children (e.g., pfifo), which rules out fq_codel as a
> > > > > > contributor to the requeues.
> > > > > > Example, this NUC i am typing on right now, after changing the root qdisc:
> > > > > >
> > > > > > --
> > > > > > $ uname -r
> > > > > > 6.8.0-87-generic
> > > > > > $
> > > > > > qdisc prio 8004: dev eno1 root refcnt 5 bands 8 priomap 1 2 2 2 1 2 0
> > > > > > 0 1 1 1 1 1 1 1 1
> > > > > > Sent 360948039 bytes 1015807 pkt (dropped 0, overlimits 0 requeues 1528)
> > > > > > backlog 0b 0p requeues 1528
> > > > > > ---
> > > > > >
> > > > > > and 20-30 seconds later:
> > > > > > ---
> > > > > > qdisc prio 8004: dev eno1 root refcnt 5 bands 8 priomap 1 2 2 2 1 2 0
> > > > > > 0 1 1 1 1 1 1 1 1
> > > > > > Sent 361867275 bytes 1017386 pkt (dropped 0, overlimits 0 requeues 1531)
> > > > > > backlog 0b 0p requeues 1531
> > > > > > ----
> > > > > >
> > > > > > Real cheap NIC doing 1G with 4 tx rings:
> > > > > > ---
> > > > > > $ ethtool -i eno1
> > > > > > driver: igc
> > > > > > version: 6.8.0-87-generic
> > > > > > firmware-version: 1085:8770
> > > > > > expansion-rom-version:
> > > > > > bus-info: 0000:02:00.0
> > > > > > supports-statistics: yes
> > > > > > supports-test: yes
> > > > > > supports-eeprom-access: yes
> > > > > > supports-register-dump: yes
> > > > > > supports-priv-flags: yes
> > > > > >
> > > > > > $ ethtool eno1
> > > > > > Settings for eno1:
> > > > > > Supported ports: [ TP ]
> > > > > > Supported link modes: 10baseT/Half 10baseT/Full
> > > > > > 100baseT/Half 100baseT/Full
> > > > > > 1000baseT/Full
> > > > > > 2500baseT/Full
> > > > > > Supported pause frame use: Symmetric
> > > > > > Supports auto-negotiation: Yes
> > > > > > Supported FEC modes: Not reported
> > > > > > Advertised link modes: 10baseT/Half 10baseT/Full
> > > > > > 100baseT/Half 100baseT/Full
> > > > > > 1000baseT/Full
> > > > > > 2500baseT/Full
> > > > > > Advertised pause frame use: Symmetric
> > > > > > Advertised auto-negotiation: Yes
> > > > > > Advertised FEC modes: Not reported
> > > > > > Speed: 1000Mb/s
> > > > > > Duplex: Full
> > > > > > Auto-negotiation: on
> > > > > > Port: Twisted Pair
> > > > > > PHYAD: 0
> > > > > > Transceiver: internal
> > > > > > MDI-X: off (auto)
> > > > > > netlink error: Operation not permitted
> > > > > > Current message level: 0x00000007 (7)
> > > > > > drv probe link
> > > > > > Link detected: yes
> > > > > > ----
> > > > > >
> > > > > > Requeues should only happen if the driver is overwhelmed on the tx
> > > > > > side - i.e. the chosen tx ring has no more space. Back in the day,
> > > > > > this was not a very common event.
> > > > > > That could certainly be explained today by several factors: a) modern
> > > > > > processors getting faster, b) the tx code path becoming more efficient
> > > > > > (true from inspection and your results, but those patches are not on
> > > > > > my small systems), c) (unlikely, but) we are misaccounting requeues
> > > > > > (need to look at the code), d) the driver being too eager to return
> > > > > > TX_BUSY.
> > > > > >
> > > > > > Thoughts?
> > > > >
> > > > > requeues can happen because some drivers do not use skb->len for the
> > > > > BQL budget, but something bigger for GSO packets,
> > > > > because they want to account for the (N) headers.
> > > > >
> > > > > So the core networking stack could pull too many packets from the
> > > > > qdisc for one xmit_more batch; then ndo_start_xmit() at some point
> > > > > stops the queue before the end of the batch, because the BQL limit
> > > > > is hit sooner.
> > > > >
> > > > > I think drivers should not be overzealous; BQL is best effort, we do
> > > > > not care about the extra headers.
> > > > >
> > > > > drivers/net/ethernet/intel/igc/igc_main.c is one of the overzealous drivers ;)
> > > > >
> > > > > igc_tso() ...
> > > > >
> > > > > /* update gso size and bytecount with header size */
> > > > > first->gso_segs = skb_shinfo(skb)->gso_segs;
> > > > > first->bytecount += (first->gso_segs - 1) * *hdr_len;
> > > > >
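For illustration, here is a tiny userspace model of that mismatch - not
kernel code, and the BQL limit / TSO size / header length are made-up
numbers. The batch is budgeted with skb->len, but each skb is charged to
BQL with the inflated bytecount, so the queue is stopped before the batch
completes and the tail of the batch is requeued:

#include <stdio.h>

#define BQL_LIMIT 600000u  /* hypothetical latched BQL limit, in bytes */
#define GSO_LEN    64000u  /* skb->len of one TSO skb                  */
#define GSO_SEGS      45u  /* roughly 64000 / 1448                     */
#define HDR_LEN       66u  /* Ethernet + IPv4 + TCP headers            */

int main(void)
{
	/* what an "overzealous" driver reports to BQL per skb */
	unsigned int charged = GSO_LEN + (GSO_SEGS - 1) * HDR_LEN;
	unsigned int pulled = 0, sent = 0, in_flight = 0;
	int budget = (int)BQL_LIMIT;   /* the stack's view: skb->len only */

	/* bulk dequeue: keep pulling while the skb->len budget has room */
	while (budget > 0) {
		budget -= (int)GSO_LEN;
		pulled++;
	}

	/* driver side: once the inflated in-flight bytes cross the same
	 * limit, the queue is stopped and the rest of the batch has to
	 * be requeued by the qdisc layer */
	for (unsigned int i = 0; i < pulled; i++) {
		sent++;
		in_flight += charged;
		if (in_flight > BQL_LIMIT)
			break;
	}

	printf("pulled %u, sent %u, requeued %u\n",
	       pulled, sent, pulled - sent);
	return 0;
}

(If each skb were charged with plain skb->len, the same run sends the
whole batch and nothing is requeued.)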
> > > >
> > > >
> > > > Ok, the 25G i40e driver we are going to run tests on seems to be
> > > > suffering from the same enthusiasm ;->
> > > > I guess it's the same codebase..
> > > > Very few drivers though seem to be doing what you suggest. Of course
> > > > idpf is one of those ;->
> > >
> > > Note that a few requeues are ok.
> > >
> > > In my case, I had 5 million requeues per second, and at that point
> > > you start noticing something is wrong ;)
> >
> > That's high ;-> For the NUC with igc, it's <1%. Regardless, the
> > eagerness for TX_BUSY implies reduced performance due to the early
> > bailout..
>
> Ok, we are going to do some testing RSN, however, my ADHD won't let
> this requeue thing go ;->
>
> So on at least i40e, when you start sending say >2Mpps (forwarding or
> tc mirred redirect), the TX_BUSY is most certainly not because the
> driver is enthusiastically bailing out. Rather, this is due to BQL -
> and specifically because netdev_tx_sent_queue() stops the queue;
> subsequent packets from the stack will get the magic TX_BUSY label in
> sch_direct_xmit().
>
> Some context:
> For forwarding use case benchmarking, the typical idea is to use RSS
> and IRQ binding to a specific CPU, then craft traffic patterns on the
> sender so that the test machine has a very smooth distribution across
> the different CPUs, i.e. the goal is to have almost perfect load
> balancing.
>
> Q: In that case, would the defer list ever accumulate more than one
> packet? Gut feeling says no.
It can accumulate way more, when there is a mix of very small packets
and big TSO ones.
If you have a lot of large TSO packets being sent, the queue's BQL limit
can reach 600,000 easily.
TX completion happens, the queue is empty, but the latched limit is
600,000, based on the last tx-completion round.
Then you have small packets of 64 bytes being sent very fast (say from
pktgen):
600,000 / 64 = 9375
But most TX queues have a limit of 1024 or 2048 skbs... so they will
stop the queue _before_ BQL does.
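A quick standalone sanity check of that arithmetic (the ring size below
is just a typical value, not taken from any particular NIC):

#include <stdio.h>

int main(void)
{
	const unsigned int bql_limit = 600000; /* latched from a TSO-heavy round  */
	const unsigned int pkt_len   = 64;     /* small packets, e.g. from pktgen */
	const unsigned int ring_size = 1024;   /* typical TX descriptor count     */

	unsigned int pkts_under_bql = bql_limit / pkt_len;

	printf("BQL would admit %u packets, the ring only %u, so the %s stops the queue first\n",
	       pkts_under_bql, ring_size,
	       pkts_under_bql > ring_size ? "ring" : "BQL limit");
	return 0;
}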
> Would have been nice to have an optional histogram to see the distribution..
bpftrace is a nice way to build histograms.
>
> cheers,
> jamal