Message-ID: <CAM0EoMk6CWor=djYMCj4hV+cAA52TFb7yh7RNLMHTiQjEjwEOw@mail.gmail.com>
Date: Mon, 17 Nov 2025 16:21:21 -0500
From: Jamal Hadi Salim <jhs@...atatu.com>
To: Eric Dumazet <edumazet@...gle.com>
Cc: kuba@...nel.org, davem@...emloft.net, pabeni@...hat.com, horms@...nel.org, 
	xiyou.wangcong@...il.com, jiri@...nulli.us, kuniyu@...gle.com, 
	willemb@...gle.com, netdev@...r.kernel.org, eric.dumazet@...il.com, 
	hawk@...nel.org, patchwork-bot+netdevbpf@...nel.org, toke@...hat.com
Subject: Re: [PATCH net] net_sched: limit try_bulk_dequeue_skb() batches

Hi Eric,
Sorry - was distracted.

On Fri, Nov 14, 2025 at 1:52 PM Eric Dumazet <edumazet@...gle.com> wrote:
>
> On Fri, Nov 14, 2025 at 9:13 AM Jamal Hadi Salim <jhs@...atatu.com> wrote:
> >
> > On Fri, Nov 14, 2025 at 11:36 AM Eric Dumazet <edumazet@...gle.com> wrote:
> > >
> > > On Fri, Nov 14, 2025 at 8:28 AM Jamal Hadi Salim <jhs@...atatu.com> wrote:
> > > >
> > > > On Thu, Nov 13, 2025 at 1:55 PM Jamal Hadi Salim <jhs@...atatu.com> wrote:
> > > > >
> > > > > On Thu, Nov 13, 2025 at 1:36 PM Eric Dumazet <edumazet@...gle.com> wrote:
> > > > > >
> > > > > > On Thu, Nov 13, 2025 at 10:30 AM Jamal Hadi Salim <jhs@...atatu.com> wrote:
> > > > > > >
> > > > > > > On Thu, Nov 13, 2025 at 1:08 PM Eric Dumazet <edumazet@...gle.com> wrote:
> > > > > > > >
> > > > > > > > On Thu, Nov 13, 2025 at 9:53 AM Jamal Hadi Salim <jhs@...atatu.com> wrote:
> > > > > > > > >
> > > > > > > > > [..]
> > > > > > > > > Eric,
> > > > > > > > >
> > > > > > > > > So you are correct that requeues exist even before your changes to
> > > > > > > > > speed up the tx path - two machines, one with a 6.5 and another with a
> > > > > > > > > 6.8 variant, exhibit this phenomenon with very low traffic... which got
> > > > > > > > > me a little curious.
> > > > > > > > > My initial thought was that perhaps it was related to the mq/fq_codel
> > > > > > > > > combo, but a short run shows requeues occur on a couple of other qdiscs
> > > > > > > > > (e.g., prio) and mq children (e.g., pfifo), which rules out fq_codel as
> > > > > > > > > a contributor to the requeues.
> > > > > > > > > For example, on this NUC I am typing on right now, after changing the
> > > > > > > > > root qdisc:
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > $ uname -r
> > > > > > > > > 6.8.0-87-generic
> > > > > > > > > $
> > > > > > > > > qdisc prio 8004: dev eno1 root refcnt 5 bands 8 priomap 1 2 2 2 1 2 0
> > > > > > > > > 0 1 1 1 1 1 1 1 1
> > > > > > > > >  Sent 360948039 bytes 1015807 pkt (dropped 0, overlimits 0 requeues 1528)
> > > > > > > > >  backlog 0b 0p requeues 1528
> > > > > > > > > ---
> > > > > > > > >
> > > > > > > > > and 20-30 seconds later:
> > > > > > > > > ---
> > > > > > > > > qdisc prio 8004: dev eno1 root refcnt 5 bands 8 priomap 1 2 2 2 1 2 0
> > > > > > > > > 0 1 1 1 1 1 1 1 1
> > > > > > > > >  Sent 361867275 bytes 1017386 pkt (dropped 0, overlimits 0 requeues 1531)
> > > > > > > > >  backlog 0b 0p requeues 1531
> > > > > > > > > ----
> > > > > > > > >
> > > > > > > > > Real cheap NIC doing 1G with 4 tx rings:
> > > > > > > > > ---
> > > > > > > > > $ ethtool -i eno1
> > > > > > > > > driver: igc
> > > > > > > > > version: 6.8.0-87-generic
> > > > > > > > > firmware-version: 1085:8770
> > > > > > > > > expansion-rom-version:
> > > > > > > > > bus-info: 0000:02:00.0
> > > > > > > > > supports-statistics: yes
> > > > > > > > > supports-test: yes
> > > > > > > > > supports-eeprom-access: yes
> > > > > > > > > supports-register-dump: yes
> > > > > > > > > supports-priv-flags: yes
> > > > > > > > >
> > > > > > > > > $ ethtool eno1
> > > > > > > > > Settings for eno1:
> > > > > > > > > Supported ports: [ TP ]
> > > > > > > > > Supported link modes:   10baseT/Half 10baseT/Full
> > > > > > > > >                         100baseT/Half 100baseT/Full
> > > > > > > > >                         1000baseT/Full
> > > > > > > > >                         2500baseT/Full
> > > > > > > > > Supported pause frame use: Symmetric
> > > > > > > > > Supports auto-negotiation: Yes
> > > > > > > > > Supported FEC modes: Not reported
> > > > > > > > > Advertised link modes:  10baseT/Half 10baseT/Full
> > > > > > > > >                         100baseT/Half 100baseT/Full
> > > > > > > > >                         1000baseT/Full
> > > > > > > > >                         2500baseT/Full
> > > > > > > > > Advertised pause frame use: Symmetric
> > > > > > > > > Advertised auto-negotiation: Yes
> > > > > > > > > Advertised FEC modes: Not reported
> > > > > > > > > Speed: 1000Mb/s
> > > > > > > > > Duplex: Full
> > > > > > > > > Auto-negotiation: on
> > > > > > > > > Port: Twisted Pair
> > > > > > > > > PHYAD: 0
> > > > > > > > > Transceiver: internal
> > > > > > > > > MDI-X: off (auto)
> > > > > > > > > netlink error: Operation not permitted
> > > > > > > > >         Current message level: 0x00000007 (7)
> > > > > > > > >                                drv probe link
> > > > > > > > > Link detected: yes
> > > > > > > > > ----
> > > > > > > > >
> > > > > > > > > Requeues should only happen if the driver is overwhelmed on the tx
> > > > > > > > > side - i.e., the chosen tx ring has no more space. Back in the day,
> > > > > > > > > this was not a very common event.
> > > > > > > > > That could be explained today by several factors: a) modern processors
> > > > > > > > > getting faster, b) the tx code path having become more efficient (true
> > > > > > > > > from inspection and your results, but those patches are not on my small
> > > > > > > > > systems), c) (unlikely, but) we are misaccounting requeues (need to
> > > > > > > > > look at the code), or d) the driver being too eager to return TX_BUSY.
> > > > > > > > >
> > > > > > > > > Thoughts?
> > > > > > > >
> > > > > > > > requeues can happen because some drivers do not use skb->len for the
> > > > > > > > BQL budget, but something bigger for GSO packets, since they want to
> > > > > > > > account for the (N) headers.
> > > > > > > >
> > > > > > > > So the core networking stack could pull too many packets from the
> > > > > > > > qdisc for one xmit_more batch,
> > > > > > > > then ndo_start_xmit() at some point stops the queue before the end of
> > > > > > > > the batch, because the BQL limit is hit sooner.
> > > > > > > >
> > > > > > > > I think drivers should not be overzealous; BQL is a best effort, and we
> > > > > > > > do not care about the extra headers.
> > > > > > > >
> > > > > > > > drivers/net/ethernet/intel/igc/igc_main.c is one of the overzealous drivers ;)
> > > > > > > >
> > > > > > > > igc_tso() ...
> > > > > > > >
> > > > > > > > /* update gso size and bytecount with header size */
> > > > > > > > first->gso_segs = skb_shinfo(skb)->gso_segs;
> > > > > > > > first->bytecount += (first->gso_segs - 1) * *hdr_len;
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Ok, the 25G i40e driver we are going to run tests on seems to be
> > > > > > > suffering from the same enthusiasm ;->
> > > > > > > I guess it is the same codebase..
> > > > > > > Very few drivers, though, seem to be doing what you suggest. Of course
> > > > > > > idpf is one of those ;->
> > > > > >
> > > > > > Note that few requeues are ok.
> > > > > >
> > > > > > In my case, I had 5 million requeues per second, and at that point
> > > > > > you start noticing something is wrong ;)
> > > > >
> > > > > That's high ;-> For the NUC with igc, it's <1%. Regardless, the
> > > > > eagerness for TX_BUSY implies reduced performance due to the early
> > > > > bailout..
> > > >
> > > > Ok, we are going to do some testing RSN; however, my ADHD won't let
> > > > this requeue thing go ;->
> > > >
> > > > So on at least i40e, when you start sending say >2Mpps (forwarding or
> > > > tc mirred redirect) - the TX_BUSY is most certainly not because the
> > > > driver is enthusiastically bailing out. Rather, this is due to BQL -
> > > > and specifically because netdev_tx_sent_queue() stops the queue;
> > > > subsequent packets from the stack will get the magic TX_BUSY label in
> > > > sch_direct_xmit().
> > > >
> > > > Some context:
> > > > For forwarding use case benchmarking, the typical idea is to use RSS
> > > > and IRQ binding to a specific CPU then craft some traffic patterns on
> > > > the sender so that the test machine has a very smooth distribution
> > > > across the different CPUs i.e goal is to have almost perfect load
> > > > balancing.
> > > >
> > > > Q: In that case, would the defer list ever accumulate more than one
> > > > packet? Gut feeling says no.
> > >
> > > It can accumulate way more, when there is a mix of very small packets
> > > and big TSO ones.
> > >
> > > If you had a lot of large TSO packets being sent, the queue's BQL limit
> > > can reach 600,000 easily.
> > > TX completion happens, the queue is empty, but the latched limit is
> > > 600,000, based on the last tx-completion round.
> > >
> > > Then you have small packets of 64 bytes being sent very fast (say from pktgen)
> > >
> > > 600,000 / 64 = 9375
> > >
> > > But most TX queues have a limit of 1024 or 2048 skbs... so they will
> > > stop the queue _before_ BQL does.
> > >
> >
> > Nice description, will check.
> > Remember, our use case is a middle box which receives pkts on one
> > netdev, does some processing, and sends to another. Essentially, we
> > pull from the rx ring of the src netdev, process, and send to the tx
> > ring of the other netdev. No batching or multiple CPUs funneling to one
> > txq, and very little if any TSO or locally generated traffic - and of
> > course the benchmark is on 64B pkts.
>
> One thing that many drivers get wrong is that they limit the number of
> packets that a napi_poll() can tx-complete at a time.
>

I suppose drivers these days do the replenishing at napi_poll() time -
but it could also be done in the tx path when a driver fails to get
space on the tx ring. I think at one point another strategy was to turn
on thresholds for TX completion interrupts, so that you get the trigger
to replenish - my gut feeling is that this last one was probably deemed
bad for performance.
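
Roughly the napi-time replenish pattern I have in mind, sketched with
made-up helpers (example_*, tx_desc_done(), tx_ring_pop_completed(),
ring_space() and WAKE_THRESHOLD are all hypothetical); only
netdev_tx_completed_queue(), netif_tx_queue_stopped(),
netif_tx_wake_queue() and napi_consume_skb() are the real kernel APIs:

---
/* Reclaim completed TX descriptors from napi poll, report the completed
 * packets/bytes to BQL, and restart the queue once there is room again.
 */
static bool example_clean_tx_irq(struct example_tx_ring *ring, int work_limit)
{
	struct netdev_queue *txq = ring->txq;	/* hypothetical field */
	unsigned int pkts = 0, bytes = 0;

	while (pkts < work_limit && tx_desc_done(ring)) {
		struct sk_buff *skb = tx_ring_pop_completed(ring);

		pkts++;
		bytes += skb->len;
		napi_consume_skb(skb, work_limit);
	}

	/* BQL consumer side: this re-opens the window that
	 * netdev_tx_sent_queue() closed on the xmit side.
	 */
	netdev_tx_completed_queue(txq, pkts, bytes);

	/* Wake the queue if it was stopped and enough space is back. */
	if (netif_tx_queue_stopped(txq) && ring_space(ring) > WAKE_THRESHOLD)
		netif_tx_wake_queue(txq);

	return pkts < work_limit;	/* true => nothing left to clean */
}
---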

> BQL was meant to adjust its limit based on the number of bytes per round,
> and the fact that the queue has been stopped (because of BQL limit) in
> the last round.
>

For our benchmarking I don't think BQL is adding much value - more below..

> So really, a driver must dequeue as many packets as possible.
>

IIUC, the i40e will replenish up to 256 descriptors, which is greater
than the default NAPI weight (64).
So it should be fine there?
Is 256 a good number for a weight of 64? I'm not sure how these
thresholds are chosen; is it a factor of the tx ring size (default of
512, so 256 is 50%) or is it based on the napi weight? The max tx/rx
ring size i40e can do is 8160; it defaults to 512/512.
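
The shape I believe these drivers follow, again as a hypothetical
example_* sketch (not i40e's actual code): TX reclaim runs against its
own work limit and does not eat into the NAPI budget; only the RX work
is reported via napi_complete_done(), which is the real API here along
with container_of():

---
static int example_napi_poll(struct napi_struct *napi, int budget)
{
	struct example_q_vector *qv =
		container_of(napi, struct example_q_vector, napi);
	bool tx_done;
	int rx_done;

	/* TX reclaim uses its own limit (e.g. 256) independent of the
	 * NAPI weight (e.g. 64) that bounds the RX pass below.
	 */
	tx_done = example_clean_tx_irq(qv->tx_ring, qv->tx_work_limit);

	rx_done = example_clean_rx_irq(qv->rx_ring, budget);

	/* Keep polling if either side still has work pending. */
	if (!tx_done || rx_done >= budget)
		return budget;

	if (napi_complete_done(napi, rx_done))
		example_enable_irqs(qv);	/* re-arm interrupts */

	return rx_done;
}
---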

It used to be that you knew your hardware and its limitations, and your
strategy of when and where to replenish was sometimes based on
experimentation.
i40e_clean_tx_irq() is entered on every napi poll, for example...

> Otherwise you may have spikes, even if your load is almost constant,
> when under stress.

I see.
So the trick is to use the max tx ring size and then increase the
weight for every replenish? We can set the TX ring to the max and
increase the replenish size to cover all of it..

> In fact I am a bit lost on what your problem is...

Well, it started with the observation that there are many requeues ;->
My initial thought was that the tx side was not keeping up. And then it
turned out that it was BQL that was causing the requeues.

A forwarding example:
--> rx ring x on eth0 --> tc mirred --> tx ring y on eth1 ("tc mirred"
could be replaced with forwarding/bridging)

As you can see, the rx softirq will likely run from napi poll all the
way to completion on the tx side. Our tests are for 16 flows which are
crafted to distribute nicely via RSS to hit 16 CPUs on 16 rx rings
(one per CPU), then forwarding to 16 tx rings. Each packet is 64B. That
way 16 CPUs are kept busy in parallel. If it all works out, there
should be zero lock contention on the tx side..

If we turn off BQL, there should be _zero_ requeues, which to my
thinking should make things faster..
We'll compare with/without BQL.
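
To spell out what I mean by BQL causing the requeues, here is a sketch
of the producer side with made-up example_* helpers (only
skb_get_tx_queue(), netif_tx_stop_queue() and netdev_tx_sent_queue() are
the real APIs). Once netdev_tx_sent_queue() stops the queue, the qdisc
layer does not even call the driver for the next packet -
sch_direct_xmit() sees the stopped txq and requeues, which is what shows
up in the stats:

---
static netdev_tx_t example_start_xmit(struct sk_buff *skb,
				      struct net_device *dev)
{
	struct netdev_queue *txq = skb_get_tx_queue(dev, skb);
	struct example_tx_ring *ring = example_pick_ring(dev, skb);

	if (example_ring_full(ring)) {
		/* Ring genuinely full: the "driver overwhelmed" case. */
		netif_tx_stop_queue(txq);
		return NETDEV_TX_BUSY;
	}

	example_map_and_post(ring, skb);

	/* BQL producer side: accounts skb->len (igc/i40e add the
	 * replicated GSO headers on top, the over-eager accounting
	 * discussed above) and stops the queue by itself once the
	 * current byte limit is exceeded - no full ring needed.
	 */
	netdev_tx_sent_queue(txq, skb->len);

	return NETDEV_TX_OK;
}
---

For the with/without comparison, IIRC we don't need a CONFIG_BQL=n
rebuild - pinning a huge value into
/sys/class/net/<dev>/queues/tx-*/byte_queue_limits/limit_min should be
enough to take BQL out of the picture.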

The motivation for all this was your work to add the defer list - I
would like to check whether we have a regression for forwarding
workloads.
In theory, for our benchmark, we should never be able to accumulate
more than one packet on that defer list (if ever), so all that extra
code is not going to be useful for us.
That is fine as long as the additional lines of code we are now hitting
don't cost us too much..

cheers,
jamal
