Message-ID: <CAM0EoM=Z=eAMhSL44FT6jf1WMiX=nVuTyuNka8NMm+HRFPuhEg@mail.gmail.com>
Date: Tue, 18 Nov 2025 10:33:32 -0500
From: Jamal Hadi Salim <jhs@...atatu.com>
To: Eric Dumazet <edumazet@...gle.com>
Cc: kuba@...nel.org, davem@...emloft.net, pabeni@...hat.com, horms@...nel.org, 
	xiyou.wangcong@...il.com, jiri@...nulli.us, kuniyu@...gle.com, 
	willemb@...gle.com, netdev@...r.kernel.org, eric.dumazet@...il.com, 
	hawk@...nel.org, patchwork-bot+netdevbpf@...nel.org, toke@...hat.com
Subject: Re: [PATCH net] net_sched: limit try_bulk_dequeue_skb() batches

On Mon, Nov 17, 2025 at 4:21 PM Jamal Hadi Salim <jhs@...atatu.com> wrote:
>
> Hi Eric,
> Sorry - was distracted.
>
> On Fri, Nov 14, 2025 at 1:52 PM Eric Dumazet <edumazet@...gle.com> wrote:
> >
> > On Fri, Nov 14, 2025 at 9:13 AM Jamal Hadi Salim <jhs@...atatu.com> wrote:
> > >
> > > On Fri, Nov 14, 2025 at 11:36 AM Eric Dumazet <edumazet@...gle.com> wrote:
> > > >
> > > > On Fri, Nov 14, 2025 at 8:28 AM Jamal Hadi Salim <jhs@...atatu.com> wrote:
> > > > >
> > > > > On Thu, Nov 13, 2025 at 1:55 PM Jamal Hadi Salim <jhs@...atatu.com> wrote:
> > > > > >
> > > > > > On Thu, Nov 13, 2025 at 1:36 PM Eric Dumazet <edumazet@...gle.com> wrote:
> > > > > > >
> > > > > > > On Thu, Nov 13, 2025 at 10:30 AM Jamal Hadi Salim <jhs@...atatu.com> wrote:
> > > > > > > >
> > > > > > > > On Thu, Nov 13, 2025 at 1:08 PM Eric Dumazet <edumazet@...gle.com> wrote:
> > > > > > > > >
> > > > > > > > > On Thu, Nov 13, 2025 at 9:53 AM Jamal Hadi Salim <jhs@...atatu.com> wrote:
> > > > > > > > > >
> > > > > > > > > > [..]
> > > > > > > > > > Eric,
> > > > > > > > > >
> > > > > > > > > > So you are correct that requeues exist even before your changes to
> > > > > > > > > > speed up the tx path - two machines, one running a 6.5 and another a
> > > > > > > > > > 6.8 variant, exhibit this phenomenon with very low traffic... which
> > > > > > > > > > got me a little curious.
> > > > > > > > > > My initial thought was that it was related to the mq/fq_codel combo,
> > > > > > > > > > but a short run shows requeues occur on a couple of other qdiscs
> > > > > > > > > > (e.g., prio) and mq children (e.g., pfifo), which rules out fq_codel
> > > > > > > > > > as a contributor to the requeues.
> > > > > > > > > > Example, this NUC i am typing on right now, after changing the root qdisc:
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > $ uname -r
> > > > > > > > > > 6.8.0-87-generic
> > > > > > > > > > $
> > > > > > > > > > qdisc prio 8004: dev eno1 root refcnt 5 bands 8 priomap 1 2 2 2 1 2 0
> > > > > > > > > > 0 1 1 1 1 1 1 1 1
> > > > > > > > > >  Sent 360948039 bytes 1015807 pkt (dropped 0, overlimits 0 requeues 1528)
> > > > > > > > > >  backlog 0b 0p requeues 1528
> > > > > > > > > > ---
> > > > > > > > > >
> > > > > > > > > > and 20-30  seconds later:
> > > > > > > > > > ---
> > > > > > > > > > qdisc prio 8004: dev eno1 root refcnt 5 bands 8 priomap 1 2 2 2 1 2 0
> > > > > > > > > > 0 1 1 1 1 1 1 1 1
> > > > > > > > > >  Sent 361867275 bytes 1017386 pkt (dropped 0, overlimits 0 requeues 1531)
> > > > > > > > > >  backlog 0b 0p requeues 1531
> > > > > > > > > > ----
> > > > > > > > > >
> > > > > > > > > > Reel cheep NIC doing 1G with 4 tx rings:
> > > > > > > > > > ---
> > > > > > > > > > $ ethtool -i eno1
> > > > > > > > > > driver: igc
> > > > > > > > > > version: 6.8.0-87-generic
> > > > > > > > > > firmware-version: 1085:8770
> > > > > > > > > > expansion-rom-version:
> > > > > > > > > > bus-info: 0000:02:00.0
> > > > > > > > > > supports-statistics: yes
> > > > > > > > > > supports-test: yes
> > > > > > > > > > supports-eeprom-access: yes
> > > > > > > > > > supports-register-dump: yes
> > > > > > > > > > supports-priv-flags: yes
> > > > > > > > > >
> > > > > > > > > > $ ethtool eno1
> > > > > > > > > > Settings for eno1:
> > > > > > > > > > Supported ports: [ TP ]
> > > > > > > > > > Supported link modes:   10baseT/Half 10baseT/Full
> > > > > > > > > >                         100baseT/Half 100baseT/Full
> > > > > > > > > >                         1000baseT/Full
> > > > > > > > > >                         2500baseT/Full
> > > > > > > > > > Supported pause frame use: Symmetric
> > > > > > > > > > Supports auto-negotiation: Yes
> > > > > > > > > > Supported FEC modes: Not reported
> > > > > > > > > > Advertised link modes:  10baseT/Half 10baseT/Full
> > > > > > > > > >                         100baseT/Half 100baseT/Full
> > > > > > > > > >                         1000baseT/Full
> > > > > > > > > >                         2500baseT/Full
> > > > > > > > > > Advertised pause frame use: Symmetric
> > > > > > > > > > Advertised auto-negotiation: Yes
> > > > > > > > > > Advertised FEC modes: Not reported
> > > > > > > > > > Speed: 1000Mb/s
> > > > > > > > > > Duplex: Full
> > > > > > > > > > Auto-negotiation: on
> > > > > > > > > > Port: Twisted Pair
> > > > > > > > > > PHYAD: 0
> > > > > > > > > > Transceiver: internal
> > > > > > > > > > MDI-X: off (auto)
> > > > > > > > > > netlink error: Operation not permitted
> > > > > > > > > >         Current message level: 0x00000007 (7)
> > > > > > > > > >                                drv probe link
> > > > > > > > > > Link detected: yes
> > > > > > > > > > ----
> > > > > > > > > >
> > > > > > > > > > Requeues should only happen if the driver is overwhelmed on the tx
> > > > > > > > > > side - i.e., the chosen tx ring has no more space. Back in the day,
> > > > > > > > > > this was not a very common event.
> > > > > > > > > > There are several possible explanations for seeing it today: a)
> > > > > > > > > > modern processors are getting faster; b) the tx code path has become
> > > > > > > > > > more efficient (true from inspection and your results, but those
> > > > > > > > > > patches are not on my small systems); c) (unlikely, but) we are
> > > > > > > > > > misaccounting requeues (need to look at the code); d) the driver is
> > > > > > > > > > too eager to return TX_BUSY.
> > > > > > > > > >
> > > > > > > > > > Thoughts?
> > > > > > > > >
> > > > > > > > > requeues can happen because some drivers do not use skb->len for the
> > > > > > > > > BQL budget, but something bigger for GSO packets,
> > > > > > > > > because they want to account for the (N) headers.
> > > > > > > > >
> > > > > > > > > So the core networking stack could pull too many packets from the
> > > > > > > > > qdisc for one xmit_more batch,
> > > > > > > > > then ndo_start_xmit() at some point stops the queue before the end of
> > > > > > > > > the batch, because the BQL limit is hit sooner.
> > > > > > > > >
> > > > > > > > > I think drivers should not be overzealous; BQL is best effort, and we
> > > > > > > > > do not care about the extra headers.
> > > > > > > > >
> > > > > > > > > drivers/net/ethernet/intel/igc/igc_main.c is one of the overzealous drivers ;)
> > > > > > > > >
> > > > > > > > > igc_tso() ...
> > > > > > > > >
> > > > > > > > > /* update gso size and bytecount with header size */
> > > > > > > > > first->gso_segs = skb_shinfo(skb)->gso_segs;
> > > > > > > > > first->bytecount += (first->gso_segs - 1) * *hdr_len;
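> > > > > > > > >
> > > > > > > > > Roughly, the two accounting choices look like this (a sketch; only
> > > > > > > > > netdev_tx_sent_queue() and the skb fields are real kernel API, the
> > > > > > > > > surrounding driver context is hypothetical):
> > > > > > > > >
> > > > > > > > > /* relaxed: charge BQL with exactly what the qdisc dequeued */
> > > > > > > > > netdev_tx_sent_queue(txq, skb->len);
> > > > > > > > >
> > > > > > > > > /* overzealous: also charge the (gso_segs - 1) replicated headers,
> > > > > > > > >  * so the BQL limit fills up faster than the qdisc-side byte count
> > > > > > > > >  */
> > > > > > > > > bytecount = skb->len + (skb_shinfo(skb)->gso_segs - 1) * hdr_len;
> > > > > > > > > netdev_tx_sent_queue(txq, bytecount);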
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > Ok, the 25G i40e driver we are going to run tests on seems to be
> > > > > > > > suffering from the same enthusiasm ;->
> > > > > > > > I guess it is the same codebase..
> > > > > > > > Very few drivers though seem to be doing what you suggest - idpf of
> > > > > > > > course being one of them ;->
> > > > > > >
> > > > > > > Note that few requeues are ok.
> > > > > > >
> > > > > > > In my case, I had 5 million requeues per second, and at that point
> > > > > > > you start noticing something is wrong ;)
> > > > > >
> > > > > > That's high ;-> For the NUC with igc, it's <1%. Regardless, the
> > > > > > eagerness to return TX_BUSY implies reduced performance due to the
> > > > > > early bailout..
> > > > >
> > > > > Ok, we are going to do some testing RSN; however, my ADHD won't let
> > > > > this requeue thing go ;->
> > > > >
> > > > > So on at least i40e, when you start sending say >2Mpps (forwarding or
> > > > > tc mirred redirect), the TX_BUSY is most certainly not because the
> > > > > driver is enthusiastically bailing out. Rather, it is due to BQL -
> > > > > specifically, netdev_tx_sent_queue() stops the queue, and subsequent
> > > > > packets from the stack get the magic TX_BUSY label in
> > > > > sch_direct_xmit().
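> > > > >
> > > > > (Sketch of the flow I mean - netdev_tx_sent_queue() and
> > > > > netdev_tx_completed_queue() are the real core helpers, the driver
> > > > > context around them is hypothetical:)
> > > > >
> > > > > /* driver xmit path: BQL is charged here, and netdev_tx_sent_queue()
> > > > >  * itself stops the txq once the byte limit is reached
> > > > >  */
> > > > > netdev_tx_sent_queue(txq, skb->len);
> > > > >
> > > > > /* driver tx-completion path: releasing bytes lets BQL wake the txq */
> > > > > netdev_tx_completed_queue(txq, pkts, bytes);
> > > > >
> > > > > /* while the txq is stopped, packets the qdisc layer has already
> > > > >  * dequeued get requeued - that is the counter we are staring at
> > > > >  */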
> > > > >
> > > > > Some context:
> > > > > For forwarding use case benchmarking, the typical idea is to use RSS
> > > > > and IRQ binding to specific CPUs, then craft traffic patterns on the
> > > > > sender so that the test machine gets a very smooth distribution
> > > > > across the different CPUs, i.e., the goal is to have almost perfect
> > > > > load balancing.
> > > > >
> > > > > Q: In that case, would the defer list ever accumulate more than one
> > > > > packet? Gut feeling says no.
> > > >
> > > > It can accumulate way more when there is a mix of very small packets
> > > > and big TSO ones.
> > > >
> > > > If you had a lot of large TSO packets being sent, the queue's BQL limit
> > > > can reach 600,000 easily.
> > > > TX completion happens, queue is empty, but latched limit is 600,000
> > > > based on the last tx-completion round.
> > > >
> > > > Then you have small packets of 64 bytes being sent very fast (say from pktgen)
> > > >
> > > > 600,000 / 64 = 9375
> > > >
> > > > But most TX queues have a limit of 1024 or 2048 skbs... so they will
> > > > stop the queue _before_ BQL does.
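> > > >
> > > > (Spelled out: the number of packets admitted before the queue stops is
> > > > roughly min(latched_limit / pkt_len, ring_size) = min(600,000 / 64, 1024)
> > > > = min(9375, 1024) = 1024, i.e. the ring descriptors run out long before
> > > > the stale byte limit would be reached.)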
> > > >
> > >
> > > Nice description, will check.
> > > Remember, our use case is a middle box which receives pkts on one
> > > netdev, does some processing, and sends them to another. Essentially, we
> > > pull from the rx ring of the src netdev, process, and send to the tx
> > > ring of the other netdev. No batching or multiple CPUs funneling into
> > > one txq, and very little if any TSO or locally generated traffic - and
> > > of course the benchmark is on 64B pkts.
> >
> > One thing that many drivers get wrong is that they limit the number of
> > packets that a napi_poll() can tx-complete at a time.
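> >
> > (A sketch of the poll structure I mean; my_clean_tx()/my_clean_rx() are
> > hypothetical driver helpers, napi_complete_done() is the real core API:)
> >
> > static int my_napi_poll(struct napi_struct *napi, int budget)
> > {
> >         int work_done;
> >
> >         /* TX completion should not be capped by the small RX budget:
> >          * reclaim as much of the TX ring as possible every round.
> >          */
> >         my_clean_tx(napi);
> >
> >         /* RX processing honours the NAPI budget as usual. */
> >         work_done = my_clean_rx(napi, budget);
> >
> >         if (work_done < budget)
> >                 napi_complete_done(napi, work_done);
> >
> >         return work_done;
> > }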
> >
>
> I suppose drivers these days do the replenishing at napi_poll() time -
> but it could also be done in the tx path when a driver fails to get
> space on the tx ring. I think at one point another strategy was to turn
> on thresholds for TX completion interrupts, so you get the trigger to
> replenish - my gut feeling is that this last one was probably deemed bad
> for performance.
>
> > BQL was meant to adjust its limit based on the number of bytes per round,
> > and the fact that the queue has been stopped (because of BQL limit) in
> > the last round.
> >
>
> For our benchmarking I don't think BQL is adding much value - more below..
>
> > So really, a driver must dequeue as many packets as possible.
> >
>
> IIUC, the i40e will replenish up to 256 descriptors, which is greater
> than the default NAPI weight (64).
> So it should be fine there?
> Is 256 a good number for a weight of 64? I'm not sure how these
> thresholds are chosen; is it a factor of the tx ring size (default of
> 512, so 256 is 50%) or is it based on the napi weight? The max tx/rx
> ring size i40e can do is 8160; it defaults to 512/512.
>
> It used to be that you knew your hardware and its limitations, and your
> strategy of when and where to replenish was sometimes based on
> experimentation.
> i40e_clean_tx_irq() is entered on every napi poll for example...
>
> > Otherwise you may have spikes, even if your load is almost constant,
> > when under stress.
>
> I see.
> So the trick is to use the max tx ring size and then increase the weight
> for every replenish? We can set the TX ring to the max and increase the
> replenish size to cover all of it..
>
> > In fact I am a bit lost on what your problem is...
>
> Well, it started with the observation that there are many requeues ;->
> My initial thought was that the tx side was not keeping up. Then it
> turned out that it was BQL that was causing the requeues.
>
> A forwarding example:
> --> rx ring x on eth0 --> tc mirred --> tx ring y on eth1 ("tc mirred"
> could be replaced with forwarding/bridging)
>
> As you can see, the rx softirq will likely run from napi poll all the
> way to completion on the tx side. Our tests use 16 flows which are
> crafted to distribute nicely via RSS and hit 16 CPUs on 16 rx rings
> (one per CPU), then forward to 16 tx rings. Each packet is 64B. That
> way 16 CPUs are kept busy in parallel. If it all works out, there
> should be zero lock contention on the tx side..
>
> If we turn off BQL, there should be _zero_ requeues, which in my
> thinking should make things faster..
> We'll compare with/without BQL.
>
> Motivation for all this was your work to add the defer list - I would
> like to check if we have a regression for forwarding workloads.
> In theory, for our benchmark, we should never be able to accumulate
> more than one packet on that defer list (if ever), so all that extra
> code is not going to be useful for us.
> That is fine as long as the additional lines of code we are now
> hitting don't affect us too much..
>

And the summary is: there's a very tiny regression, but it is not
really noticeable. The test was in the 10Mpps+ range, and although the
difference is consistent/repeatable, it falls within the margin of error
of 1-2Kpps.
So, it should be fine..
Still interested in the questions I asked though ;->

cheers,
jamal
