Message-ID: <CAM0EoMm37h3Puh=vFxWqh1jFR3ByctOXiK86d=MjUMiQLB-z0Q@mail.gmail.com>
Date: Fri, 14 Nov 2025 11:28:28 -0500
From: Jamal Hadi Salim <jhs@...atatu.com>
To: Eric Dumazet <edumazet@...gle.com>
Cc: kuba@...nel.org, davem@...emloft.net, pabeni@...hat.com, horms@...nel.org,
xiyou.wangcong@...il.com, jiri@...nulli.us, kuniyu@...gle.com,
willemb@...gle.com, netdev@...r.kernel.org, eric.dumazet@...il.com,
hawk@...nel.org, patchwork-bot+netdevbpf@...nel.org, toke@...hat.com
Subject: Re: [PATCH net] net_sched: limit try_bulk_dequeue_skb() batches
On Thu, Nov 13, 2025 at 1:55 PM Jamal Hadi Salim <jhs@...atatu.com> wrote:
>
> On Thu, Nov 13, 2025 at 1:36 PM Eric Dumazet <edumazet@...gle.com> wrote:
> >
> > On Thu, Nov 13, 2025 at 10:30 AM Jamal Hadi Salim <jhs@...atatu.com> wrote:
> > >
> > > On Thu, Nov 13, 2025 at 1:08 PM Eric Dumazet <edumazet@...gle.com> wrote:
> > > >
> > > > On Thu, Nov 13, 2025 at 9:53 AM Jamal Hadi Salim <jhs@...atatu.com> wrote:
> > > > >
> > > > > [..]
> > > > > Eric,
> > > > >
> > > > > So you are correct that requeues exist even before your changes to
> > > > > speed up the tx path - two machines, one with a 6.5 and another with
> > > > > a 6.8 variant, exhibit this phenomenon with very low traffic... which
> > > > > got me a little curious.
> > > > > My initial thought was that perhaps it was related to the mq/fq_codel
> > > > > combo, but a short run shows requeues occur on a couple of other
> > > > > qdiscs (e.g., prio) and mq children (e.g., pfifo), which rules out
> > > > > fq_codel as a contributor to the requeues.
> > > > > For example, on this NUC I am typing on right now, after changing the
> > > > > root qdisc:
> > > > >
> > > > > --
> > > > > $ uname -r
> > > > > 6.8.0-87-generic
> > > > > $
> > > > > qdisc prio 8004: dev eno1 root refcnt 5 bands 8 priomap 1 2 2 2 1 2 0
> > > > > 0 1 1 1 1 1 1 1 1
> > > > > Sent 360948039 bytes 1015807 pkt (dropped 0, overlimits 0 requeues 1528)
> > > > > backlog 0b 0p requeues 1528
> > > > > ---
> > > > >
> > > > > and 20-30 seconds later:
> > > > > ---
> > > > > qdisc prio 8004: dev eno1 root refcnt 5 bands 8 priomap 1 2 2 2 1 2 0
> > > > > 0 1 1 1 1 1 1 1 1
> > > > > Sent 361867275 bytes 1017386 pkt (dropped 0, overlimits 0 requeues 1531)
> > > > > backlog 0b 0p requeues 1531
> > > > > ----
> > > > >
> > > > > Real cheap NIC doing 1G with 4 tx rings:
> > > > > ---
> > > > > $ ethtool -i eno1
> > > > > driver: igc
> > > > > version: 6.8.0-87-generic
> > > > > firmware-version: 1085:8770
> > > > > expansion-rom-version:
> > > > > bus-info: 0000:02:00.0
> > > > > supports-statistics: yes
> > > > > supports-test: yes
> > > > > supports-eeprom-access: yes
> > > > > supports-register-dump: yes
> > > > > supports-priv-flags: yes
> > > > >
> > > > > $ ethtool eno1
> > > > > Settings for eno1:
> > > > > Supported ports: [ TP ]
> > > > > Supported link modes: 10baseT/Half 10baseT/Full
> > > > > 100baseT/Half 100baseT/Full
> > > > > 1000baseT/Full
> > > > > 2500baseT/Full
> > > > > Supported pause frame use: Symmetric
> > > > > Supports auto-negotiation: Yes
> > > > > Supported FEC modes: Not reported
> > > > > Advertised link modes: 10baseT/Half 10baseT/Full
> > > > > 100baseT/Half 100baseT/Full
> > > > > 1000baseT/Full
> > > > > 2500baseT/Full
> > > > > Advertised pause frame use: Symmetric
> > > > > Advertised auto-negotiation: Yes
> > > > > Advertised FEC modes: Not reported
> > > > > Speed: 1000Mb/s
> > > > > Duplex: Full
> > > > > Auto-negotiation: on
> > > > > Port: Twisted Pair
> > > > > PHYAD: 0
> > > > > Transceiver: internal
> > > > > MDI-X: off (auto)
> > > > > netlink error: Operation not permitted
> > > > > Current message level: 0x00000007 (7)
> > > > > drv probe link
> > > > > Link detected: yes
> > > > > ----
> > > > >
> > > > > Requeues should only happen if the driver is overwhelmed on the tx
> > > > > side - i.e., the chosen tx ring has no more space. Back in the day,
> > > > > this was not a very common event.
> > > > > Several explanations could account for it today: a) modern processors
> > > > > are getting faster, b) the tx code path has become more efficient
> > > > > (true from inspection and your results, but those patches are not on
> > > > > my small systems), c) (unlikely, but) we are misaccounting requeues
> > > > > (need to look at the code), or d) the driver is too eager to return
> > > > > TX_BUSY.
> > > > >
> > > > > Thoughts?
> > > >
> > > > requeues can happen because some drivers do not use skb->len for the
> > > > BQL budget, but something bigger for GSO packets,
> > > > because they want to account for the (N) headers.
> > > >
> > > > So the core networking stack could pull too many packets from the
> > > > qdisc for one xmit_more batch,
> > > > then ndo_start_xmit() at some point stops the queue before the end of
> > > > the batch, because BQL limit is hit sooner.
> > > >
> > > > I think drivers should not be overzealous; BQL is best effort, we do
> > > > not care about the extra headers.
> > > >
> > > > drivers/net/ethernet/intel/igc/igc_main.c is one of the overzealous drivers ;)
> > > >
> > > > igc_tso() ...
> > > >
> > > > /* update gso size and bytecount with header size */
> > > > first->gso_segs = skb_shinfo(skb)->gso_segs;
> > > > first->bytecount += (first->gso_segs - 1) * *hdr_len;
> > > >
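A minimal sketch of the usual BQL pattern at the tail of an
ndo_start_xmit() handler may help here; example_xmit() and the driver
details are made up for illustration, only netdev_tx_sent_queue(),
netif_xmit_stopped(), netdev_xmit_more() and skb_get_tx_queue() are
real kernel helpers. A conservative driver charges BQL with plain
skb->len, the same unit try_bulk_dequeue_skb() subtracts from its byte
budget, whereas igc charges the inflated first->bytecount shown above:

---
#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Illustrative only; not a real driver. */
static netdev_tx_t example_xmit(struct sk_buff *skb, struct net_device *dev)
{
	struct netdev_queue *txq = skb_get_tx_queue(dev, skb);

	/* ... map the skb and post descriptors to the hardware ring ... */

	/* Charge BQL with plain skb->len, the same unit the qdisc used
	 * when it budgeted the bulk-dequeue batch.  Charging a bigger
	 * number (e.g. the replicated TSO headers added to bytecount)
	 * makes BQL stop the queue earlier than the stack expects, so
	 * the tail of the xmit_more batch comes back as a requeue.
	 */
	netdev_tx_sent_queue(txq, skb->len);

	/* A real driver rings the doorbell here when the batch ends,
	 * i.e. when !netdev_xmit_more() or when netif_xmit_stopped(txq).
	 */
	return NETDEV_TX_OK;
}
---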
> > >
> > >
> > > Ok, the 25G i40e driver we are going to run tests on seems to be
> > > suffering from the same enthusiasm ;->
> > > I guess it is the same codebase...
> > > Very few drivers, though, seem to be doing what you suggest. Of course
> > > idpf is one of those ;->
> >
> > Note that a few requeues are ok.
> >
> > In my case, I had 5 million requeues per second, and at that point
> > you start noticing something is wrong ;)
>
> That's high ;-> For the NUC with igc, it's <1%. Regardless, the
> eagerness to return TX_BUSY implies reduced performance due to the
> early bailout...
Ok, we are going to do some testing RSN; however, my ADHD won't let
this requeue thing go ;->
So on at least i40e, when you start sending say >2Mpps (forwarding or
tc mirred redirect), the TX_BUSY is most certainly not because the
driver is enthusiastically bailing out. Rather, it is due to BQL -
specifically because netdev_tx_sent_queue() stops the queue;
subsequent packets from the stack will get the magic TX_BUSY label in
sch_direct_xmit().
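To make the sch_direct_xmit() part concrete, here is a simplified
sketch of its relevant logic (from memory, not the verbatim
net/sched/sch_generic.c code; sketch_direct_xmit() is just an
illustrative name, and dev_requeue_skb() is internal to that file):

---
#include <linux/netdevice.h>
#include <net/sch_generic.h>

/* Simplified sketch; locking, validation and error paths elided. */
static bool sketch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
			       struct net_device *dev,
			       struct netdev_queue *txq)
{
	int ret = NETDEV_TX_BUSY;	/* default if the driver is never called */

	/* Only hand the skb to the driver if the txq is still running. */
	if (!netif_xmit_frozen_or_stopped(txq))
		skb = dev_hard_start_xmit(skb, dev, txq, &ret);

	if (!dev_xmit_complete(ret)) {
		/* txq was stopped, e.g. by BQL in netdev_tx_sent_queue():
		 * the skb goes back to the qdisc and the requeues counter
		 * seen in "tc -s qdisc" is bumped.
		 */
		dev_requeue_skb(skb, q);
		return false;
	}
	return true;
}
---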
Some context:
For forwarding use-case benchmarking, the typical idea is to use RSS
and bind each IRQ to a specific CPU, then craft traffic patterns on
the sender so that the test machine has a very smooth distribution
across the different CPUs, i.e., the goal is almost perfect load
balancing.
Q: In that case, would the defer list ever accumulate more than one
packet? Gut feeling says no.
It would have been nice to have an optional histogram to see the
distribution...
cheers,
jamal