Message-ID: <CANn89iJdK4e-5PCC3fzrC0=7NJm8yXZYcrMckS9oE1sZNmzPPw@mail.gmail.com>
Date: Thu, 13 Nov 2025 10:35:59 -0800
From: Eric Dumazet <edumazet@...gle.com>
To: Jamal Hadi Salim <jhs@...atatu.com>
Cc: kuba@...nel.org, davem@...emloft.net, pabeni@...hat.com, horms@...nel.org,
xiyou.wangcong@...il.com, jiri@...nulli.us, kuniyu@...gle.com,
willemb@...gle.com, netdev@...r.kernel.org, eric.dumazet@...il.com,
hawk@...nel.org, patchwork-bot+netdevbpf@...nel.org, toke@...hat.com
Subject: Re: [PATCH net] net_sched: limit try_bulk_dequeue_skb() batches
On Thu, Nov 13, 2025 at 10:30 AM Jamal Hadi Salim <jhs@...atatu.com> wrote:
>
> On Thu, Nov 13, 2025 at 1:08 PM Eric Dumazet <edumazet@...gle.com> wrote:
> >
> > On Thu, Nov 13, 2025 at 9:53 AM Jamal Hadi Salim <jhs@...atatu.com> wrote:
> > >
> > > [..]
> > > Eric,
> > >
> > > So you are correct that requeues exist even before your changes to
> > > speed up the tx path - two machines, one running a 6.5 and another a
> > > 6.8 variant, exhibit this phenomenon with very low traffic... which
> > > got me a little curious.
> > > My initial thought was that it was perhaps related to the mq/fq_codel
> > > combo, but a short run shows requeues occur on a couple of other
> > > qdiscs (e.g., prio) and mq children (e.g., pfifo), which rules out
> > > fq_codel as a contributor to the requeues.
> > > For example, on this NUC I am typing on right now, after changing the
> > > root qdisc:
> > >
> > > --
> > > $ uname -r
> > > 6.8.0-87-generic
> > > $
> > > qdisc prio 8004: dev eno1 root refcnt 5 bands 8 priomap 1 2 2 2 1 2 0
> > > 0 1 1 1 1 1 1 1 1
> > > Sent 360948039 bytes 1015807 pkt (dropped 0, overlimits 0 requeues 1528)
> > > backlog 0b 0p requeues 1528
> > > ---
> > >
> > > and 20-30 seconds later:
> > > ---
> > > qdisc prio 8004: dev eno1 root refcnt 5 bands 8 priomap 1 2 2 2 1 2 0
> > > 0 1 1 1 1 1 1 1 1
> > > Sent 361867275 bytes 1017386 pkt (dropped 0, overlimits 0 requeues 1531)
> > > backlog 0b 0p requeues 1531
> > > ----
> > >
> > > Reel cheep NIC doing 1G with 4 tx rings:
> > > ---
> > > $ ethtool -i eno1
> > > driver: igc
> > > version: 6.8.0-87-generic
> > > firmware-version: 1085:8770
> > > expansion-rom-version:
> > > bus-info: 0000:02:00.0
> > > supports-statistics: yes
> > > supports-test: yes
> > > supports-eeprom-access: yes
> > > supports-register-dump: yes
> > > supports-priv-flags: yes
> > >
> > > $ ethtool eno1
> > > Settings for eno1:
> > > Supported ports: [ TP ]
> > > Supported link modes: 10baseT/Half 10baseT/Full
> > > 100baseT/Half 100baseT/Full
> > > 1000baseT/Full
> > > 2500baseT/Full
> > > Supported pause frame use: Symmetric
> > > Supports auto-negotiation: Yes
> > > Supported FEC modes: Not reported
> > > Advertised link modes: 10baseT/Half 10baseT/Full
> > > 100baseT/Half 100baseT/Full
> > > 1000baseT/Full
> > > 2500baseT/Full
> > > Advertised pause frame use: Symmetric
> > > Advertised auto-negotiation: Yes
> > > Advertised FEC modes: Not reported
> > > Speed: 1000Mb/s
> > > Duplex: Full
> > > Auto-negotiation: on
> > > Port: Twisted Pair
> > > PHYAD: 0
> > > Transceiver: internal
> > > MDI-X: off (auto)
> > > netlink error: Operation not permitted
> > > Current message level: 0x00000007 (7)
> > > drv probe link
> > > Link detected: yes
> > > ----
> > >
> > > Requeues should only happen if the driver is overwhelmed on the tx
> > > side - i.e., the chosen tx ring has no more space. Back in the day,
> > > this was not a very common event.
> > > Seeing them today can certainly be explained several ways: a) modern
> > > processors are getting faster; b) the tx code path has become more
> > > efficient (true from inspection and your results, but those patches
> > > are not on my small systems); c) (unlikely, but) we are misaccounting
> > > requeues (need to look at the code); d) the driver is too eager to
> > > return TX_BUSY.
> > >
> > > Thoughts?
> >
> > Requeues can happen because some drivers do not use skb->len for the
> > BQL budget, but something bigger for GSO packets, because they want
> > to account for the (N) replicated headers.
> >
> > So the core networking stack can pull too many packets from the qdisc
> > for one xmit_more batch, and then ndo_start_xmit() at some point stops
> > the queue before the end of the batch, because the BQL limit is hit
> > sooner than the core expected.
> >
> > I think drivers should not be overzealous: BQL is best effort, we do
> > not care about the extra headers.
> >
> > drivers/net/ethernet/intel/igc/igc_main.c is one of the overzealous drivers ;)
> >
> > igc_tso() ...
> >
> > /* update gso size and bytecount with header size */
> > first->gso_segs = skb_shinfo(skb)->gso_segs;
> > first->bytecount += (first->gso_segs - 1) * *hdr_len;
> >
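To make that mismatch concrete, here is a toy user-space model of one
xmit_more batch (made-up sizes, and only a sketch of what
try_bulk_dequeue_skb() and the driver's BQL accounting do, not the real
kernel code):

/* Toy model: the qdisc side budgets the batch by skb->len against the
 * available BQL space, but the driver charges BQL with skb->len plus
 * the (gso_segs - 1) * hdr_len surcharge, so it fills BQL sooner,
 * stops the queue mid-batch, and the leftover skbs are requeued.
 * Assume nothing was in flight, so available space == BQL limit.
 */
#include <stdio.h>

#define DQL_AVAIL  100000  /* BQL space available at batch start (made up) */
#define SKB_LEN     10000  /* skb->len of each GSO packet (made up)        */
#define GSO_SEGS       44
#define HDR_LEN        66

int main(void)
{
	/* Qdisc side: keep pulling packets while an skb->len based
	 * budget of the available BQL space lasts.
	 */
	int batch = 1, bytelimit = DQL_AVAIL - SKB_LEN;

	while (bytelimit > 0) {
		bytelimit -= SKB_LEN;
		batch++;
	}

	/* Driver side: transmit the batch, but charge BQL with the
	 * inflated bytecount; once the limit is crossed the queue is
	 * stopped and the rest of the batch is requeued.
	 */
	int inflight = 0, sent = 0, requeued = 0;
	int driver_bytes = SKB_LEN + (GSO_SEGS - 1) * HDR_LEN;

	for (int i = 0; i < batch; i++) {
		if (inflight > DQL_AVAIL) {	/* queue already stopped */
			requeued++;
			continue;
		}
		inflight += driver_bytes;
		sent++;
	}
	printf("batch=%d sent=%d requeued=%d (driver charged %d, skb->len total %d)\n",
	       batch, sent, requeued, inflight, batch * SKB_LEN);
	return 0;
}

With skb->len charged on both sides the 10-packet batch would fit
exactly and nothing would be requeued; the per-packet header surcharge
is what stops the queue two packets early.
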
>
>
> Ok, the 25G i40e driver we are going to run tests on seems to be
> suffering from the same enthusiasm ;->
> I guess it is the same codebase...
> Very few drivers, though, seem to be doing what you suggest. Of course,
> idpf is one of those ;->
Note that a few requeues are ok.
In my case, I had 5 million requeues per second, and at that point
you start noticing something is wrong ;)
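For scale: the NUC snapshots above went from 1528 to 1531 requeues (+3)
and from 1015807 to 1017386 packets (+1579) in roughly 20-30 seconds,
i.e. on the order of 0.1 requeues/sec and about 0.2% of packets, which
is nowhere near the pathological case.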