Message-ID: <4fc5d598-606d-4053-887a-d9b23586e35a@kernel.org>
Date: Mon, 10 Nov 2025 16:04:10 +0100
From: Jesper Dangaard Brouer <hawk@...nel.org>
To: Eric Dumazet <edumazet@...gle.com>
Cc: "David S . Miller" <davem@...emloft.net>, Jakub Kicinski
 <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
 Simon Horman <horms@...nel.org>, Jamal Hadi Salim <jhs@...atatu.com>,
 Cong Wang <xiyou.wangcong@...il.com>, Jiri Pirko <jiri@...nulli.us>,
 Kuniyuki Iwashima <kuniyu@...gle.com>, Willem de Bruijn
 <willemb@...gle.com>, netdev@...r.kernel.org, eric.dumazet@...il.com,
 Toke Høiland-Jørgensen <toke@...hat.com>,
 kernel-team <kernel-team@...udflare.com>,
 Jesse Brandeburg <jbrandeburg@...udflare.com>
Subject: Re: [PATCH net] net_sched: limit try_bulk_dequeue_skb() batches



On 10/11/2025 12.06, Eric Dumazet wrote:
> On Mon, Nov 10, 2025 at 2:36 AM Jesper Dangaard Brouer <hawk@...nel.org> wrote:
>>
>>
>>
>> On 09/11/2025 17.12, Eric Dumazet wrote:
>>> After commit 100dfa74cad9 ("inet: dev_queue_xmit() llist adoption")
>>> I started seeing many qdisc requeues on IDPF under high TX workload.
>>>
>>> $ tc -s qd sh dev eth1 handle 1: ; sleep 1; tc -s qd sh dev eth1 handle 1:
>>> qdisc mq 1: root
>>>    Sent 43534617319319 bytes 268186451819 pkt (dropped 0, overlimits 0 requeues 3532840114)
>>>    backlog 1056Kb 6675p requeues 3532840114
>>> qdisc mq 1: root
>>>    Sent 43554665866695 bytes 268309964788 pkt (dropped 0, overlimits 0 requeues 3537737653)
>>>    backlog 781164b 4822p requeues 3537737653
>>>
>>> This is caused by try_bulk_dequeue_skb() being limited only by the BQL budget.
>>>
>>> perf record -C120-239 -e qdisc:qdisc_dequeue sleep 1 ; perf script
>>> ...
>>>    netperf 75332 [146]  2711.138269: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1292 skbaddr=0xff378005a1e9f200

To Jesse: see how Eric is using the qdisc:qdisc_dequeue tracepoint here.
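
If perf is unavailable, the same tracepoint can also be enabled directly
via tracefs; a minimal sketch (the mount point may be /sys/kernel/tracing
on newer kernels):

$ echo 1 > /sys/kernel/debug/tracing/events/qdisc/qdisc_dequeue/enable
$ cat /sys/kernel/debug/tracing/trace_pipe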

>>>    netperf 75332 [146]  2711.138953: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1213 skbaddr=0xff378004d607a500
>>>    netperf 75330 [144]  2711.139631: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1233 skbaddr=0xff3780046be20100
>>>    netperf 75333 [147]  2711.140356: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1093 skbaddr=0xff37800514845b00
>>>    netperf 75337 [151]  2711.141037: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1353 skbaddr=0xff37800460753300
>>>    netperf 75337 [151]  2711.141877: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1367 skbaddr=0xff378004e72c7b00
>>>    netperf 75330 [144]  2711.142643: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1202 skbaddr=0xff3780045bd60000
>>> ...
>>>
>>> This is bad because:
>>>
>>> 1) Large batches hold one victim cpu for a very long time.
>>>
>>> 2) Drivers often hit their own TX ring limit (all slots are used).
>>>
>>> 3) We call dev_requeue_skb().
>>>
>>> 4) Requeues are using a FIFO (q->gso_skb), breaking the qdisc's
>>>      ability to implement FQ or priority scheduling.
>>>
>>> 5) dequeue_skb() gets packets from q->gso_skb one skb at a time
>>>      with no xmit_more support. This causes many spinlock games
>>>      between the qdisc and the device driver.
>>>
>>> Requeues were supposed to be very rare; let's keep them this way.
>>>
>>> Limit batch sizes to /proc/sys/net/core/dev_weight (default 64), the
>>> quota that __qdisc_run() was designed to use.
>>>
>>> Fixes: 5772e9a3463b ("qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE")
>>> Signed-off-by: Eric Dumazet <edumazet@...gle.com>
>>> Cc: Jesper Dangaard Brouer <hawk@...nel.org>
>>> Cc: Toke Høiland-Jørgensen <toke@...hat.com>
>>> ---
>>>    net/sched/sch_generic.c | 17 ++++++++++-------
>>>    1 file changed, 10 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
>>> index d9a98d02a55fc361a223f3201e37b6a2b698bb5e..852e603c17551ee719bf1c561848d5ef0699ab5d 100644
>>> --- a/net/sched/sch_generic.c
>>> +++ b/net/sched/sch_generic.c
>>> @@ -180,9 +180,10 @@ static inline void dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
>>>    static void try_bulk_dequeue_skb(struct Qdisc *q,
>>>                                 struct sk_buff *skb,
>>>                                 const struct netdev_queue *txq,
>>> -                              int *packets)
>>> +                              int *packets, int budget)
>>>    {
>>>        int bytelimit = qdisc_avail_bulklimit(txq) - skb->len;
>>> +     int cnt = 0;
>>
>> Your patch makes perfect sense; we do want this budget limit.
>>
>> But why isn't the bytelimit saving us?
> 
> BQL can easily grow
> /sys/class/net/eth1/queues/tx-XXX/byte_queue_limits/limit to quite big
> values with multi-queue (MQ) high-speed devices.
> 
> Each TX queue is usually serviced round-robin (RR), meaning that some
> of them can build up a long-standing queue.
> 
> 
> tjbp26:/home/edumazet# ./super_netperf 200 -H tjbp27 -l 100 &
> [1] 198996
> 
> tjbp26:/home/edumazet# grep .
> /sys/class/net/eth1/queues/tx-*/byte_queue_limits/limit
> /sys/class/net/eth1/queues/tx-0/byte_queue_limits/limit:116826
> /sys/class/net/eth1/queues/tx-10/byte_queue_limits/limit:84534
> /sys/class/net/eth1/queues/tx-11/byte_queue_limits/limit:342924
> /sys/class/net/eth1/queues/tx-12/byte_queue_limits/limit:433302
> /sys/class/net/eth1/queues/tx-13/byte_queue_limits/limit:409254
> /sys/class/net/eth1/queues/tx-14/byte_queue_limits/limit:434112
> /sys/class/net/eth1/queues/tx-15/byte_queue_limits/limit:68304
> /sys/class/net/eth1/queues/tx-16/byte_queue_limits/limit:65610
> /sys/class/net/eth1/queues/tx-17/byte_queue_limits/limit:65772
> /sys/class/net/eth1/queues/tx-18/byte_queue_limits/limit:69822
> /sys/class/net/eth1/queues/tx-19/byte_queue_limits/limit:440634
> /sys/class/net/eth1/queues/tx-1/byte_queue_limits/limit:70308
> /sys/class/net/eth1/queues/tx-20/byte_queue_limits/limit:304824
> /sys/class/net/eth1/queues/tx-21/byte_queue_limits/limit:497856
> /sys/class/net/eth1/queues/tx-22/byte_queue_limits/limit:70308
> /sys/class/net/eth1/queues/tx-23/byte_queue_limits/limit:535408
> /sys/class/net/eth1/queues/tx-24/byte_queue_limits/limit:79419
> /sys/class/net/eth1/queues/tx-25/byte_queue_limits/limit:70170
> /sys/class/net/eth1/queues/tx-26/byte_queue_limits/limit:1595568
> /sys/class/net/eth1/queues/tx-27/byte_queue_limits/limit:579108
> /sys/class/net/eth1/queues/tx-28/byte_queue_limits/limit:430578
> /sys/class/net/eth1/queues/tx-29/byte_queue_limits/limit:647172
> /sys/class/net/eth1/queues/tx-2/byte_queue_limits/limit:345492
> /sys/class/net/eth1/queues/tx-30/byte_queue_limits/limit:612392
> /sys/class/net/eth1/queues/tx-31/byte_queue_limits/limit:344376
> /sys/class/net/eth1/queues/tx-3/byte_queue_limits/limit:154740
> /sys/class/net/eth1/queues/tx-4/byte_queue_limits/limit:60588
> /sys/class/net/eth1/queues/tx-5/byte_queue_limits/limit:71970
> /sys/class/net/eth1/queues/tx-6/byte_queue_limits/limit:70308
> /sys/class/net/eth1/queues/tx-7/byte_queue_limits/limit:695454
> /sys/class/net/eth1/queues/tx-8/byte_queue_limits/limit:101760
> /sys/class/net/eth1/queues/tx-9/byte_queue_limits/limit:65286
> 
> Then if we send many small packets in a row, limit/pkt_avg_len can go
> to arbitrary values.
> 

Thanks for sharing.

With these numbers it makes sense that the BQL bytelimit isn't limiting
this code much.

e.g. 1595568 bytes / 1500 MTU = 1063 packets.

Our prod machines also have large numbers:

$ grep -H . /sys/class/net/ext0/queues/tx-*/byte_queue_limits/limit |
sort -k2rn -t: | head -n 10
/sys/class/net/ext0/queues/tx-38/byte_queue_limits/limit:819432
/sys/class/net/ext0/queues/tx-95/byte_queue_limits/limit:766227
/sys/class/net/ext0/queues/tx-2/byte_queue_limits/limit:715412
/sys/class/net/ext0/queues/tx-66/byte_queue_limits/limit:692073
/sys/class/net/ext0/queues/tx-20/byte_queue_limits/limit:679817
/sys/class/net/ext0/queues/tx-61/byte_queue_limits/limit:647638
/sys/class/net/ext0/queues/tx-11/byte_queue_limits/limit:642212
/sys/class/net/ext0/queues/tx-10/byte_queue_limits/limit:615188
/sys/class/net/ext0/queues/tx-48/byte_queue_limits/limit:613745
/sys/class/net/ext0/queues/tx-80/byte_queue_limits/limit:584850
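
As a rough sketch of what such byte limits imply for bulk-dequeue batch
sizes (assuming ~1500-byte packets, as in the MTU arithmetic above):

$ for f in /sys/class/net/ext0/queues/tx-*/byte_queue_limits/limit; do
    printf '%s: ~%d pkts\n' "$f" $(( $(cat "$f") / 1500 ))
  done

e.g. the tx-38 limit above works out to ~546 packets per bulk dequeue.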

--Jesper

>>
>> Acked-by: Jesper Dangaard Brouer <hawk@...nel.org>
>>
>>>        while (bytelimit > 0) {
>>>                struct sk_buff *nskb = q->dequeue(q);
>>> @@ -193,8 +194,10 @@ static void try_bulk_dequeue_skb(struct Qdisc *q,
>>>                bytelimit -= nskb->len; /* covers GSO len */
>>>                skb->next = nskb;
>>>                skb = nskb;
>>> -             (*packets)++; /* GSO counts as one pkt */
>>> +             if (++cnt >= budget)
>>> +                     break;
>>>        }
>>> +     (*packets) += cnt;
>>>        skb_mark_not_on_list(skb);
>>>    }
>>>
>>> @@ -228,7 +231,7 @@ static void try_bulk_dequeue_skb_slow(struct Qdisc *q,
>>>     * A requeued skb (via q->gso_skb) can also be a SKB list.
>>>     */
>>>    static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate,
>>> -                                int *packets)
>>> +                                int *packets, int budget)
>>>    {
>>>        const struct netdev_queue *txq = q->dev_queue;
>>>        struct sk_buff *skb = NULL;
>>> @@ -295,7 +298,7 @@ static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate,
>>>        if (skb) {
>>>    bulk:
>>>                if (qdisc_may_bulk(q))
>>> -                     try_bulk_dequeue_skb(q, skb, txq, packets);
>>> +                     try_bulk_dequeue_skb(q, skb, txq, packets, budget);
>>>                else
>>>                        try_bulk_dequeue_skb_slow(q, skb, packets);
>>>        }
>>> @@ -387,7 +390,7 @@ bool sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
>>>     *                          >0 - queue is not empty.
>>>     *
>>>     */
>>> -static inline bool qdisc_restart(struct Qdisc *q, int *packets)
>>> +static inline bool qdisc_restart(struct Qdisc *q, int *packets, int budget)
>>>    {
>>>        spinlock_t *root_lock = NULL;
>>>        struct netdev_queue *txq;
>>> @@ -396,7 +399,7 @@ static inline bool qdisc_restart(struct Qdisc *q, int *packets)
>>>        bool validate;
>>>
>>>        /* Dequeue packet */
>>> -     skb = dequeue_skb(q, &validate, packets);
>>> +     skb = dequeue_skb(q, &validate, packets, budget);
>>>        if (unlikely(!skb))
>>>                return false;
>>>
>>> @@ -414,7 +417,7 @@ void __qdisc_run(struct Qdisc *q)
>>>        int quota = READ_ONCE(net_hotdata.dev_tx_weight);
>>>        int packets;
>>>
>>> -     while (qdisc_restart(q, &packets)) {
>>> +     while (qdisc_restart(q, &packets, quota)) {
>>>                quota -= packets;
>>>                if (quota <= 0) {
>>>                        if (q->flags & TCQ_F_NOLOCK)
>>

