Message-ID: <CANn89iLWsYDErNJNVhTOk7PfmMjV53kLa720RYXOBCu3gjvS=w@mail.gmail.com>
Date: Mon, 10 Nov 2025 05:26:58 -0800
From: Eric Dumazet <edumazet@...gle.com>
To: Toke Høiland-Jørgensen <toke@...hat.com>
Cc: Jonas Köppeler <j.koeppeler@...berlin.de>, 
	"David S . Miller" <davem@...emloft.net>, Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>, 
	Simon Horman <horms@...nel.org>, Jamal Hadi Salim <jhs@...atatu.com>, 
	Cong Wang <xiyou.wangcong@...il.com>, Jiri Pirko <jiri@...nulli.us>, 
	Kuniyuki Iwashima <kuniyu@...gle.com>, Willem de Bruijn <willemb@...gle.com>, netdev@...r.kernel.org, 
	eric.dumazet@...il.com
Subject: Re: [PATCH v1 net-next 5/5] net: dev_queue_xmit() llist adoption

On Mon, Nov 10, 2025 at 3:31 AM Toke Høiland-Jørgensen <toke@...hat.com> wrote:
>
> Eric Dumazet <edumazet@...gle.com> writes:
>
> > On Sun, Nov 9, 2025 at 12:18 PM Eric Dumazet <edumazet@...gle.com> wrote:
> >>
> >
> >> I think the issue is really about TCQ_F_ONETXQUEUE :
> >
> > dequeue_skb() can only dequeue 8 packets at a time, then has to
> > release the qdisc spinlock.
>
> So after looking at this a bit more, I think I understand more or less
> what's going on in the interaction between cake and your llist patch:
>
> Basically, the llist patch moves the bottleneck from qdisc enqueue to
> qdisc dequeue (in the setup we're testing, where the actual link
> speed is not itself a bottleneck). Before, enqueue contends with dequeue
> on the qdisc lock, meaning dequeue has no trouble keeping up, and the
> qdisc never fills up.
>
> With the llist patch, suddenly we're enqueueing a whole batch of packets
> every time we take the lock, which means that dequeue can no longer keep
> up, making it the bottleneck.
>
> The complete collapse in throughput comes from the way cake deals with
> unresponsive flows once the qdisc fills up: the BLUE part of its AQM
> will drive up its drop probability to 1, where it will stay until the
> flow responds (which, in this case, it never does).
>
> Turning off the BLUE algorithm prevents the throughput collapse; there's
> still a delta compared to a stock 6.17 kernel, which I think is because
> cake is simply quite inefficient at dropping packets in an overload
> situation. I'll experiment with a variant of the bulk dropping you
> introduced to fq_codel and see if that helps. We should probably also
> cap the drop probability of BLUE to something lower than 1.
>
> The patch you sent (below) does not in itself help anything, but
> lowering the constant to 8 instead of 256 does help. I'm not sure
> we want something that low, though; probably better to fix the behaviour
> of cake, no?
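
For reference, the BLUE side of cake's AQM is essentially a drop probability
that gets bumped on queue-full events and decayed again when the queue
drains; capping it below 1.0 would look roughly like the sketch below.
Names and constants are made up for illustration only -- cake's actual
implementation (cobalt) has its own fixed-point scheme.

#define PROB_ONE  0xFFFFFFFFU          /* fixed-point 1.0 */
#define PROB_CAP  (PROB_ONE / 4 * 3)   /* example cap, ~0.75 */
#define PROB_INC  (PROB_ONE / 256)     /* bump on queue-full events */
#define PROB_DEC  (PROB_ONE / 4096)    /* decay on queue-empty events */

/* called when an (unresponsive) flow overflows its queue */
static void blue_queue_full(u32 *p_drop)
{
        if (*p_drop < PROB_CAP - PROB_INC)
                *p_drop += PROB_INC;
        else
                *p_drop = PROB_CAP;     /* saturate below certain drop */
}

/* called when the queue drains, so the flow can recover */
static void blue_queue_empty(u32 *p_drop)
{
        if (*p_drop > PROB_DEC)
                *p_drop -= PROB_DEC;
        else
                *p_drop = 0;
}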

Presumably codel has a similar issue?

We can add to dequeue() a mechanism to queue skbs that need to be dropped,
so they can be freed after the spinlock and running bit are released.

We did something similar in 2016 for the enqueue part [1].

In 2025 this might be a bit more challenging because of eBPF qdiscs.

Instead of adding a new parameter, perhaps add a *tofree pointer in
'struct Qdisc'.
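
Very roughly, something like the sketch below -- all names are hypothetical,
just to show the shape of it: the dequeue()/AQM path chains skbs onto the
new pointer instead of calling kfree_skb() under the lock, and the caller
detaches the chain before unlocking and frees it once the lock and running
bit are released.

/* inside the qdisc, instead of kfree_skb(skb): */
static inline void qdisc_defer_drop(struct sk_buff *skb, struct Qdisc *q)
{
        skb->next = q->tofree;          /* 'tofree': hypothetical new field */
        q->tofree = skb;
}

/* caller side, around the qdisc run (simplified): */
spin_lock(root_lock);
qdisc_run(q);                           /* dequeue/AQM chains drops on q->tofree */
to_free = q->tofree;                    /* detach while still holding the lock */
q->tofree = NULL;
spin_unlock(root_lock);

if (unlikely(to_free))
        kfree_skb_list(to_free);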

I can work on a patch today.

[1]
commit 520ac30f45519b0a82dd92117c181d1d6144677b
Author: Eric Dumazet <edumazet@...gle.com>
Date:   Tue Jun 21 23:16:49 2016 -0700

    net_sched: drop packets after root qdisc lock is released

    Qdisc performance suffers when packets are dropped at enqueue()
    time because drops (kfree_skb()) are done while qdisc lock is held,
    delaying a dequeue() draining the queue.

    Nominal throughput can be reduced by 50 % when this happens,
    at a time we would like the dequeue() to proceed as fast as possible.

    Even FQ is vulnerable to this problem, while one of FQ goals was
    to provide some flow isolation.

    This patch adds a 'struct sk_buff **to_free' parameter to all
    qdisc->enqueue(), and in qdisc_drop() helper.

    I measured a performance increase of up to 12 %, but this patch
    is a prereq so that future batches in enqueue() can fly.

    Signed-off-by: Eric Dumazet <edumazet@...gle.com>
    Acked-by: Jesper Dangaard Brouer <brouer@...hat.com>
    Signed-off-by: David S. Miller <davem@...emloft.net>
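
For reference, the enqueue-side pattern that commit introduced looks roughly
like this (heavily simplified, not the exact upstream code): qdisc_drop()
chains the skb onto a caller-provided list instead of freeing it, and
__dev_xmit_skb() frees the whole chain only after the root lock is released.

/* include/net/sch_generic.h, simplified */
static inline void __qdisc_drop(struct sk_buff *skb, struct sk_buff **to_free)
{
        skb->next = *to_free;
        *to_free = skb;
}

/* __dev_xmit_skb(), simplified */
struct sk_buff *to_free = NULL;

spin_lock(root_lock);
rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
qdisc_run(q);
spin_unlock(root_lock);

if (unlikely(to_free))
        kfree_skb_list(to_free);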

>
> -Toke
>
> >> Perhaps we should not accept q->limit packets in the ll_list, but a
> >> much smaller limit.
> >
> > I will test something like this
> >
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index 69515edd17bc6a157046f31b3dd343a59ae192ab..e4187e2ca6324781216c073de2ec20626119327a 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -4185,8 +4185,12 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
> >         first_n = READ_ONCE(q->defer_list.first);
> >         do {
> >                 if (first_n && !defer_count) {
> > +                       unsigned long total;
> > +
> >                         defer_count = atomic_long_inc_return(&q->defer_count);
> > -                       if (unlikely(defer_count > q->limit)) {
> > +                       total = defer_count + READ_ONCE(q->q.qlen);
> > +
> > +                       if (unlikely(defer_count > 256 || total > READ_ONCE(q->limit))) {
> >                                 kfree_skb_reason(skb, SKB_DROP_REASON_QDISC_DROP);
> >                                 return NET_XMIT_DROP;
> >                         }
>
