Message-ID: <87ms4ulz7q.fsf@toke.dk>
Date: Mon, 10 Nov 2025 15:49:29 +0100
From: Toke Høiland-Jørgensen <toke@...hat.com>
To: Eric Dumazet <edumazet@...gle.com>
Cc: Jonas Köppeler <j.koeppeler@...berlin.de>,
	"David S. Miller" <davem@...emloft.net>, Jakub Kicinski <kuba@...nel.org>,
	Paolo Abeni <pabeni@...hat.com>, Simon Horman <horms@...nel.org>,
	Jamal Hadi Salim <jhs@...atatu.com>, Cong Wang <xiyou.wangcong@...il.com>,
	Jiri Pirko <jiri@...nulli.us>, Kuniyuki Iwashima <kuniyu@...gle.com>,
	Willem de Bruijn <willemb@...gle.com>, netdev@...r.kernel.org,
	eric.dumazet@...il.com
Subject: Re: [PATCH v1 net-next 5/5] net: dev_queue_xmit() llist adoption

Eric Dumazet <edumazet@...gle.com> writes:
> On Mon, Nov 10, 2025 at 3:31 AM Toke Høiland-Jørgensen <toke@...hat.com> wrote:
>>
>> Eric Dumazet <edumazet@...gle.com> writes:
>>
>> > On Sun, Nov 9, 2025 at 12:18 PM Eric Dumazet <edumazet@...gle.com> wrote:
>> >>
>> >
>> >> I think the issue is really about TCQ_F_ONETXQUEUE :
>> >
>> > dequeue_skb() can only dequeue 8 packets at a time, then has to
>> > release the qdisc spinlock.
>>
>> So after looking at this a bit more, I think I understand more or less
>> what's going on in the interaction between cake and your llist patch:
>>
>> Basically, the llist patch moves the bottleneck from qdisc enqueue to
>> qdisc dequeue (in this setup that we're testing where the actual link
>> speed is not itself a bottleneck). Before, enqueue contends with dequeue
>> on the qdisc lock, meaning dequeue has no trouble keeping up, and the
>> qdisc never fills up.
>>
>> With the llist patch, suddenly we're enqueueing a whole batch of packets
>> every time we take the lock, which means that dequeue can no longer keep
>> up, making it the bottleneck.
>>
>> The complete collapse in throughput comes from the way cake deals with
>> unresponsive flows once the qdisc fills up: the BLUE part of its AQM
>> will drive up its drop probability to 1, where it will stay until the
>> flow responds (which, in this case, it never does).
>>
>> Turning off the BLUE algorithm prevents the throughput collapse; there's
>> still a delta compared to a stock 6.17 kernel, which I think is because
>> cake is simply quite inefficient at dropping packets in an overload
>> situation. I'll experiment with a variant of the bulk dropping you
>> introduced to fq_codel and see if that helps. We should probably also
>> cap the drop probability of BLUE to something lower than 1.
>>
>> The patch you sent (below) does not in itself help anything, but
>> lowering the constant to 8 instead of 256 does help. I'm not sure
>> we want something that low, though; probably better to fix the behaviour
>> of cake, no?
>
> Presumably codel has a similar issue ?

Not directly, because codel is sojourn-time based, which means it only
triggers when packets stay in the queue for an extended period of time;
so as long as there's some progress being made, codel will get out of
its drop state (or not get into it in the first place). BLUE, by
contrast, is based solely on the queue overflowing, and it doesn't back
off until the queue is completely empty.
BLUE was added as a mechanism to aggressively punish unresponsive flows;
I guess it's succeeding in this case? :P
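
To make the difference concrete, here is a rough userspace model of the
two control laws (the names and constants below are made up and don't
match the actual codel/sch_cake code; real codel also keeps per-queue
state and timing that this leaves out):

/* Illustrative model only: the point is what each algorithm reacts to. */
#include <stdbool.h>
#include <stdio.h>

/* codel-style: the input is packet *delay* (sojourn time), which
 * recovers as soon as the queue makes progress again */
static bool codel_wants_drop(unsigned int sojourn_us, unsigned int target_us)
{
	return sojourn_us > target_us;
}

/* BLUE-style: the input is queue *overflow*, and the drop probability
 * only decays once the queue has gone completely empty */
struct blue_state {
	double p_drop;		/* ratchets up towards 1.0 */
};

static void blue_update(struct blue_state *b, bool overflowed, bool queue_empty)
{
	if (overflowed)
		b->p_drop = b->p_drop + 0.0025 > 1.0 ? 1.0 : b->p_drop + 0.0025;
	else if (queue_empty)
		b->p_drop = b->p_drop < 0.0025 ? 0.0 : b->p_drop - 0.0025;
	/* nothing lowers p_drop while the queue stays non-empty, so an
	 * unresponsive flow keeps it pinned at 1.0 */
}

int main(void)
{
	struct blue_state b = { .p_drop = 0.0 };
	int i;

	/* a sustained overload ratchets BLUE all the way up ... */
	for (i = 0; i < 1000; i++)
		blue_update(&b, true, false);
	printf("BLUE p_drop after sustained overflow: %.2f\n", b.p_drop);

	/* ... while codel stops dropping as soon as sojourn time is back
	 * below target */
	printf("codel drop at 2ms sojourn, 5ms target: %d\n",
	       codel_wants_drop(2000, 5000));
	return 0;
}
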
> We can add to dequeue() a mechanism to queue skbs that need to be dropped
> after the spinlock and running bit are released.
>
> We did something similar in 2016 for the enqueue part [1]
>
> In 2025 this might be a bit more challenging because of eBPF qdisc.
>
> Instead of adding a new parameter, perhaps add in 'struct Qdisc' a
> *tofree pointer.
>
> I can work on a patch today.

This sounds like an excellent idea in any case - thanks! :)
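
Just to check that I understand the shape of what you're proposing,
here's roughly what I'm picturing, as a quick userspace mock-up (the
names are made up and don't match the real struct Qdisc changes):

/* Mock-up only: drops are merely *chained* while the qdisc lock is
 * held, and the actual freeing happens after the lock (and, in the
 * kernel, the running bit) has been released. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct fake_skb {
	struct fake_skb *next;
	int id;
};

struct fake_qdisc {
	pthread_mutex_t lock;	 /* stands in for the qdisc spinlock */
	struct fake_skb *tofree; /* skbs queued for deferred freeing */
};

/* called with q->lock held: just chain the skb, don't free it yet */
static void qdisc_drop_deferred(struct fake_qdisc *q, struct fake_skb *skb)
{
	skb->next = q->tofree;
	q->tofree = skb;
}

/* called after q->lock has been dropped: do the (expensive) frees */
static void qdisc_flush_tofree(struct fake_skb *list)
{
	while (list) {
		struct fake_skb *next = list->next;

		printf("freeing skb %d outside the lock\n", list->id);
		free(list);
		list = next;
	}
}

int main(void)
{
	struct fake_qdisc q = { .lock = PTHREAD_MUTEX_INITIALIZER };
	struct fake_skb *batch;
	int i;

	pthread_mutex_lock(&q.lock);
	for (i = 0; i < 3; i++) {
		struct fake_skb *skb = malloc(sizeof(*skb));

		if (!skb)
			break;
		skb->id = i;
		qdisc_drop_deferred(&q, skb);
	}
	/* detach the list under the lock, free it without the lock */
	batch = q.tofree;
	q.tofree = NULL;
	pthread_mutex_unlock(&q.lock);

	qdisc_flush_tofree(batch);
	return 0;
}

Detaching the list while holding the lock keeps the critical section
short, and the freeing cost then lands outside both the lock and the
running bit, which should help exactly in the overload case where cake
ends up dropping a lot of packets.
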
-Toke