Message-ID: <aGh6HiQmcgsPug1u@pop-os.localdomain>
Date: Fri, 4 Jul 2025 18:04:30 -0700
From: Cong Wang <xiyou.wangcong@...il.com>
To: William Liu <will@...lsroot.io>
Cc: "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
	Jiri Pirko <jiri@...nulli.us>, Jamal Hadi Salim <jhs@...atatu.com>,
	Savy <savy@...t3mfailure.io>, Paolo Abeni <pabeni@...hat.com>
Subject: Re: [BUG]  Inconsistency between qlen and backlog in hhf, fq,
 fq_codel, and fq_pie causing WARNING in qdisc_tree_reduce_backlog

On Wed, Jul 02, 2025 at 08:39:45PM +0000, William Liu wrote:
> Hi,
> 
> We write to report a qlen/backlog consistency bug affecting hhf, fq, fq_codel, and fq_pie when they act as a child of tbf. The bug was introduced by last month's fix [1], which addressed a null dereference caused by gso segmentation together with a temporary inconsistency in queue state when tbf, having run out of tokens, peeks at its child during tbf_dequeue. We reported that original bug ourselves but did not notice the subtle problem in the fix until now. We are aware of bugs with similar symptoms reported by Mingi [3] and Lion [4], but those have a different root cause (at least as far as we can tell from Mingi's report).

Thanks for your report.

> 
> This works on the upstream kernel, and we have the following reproducer.
> 
> ./tc qdisc del dev lo root
> ./tc qdisc add dev lo root handle 1: tbf rate 8bit burst 100b latency 1ms || echo TBF
> ./tc qdisc add dev lo handle 3: parent 1:1 hhf limit 1000 || echo HH
> ping -I lo -f -c1 -s32 -W0.001 127.0.0.1 2>&1 >/dev/null
> ./tc qdisc change dev lo handle 3: parent 1:1 hhf limit 0 || echo HH
> ./tc qdisc replace dev lo handle 2: parent 1:1 sfq || echo SFQ 
> 
> Note that a patched version of tc that supports a limit of 0 must be built. The symptom of the bug is the WARN_ON_ONCE check firing in qdisc_tree_reduce_backlog [2] with n equal to 0. You can replace hhf with fq, fq_codel, or fq_pie to trigger the warning as well, though the success rate may vary.
> 
> The root cause is the newly introduced function qdisc_dequeue_internal, which the change handlers of the affected qdiscs call [5]. When it dequeues from a non-empty gso_skb queue, only qlen is decremented; backlog is not touched. The skb is inserted into gso_skb by qdisc_peek_dequeued, which tbf calls on these qdiscs when they are its child.
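> 
> For reference, qdisc_peek_dequeued looks roughly like this (paraphrasing include/net/sch_generic.h, not quoting it verbatim); note that it increments both qlen and backlog when it stashes the peeked skb:
> 
> static inline struct sk_buff *qdisc_peek_dequeued(struct Qdisc *sch)
> {
>     struct sk_buff *skb = skb_peek(&sch->gso_skb);
> 
>     if (!skb) {
>         skb = sch->dequeue(sch);
>         if (skb) {
>             __skb_queue_head(&sch->gso_skb, skb);
>             /* it's still part of the queue */
>             qdisc_qstats_backlog_inc(sch, skb);
>             sch->q.qlen++;
>         }
>     }
>     return skb;
> }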
> 
> When the qdisc is replaced, tbf_graft runs, and qdisc_purge_queue calls qdisc_tree_reduce_backlog with the inconsistent values, which one can observe by printk'ing the qlen and backlog values passed in.
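> 
> For context, qdisc_purge_queue snapshots these counters and feeds them straight into qdisc_tree_reduce_backlog, roughly as follows (again paraphrasing include/net/sch_generic.h):
> 
> static inline void qdisc_purge_queue(struct Qdisc *sch)
> {
>     __u32 qlen, backlog;
> 
>     /* snapshots qlen/backlog, which have already diverged */
>     qdisc_qstats_qlen_backlog(sch, &qlen, &backlog);
>     qdisc_reset(sch);
>     qdisc_tree_reduce_backlog(sch, qlen, backlog);
> }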

If I understand you correctly, the problem is the inconsistent behavior
between qdisc_purge_queue() and qdisc_dequeue_internal()? And it is
because the former does not take care of ->gso_skb?
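
If so, a minimal fix might be to keep backlog in sync inside
qdisc_dequeue_internal, something like this untested sketch:

    skb = __skb_dequeue(&sch->gso_skb);
    if (skb) {
        sch->q.qlen--;
        /* keep backlog consistent with qlen */
        qdisc_qstats_backlog_dec(sch, skb);
        return skb;
    }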

> 
> While historically this warning has often been a symptom of a UAF, it seems benign in this case as far as we can tell. The warning can only trigger in tbf_graft, and the corrupted class is removed and made inaccessible regardless. With Lion's patch it also looks like qlen_notify will always fire, which is good.
> 
> However, the overall operation of qdisc_dequeue_internal, in conjunction with how it is used, is strange. Posting the function here for reference:
> 
> static inline struct sk_buff *qdisc_dequeue_internal(struct Qdisc *sch, bool direct)
> {
>     struct sk_buff *skb;
> 
>     skb = __skb_dequeue(&sch->gso_skb);
>     if (skb) {
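>         /* qlen is decremented here, but qstats.backlog is not */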
>         sch->q.qlen--;
>         return skb;
>     }
>     if (direct)
>         return __qdisc_dequeue_head(&sch->q);
>     else
>         return sch->dequeue(sch);
> }
> 
> The qdiscs pie, codel, fq, fq_pie, and fq_codel all adjust qlen and backlog in the same loop in which they call qdisc_dequeue_internal to shrink the queue back down to the newly requested limit. In the gso_skb case this always seems incorrect, as the dropped packets end up double counted. In the non-gso_skb case this looks fine when direct is true, as for codel and pie, but can be a problem otherwise, since the qdisc's own dequeue handler adjusts qlen and backlog itself. In the hhf case, no qlen or backlog accounting is done after qdisc_dequeue_internal in the loop at all (it just tracks before and after values); see the sketch below.
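> 
> To illustrate, the limit-shrinking loops look roughly like this (an illustrative paraphrase, not any single qdisc verbatim; dropped_pkts and dropped_bytes stand in for the per-qdisc drop statistics):
> 
> while (sch->q.qlen > sch->limit) {
>     struct sk_buff *skb = qdisc_dequeue_internal(sch, false);
> 
>     if (!skb)
>         break;
>     /* if skb came from ->gso_skb, qlen was already decremented
>      * inside qdisc_dequeue_internal, so it is counted again here */
>     dropped_pkts++;
>     dropped_bytes += qdisc_pkt_len(skb);
>     rtnl_kfree_skbs(skb, skb);
> }
> qdisc_tree_reduce_backlog(sch, dropped_pkts, dropped_bytes);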

I noticed the inconsistent interpretation of sch->limit too: some qdiscs
just shrink their backlog down to the new limit (assuming it is smaller
than the old one), while others simply flush everything.

The reason I didn't touch it is that it _may_ be too late to change:
the behavior is exposed to users, so there may be users relying on the
existing semantics.

> 
> Cong, I see you posted an RFC for cleaning up GSO segmentation. Will it address this inconsistency issue?

No, actually ->gso_skb has nothing to do with GSO segmentation here. It
is a terribly misleading name; it should be called "->peeked_skb". I
wanted to rename it but was too lazy to do so. (You are welcome to work
on this if you have time.)

Thanks!
