netdev - Re: [BUG] net/sched: Race Condition and Null Dereference in codel_change, pie_change, fq_pie_change, fq_codel_change, hhf

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <B2ZSzsBR9rUWlLkrgrMrCzqOGeSFxXIkYImvul6994v5tDSqykWo1UaWKRV-SNkNKJurgVzRcnPN07ZAVYykRaYhADyIwTxQ18OQfKDpILQ=@protonmail.com>
Date: Sun, 27 Apr 2025 21:26:43 +0000
From: Will <willsroot@...tonmail.com>
To: Cong Wang <xiyou.wangcong@...il.com>
Cc: "netdev@...r.kernel.org" <netdev@...r.kernel.org>, Savy <savy@...t3mfailure.io>, jhs@...atatu.com, jiri@...nulli.us
Subject: Re: [BUG] net/sched: Race Condition and Null Dereference in codel_change, pie_change, fq_pie_change, fq_codel_change, hhf_change

On Saturday, April 26th, 2025 at 10:56 PM, Cong Wang <xiyou.wangcong@...il.com> wrote:

> 
> 
> Hi Will,
> 
> On Fri, Apr 25, 2025 at 02:14:07PM +0000, Will wrote:
> 
> > Hi all,
> > 
> > We've encountered and triaged the following race condition that occurs across 5 different qdiscs: codel, pie, fq_pie, fq_codel, and hhf. It came up on a modified version of Syzkaller we're working on for a research project. It works on upstream (02ddfb981de88a2c15621115dd7be2431252c568), the 6.6 LTS branch, and the 6.1 LTS branch and has existed since at least 2016 (and earlier too for some of the other listed qdiscs): https://github.com/torvalds/linux/commit/2ccccf5fb43ff62b2b9.
> > 
> > We will take codel_change as the main example here, as the other vulnerable qdiscs change functions follow the same pattern. When the limit changes, the qdisc attempts to shrink the queue size back under the limit: https://elixir.bootlin.com/linux/v6.15-rc3/source/net/sched/sch_codel.c#L146. However, this is racy against a qdisc's dequeue function. This limit check could pass and the qdisc will attempt to dequeue the head, but the actual qdisc's dequeue function (codel_qdisc_dequeue in this case) could run. This would lead to the dequeued skb being null in the change function when only one packet remains in the queue, so when the function calls qdisc_pkt_len(skb), a null dereference exception would occur.
> 
> 
> Thanks for your detailed report, reproducer and patch!
> 
> I have two questions here:
> 
> 1. Why do you say it is racy? We have sch_tree_lock() held when flushing
> the packets in the backlog, it should be sufficient to prevent
> concurrent ->dequeue().
> 
> 
> 2. I don't see immediately from your report why we could get a NULL from
> __qdisc_dequeue_head(), because unless sch->q.qlen is wrong, we should
> 
> always have packets in the queue until we reach 0 (you specifically used
> 0 as the limit here).
> 
> The reason why I am asking is that if we had any of them wrong here, we
> would have a biger trouble than just missing a NULL check.
> 
> 
> Best regards,
> Cong Wang

Hi Cong,

Thank you for the reply. On further analysis, we realized that you are correct - it is not a race condition between xxx_change and xxx_dequeue. The root cause is more complicated and actually relates to the parent tbf qdisc. The bug is still a race condition though.

__qdisc_dequeue_head() can still return null even if sch->q.qlen is non-zero because of qdisc_peek_dequeued, which is the vulnerable qdiscs' peek handler, and tbf_dequeue calls it (https://elixir.bootlin.com/linux/v6.15-rc3/source/net/sched/sch_tbf.c#L280). There, the inner qdisc dequeues content before, adds it back to gso_skb, and increments qlen (https://elixir.bootlin.com/linux/v6.15-rc3/source/include/net/sch_generic.h#L1133). A queue state consistency issue arises when tbf does not have enough tokens (https://elixir.bootlin.com/linux/v6.15-rc3/source/net/sched/sch_tbf.c#L302) for dequeuing. The qlen value will be fixed when sufficient tokens exist and the watchdog fires again. However, there is a window for the inner qdisc to encounter this inconsistency and thus hit the null dereference.

Savy made this diagram below to showcase the interactions to trigger the bug.

Packet 1 is sent:

    tbf_enqueue()
        qdisc_enqueue()
            codel_qdisc_enqueue() // Codel qlen is 0
                qdisc_enqueue_tail()
                // Packet 1 is added to the queue
                // Codel qlen = 1

    tbf_dequeue()
        qdisc_peek_dequeued()
            skb_peek(&sch->gso_skb) // sch->gso_skb is empty
            codel_qdisc_dequeue() // Codel qlen is 1
                qdisc_dequeue_head()
                // Packet 1 is removed from the queue
                // Codel qlen = 0
            __skb_queue_head(&sch->gso_skb, skb); // Packet 1 is added to gso_skb list
            sch->q.qlen++ // Codel qlen = 1
        qdisc_dequeue_peeked()
            skb = __skb_dequeue(&sch->gso_skb) // Packet 1 is removed from the gso_skb list
            sch->q.qlen-- // Codel qlen = 0

Packet 2 is sent:

    tbf_enqueue()
        qdisc_enqueue()
            codel_qdisc_enqueue() // Codel qlen is 0
                qdisc_enqueue_tail()
                // Packet 2 is added to the queue
                // Codel qlen = 1

    tbf_dequeue()
        qdisc_peek_dequeued()
            skb_peek(&sch->gso_skb) // sch->gso_skb is empty
            codel_qdisc_dequeue() // Codel qlen is 1
                qdisc_dequeue_head()
                // Packet 2 is removed from the queue
                // Codel qlen = 0
            __skb_queue_head(&sch->gso_skb, skb); // Packet 2 is added to gso_skb list
            sch->q.qlen++ // Codel qlen = 1

        // TBF runs out of tokens and reschedules itself for later
        qdisc_watchdog_schedule_ns()


Notice here how codel is left in an "inconsistent" state, as sch->q.qlen > 0, but there are no packets left in the codel queue (sch->q.head is NULL)

At this point, codel_change() can be used to update the limit to 0. However, even if  sch->q.qlen > 0, there are no packets in the queue, so __qdisc_dequeue_head() returns NULL and the null-ptr-deref occurs.

Here is an updated PoC below that triggers this with just two packets. The same patch with an updated commit message is below as well. 

#!/bin/bash

# --------------------------------------------------------------------
# Authors: Will (will@...lsroot.io) and Savy (savy@...t3mfailure.io)
# Description: null-ptr-deref across 4 qdiscs from xxx_change function
# --------------------------------------------------------------------

if [ $# -eq 0 ]; then
        echo "Usage: $0 <codel/pie/fq_codel/fq_pie/hhf>"
        exit 1
fi

crash() {
    local name=$1
    ./tc qdisc add dev lo root handle 1: tbf rate 8bit burst 100b latency 1s
    ./tc qdisc add dev lo parent 1: handle 2: "$name" limit 1
    ping -I lo -f -c2 -s48 -W0.001 127.0.0.1
    ./tc qdisc change dev lo parent 1: handle 2: "$name" limit 0
}

case $1 in
        codel|pie|fq_codel|fq_pie|hhf)
                echo "Triggering crash in $1..."
                crash "$1"
                ;;
        *)
                echo "Error: Invalid qdisc!"
                echo "Usage: $0 <codel/pie/fq_codel/fq_pie/hhf>"
                exit 1
                ;;
esac

Best,
Will (will@...lsroot.io)
Savy (savy@...t3mfailure.io)

>From 58114954b375868497ed9850db7b983e0f61589f Mon Sep 17 00:00:00 2001
From: William Liu <will@...lsroot.io>
Date: Sun, 27 Apr 2025 13:59:03 -0700
Subject: [PATCH] net/sched: Exit dequeue loop if empty queue in change
 handlers for codel, pie, fq_codel, fq_pie, and hhf Qdisc_ops

When the inner qdisc of a TBF qdisc dequeue packets to reach the newly
set limit in its change handler, the qlen variable might be inconsistent
with the actual queue length due to the TBF qdisc moving elements to
gso_skb without sufficient tokens in tbf_dequeue. A TBF watchdog
addresses this inconsistency later on, but the change handler can see
this inconsistent state and result in a null dereference in codel_change,
pie_change, fq_codel_change, fq_pie_change, or a soft lockup as in
hhf_change.

Check to ensure that the skb is not null before continuing.

Signed-off-by: William Liu <will@...lsroot.io>
Signed-off-by: Savino Dicanosa <savy@...t3mfailure.io>
---
 net/sched/sch_codel.c    | 3 +++
 net/sched/sch_fq_codel.c | 3 +++
 net/sched/sch_fq_pie.c   | 3 +++
 net/sched/sch_hhf.c      | 3 +++
 net/sched/sch_pie.c      | 3 +++
 5 files changed, 15 insertions(+)

diff --git a/net/sched/sch_codel.c b/net/sched/sch_codel.c
index 12dd71139da3..e714f324a85b 100644
--- a/net/sched/sch_codel.c
+++ b/net/sched/sch_codel.c
@@ -146,6 +146,9 @@ static int codel_change(struct Qdisc *sch, struct nlattr *opt,
    while (sch->q.qlen > sch->limit) {
        struct sk_buff *skb = __qdisc_dequeue_head(&sch->q);
 
+       if (!skb)
+           break;
+
        dropped += qdisc_pkt_len(skb);
        qdisc_qstats_backlog_dec(sch, skb);
        rtnl_qdisc_drop(skb, sch);
diff --git a/net/sched/sch_fq_codel.c b/net/sched/sch_fq_codel.c
index 6c9029f71e88..85d591b430e4 100644
--- a/net/sched/sch_fq_codel.c
+++ b/net/sched/sch_fq_codel.c
@@ -443,6 +443,9 @@ static int fq_codel_change(struct Qdisc *sch, struct nlattr *opt,
           q->memory_usage > q->memory_limit) {
        struct sk_buff *skb = fq_codel_dequeue(sch);
 
+       if (!skb)
+           break;
+
        q->cstats.drop_len += qdisc_pkt_len(skb);
        rtnl_kfree_skbs(skb, skb);
        q->cstats.drop_count++;
diff --git a/net/sched/sch_fq_pie.c b/net/sched/sch_fq_pie.c
index f3b8203d3e85..f52e355698ca 100644
--- a/net/sched/sch_fq_pie.c
+++ b/net/sched/sch_fq_pie.c
@@ -368,6 +368,9 @@ static int fq_pie_change(struct Qdisc *sch, struct nlattr *opt,
    while (sch->q.qlen > sch->limit) {
        struct sk_buff *skb = fq_pie_qdisc_dequeue(sch);
 
+       if (!skb)
+           break;
+
        len_dropped += qdisc_pkt_len(skb);
        num_dropped += 1;
        rtnl_kfree_skbs(skb, skb);
diff --git a/net/sched/sch_hhf.c b/net/sched/sch_hhf.c
index 44d9efe1a96a..c9da90689e0c 100644
--- a/net/sched/sch_hhf.c
+++ b/net/sched/sch_hhf.c
@@ -566,6 +566,9 @@ static int hhf_change(struct Qdisc *sch, struct nlattr *opt,
    while (sch->q.qlen > sch->limit) {
        struct sk_buff *skb = hhf_dequeue(sch);
 
+       if (!skb)
+           break;
+
        rtnl_kfree_skbs(skb, skb);
    }
    qdisc_tree_reduce_backlog(sch, qlen - sch->q.qlen,
diff --git a/net/sched/sch_pie.c b/net/sched/sch_pie.c
index 3771d000b30d..6810126b8368 100644
--- a/net/sched/sch_pie.c
+++ b/net/sched/sch_pie.c
@@ -197,6 +197,9 @@ static int pie_change(struct Qdisc *sch, struct nlattr *opt,
    while (sch->q.qlen > sch->limit) {
        struct sk_buff *skb = __qdisc_dequeue_head(&sch->q);
 
+       if (!skb)
+           break;
+
        dropped += qdisc_pkt_len(skb);
        qdisc_qstats_backlog_dec(sch, skb);
        rtnl_qdisc_drop(skb, sch);
-- 
2.43.0