[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CAM0EoMnxaMCmbKfgQ_=rQ2HcbmPBBqdMiOkjTe2ShbOKow0K1g@mail.gmail.com>
Date: Mon, 23 Jun 2025 10:43:47 -0400
From: Jamal Hadi Salim <jhs@...atatu.com>
To: Lion Ackermann <nnamrec@...il.com>
Cc: netdev@...r.kernel.org, Cong Wang <xiyou.wangcong@...il.com>,
Jiri Pirko <jiri@...nulli.us>
Subject: Re: Incomplete fix for recent bug in tc / hfsc
Hi,
On Mon, Jun 23, 2025 at 6:41 AM Lion Ackermann <nnamrec@...il.com> wrote:
>
> Hello,
>
> I noticed the fix for a recent bug in sch_hfsc in the tc subsystem is
> incomplete:
> sch_hfsc: Fix qlen accounting bug when using peek in hfsc_enqueue()
> https://lore.kernel.org/all/20250518222038.58538-2-xiyou.wangcong@gmail.com/
>
> This patch also included a test which landed:
> selftests/tc-testing: Add an HFSC qlen accounting test
>
> Basically running the included test case on a sanitizer kernel or with
> slub_debug=P will directly reveal the UAF:
>
> # this is from the test case:
> ip link set dev lo up
> tc qdisc add dev lo root handle 1: drr
> tc filter add dev lo parent 1: basic classid 1:1
> tc class add dev lo parent 1: classid 1:1 drr
> tc qdisc add dev lo parent 1:1 handle 2: hfsc def 1
> tc class add dev lo parent 2: classid 2:1 hfsc rt m1 8 d 1 m2 0
> tc qdisc add dev lo parent 2:1 handle 3: netem
> tc qdisc add dev lo parent 3:1 handle 4: blackhole
>
> # .. and slightly modified to show UAF:
> echo 1 | socat -u STDIN UDP4-DATAGRAM:127.0.0.1:8888
> tc class delete dev lo classid 1:1
> echo 1 | socat -u STDIN UDP4-DATAGRAM:127.0.0.1:8888
>
>
> [ ... ]
> [ 11.405423] general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6b83: 0000 [#1] PREEMPT SMP NOPTI
> [ 11.407511] CPU: 5 PID: 456 Comm: socat Not tainted 6.6.93 #1
> [ 11.408496] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-4.fc42 04/01/2014
> [ 11.410008] RIP: 0010:drr_dequeue+0x30/0x230
> [ 11.410690] Code: 55 41 54 4c 8d a7 80 01 00 00 55 48 89 fd 53 48 8b 87 80 01 00 00 49 39 c4 0f 84 84 00 00 00 48 8b 9d 80 01 00 00 48 8b 7b 10 <48> 8b 47 18 48 8b 40 38 ff d0 0f 1f 00 48 85 c0 74 57 8b 50 28 8b
> [ 11.414042] RSP: 0018:ffffc90000dff920 EFLAGS: 00010287
> [ 11.415081] RAX: ffff888104097950 RBX: ffff888104097950 RCX: ffff888106af3300
> [ 11.416878] RDX: 0000000000000001 RSI: ffff888103adb200 RDI: 6b6b6b6b6b6b6b6b
> [ 11.418429] RBP: ffff888103bbb000 R08: 0000000000000000 R09: ffffc90000dff9b8
> [ 11.419933] R10: ffffc90000dffaa0 R11: ffffc90000dffa30 R12: ffff888103bbb180
> [ 11.421333] R13: 0000000000000000 R14: ffff888103bbb0ac R15: ffff888103bbb000
> [ 11.422698] FS: 000078f12f836740(0000) GS:ffff88813bd40000(0000) knlGS:0000000000000000
> [ 11.424250] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 11.425292] CR2: 00005a9725d64000 CR3: 000000010418c004 CR4: 0000000000770ee0
> [ 11.426584] PKRU: 55555554
> [ 11.427056] Call Trace:
> [ 11.427506] <TASK>
> [ 11.427901] __qdisc_run+0x85/0x610
> [ 11.428497] __dev_queue_xmit+0x5f9/0xde0
> [ 11.429195] ? nf_hook_slow+0x3e/0xc0
> [ 11.429873] ip_finish_output2+0x2b7/0x550
> [ 11.430560] ip_send_skb+0x82/0x90
> [ 11.431195] udp_send_skb+0x15c/0x380
> [ 11.431880] udp_sendmsg+0xb91/0xfa0
> [ 11.432524] ? __pfx_ip_generic_getfrag+0x10/0x10
> [ 11.433434] ? __sys_sendto+0x1e0/0x210
> [ 11.434216] __sys_sendto+0x1e0/0x210
> [ 11.434984] __x64_sys_sendto+0x20/0x30
> [ 11.435868] do_syscall_64+0x5e/0x90
> [ 11.436631] ? apparmor_file_permission+0x82/0x180
> [ 11.437637] ? vfs_read+0x2fc/0x340
> [ 11.438520] ? exit_to_user_mode_prepare+0x1a/0x150
> [ 11.440008] ? syscall_exit_to_user_mode+0x27/0x40
> [ 11.440917] ? do_syscall_64+0x6a/0x90
> [ 11.441617] ? do_user_addr_fault+0x319/0x650
> [ 11.442524] ? exit_to_user_mode_prepare+0x1a/0x150
> [ 11.443499] entry_SYSCALL_64_after_hwframe+0x78/0xe2
>
>
> To be completely honest I do not quite understand the rationale behind the
> original patch. The problem is that the backlog corruption propagates to
> the parent _before_ parent is even expecting any backlog updates.
> Looking at f.e. DRR: Child is only made active _after_ the enqueue completes.
> Because HFSC is messing with the backlog before the enqueue completed,
> DRR will simply make the class active even though it should have already
> removed the class from the active list due to qdisc_tree_backlog_flush.
> This leaves the stale class in the active list and causes the UAF.
>
> Looking at other qdiscs the way DRR handles child enqueues seems to resemble
> the common case. HFSC calling dequeue in the enqueue handler violates
> expectations. In order to fix this either HFSC has to stop using dequeue or
> all classful qdiscs have to be updated to catch this corner case where
> child qlen was zero even though the enqueue succeeded. Alternatively HFSC
> could signal enqueue failure if it sees child dequeue dropping packets to
> zero? I am not sure how this all plays out with the re-entrant case of
> netem though.
>
> If I can provide more help, please let me know.
>
My suggestion is we go back to a proposal i made a few moons back (was
this in a discussion with you? i dont remember): create a mechanism to
disallow certain hierarchies of qdiscs based on certain attributes,
example in this case disallow hfsc from being the ancestor of "qdiscs that may
drop during peek" (such as netem). Then we can just keep adding more
"disallowed configs" that will be rejected via netlink. Similar idea
is being added to netem to disallow double duplication, see:
https://lore.kernel.org/netdev/20250622190344.446090-1-will@willsroot.io/
cheers,
jamal
Powered by blists - more mailing lists