Message-ID: <CAM0EoM=mey1f596GS_9-VkLyTmMqM0oJ7TuGZ6i73++tEVFAKg@mail.gmail.com>
Date: Sun, 29 Jun 2025 10:29:44 -0400
From: Jamal Hadi Salim <jhs@...atatu.com>
To: Lion Ackermann <nnamrec@...il.com>
Cc: Cong Wang <xiyou.wangcong@...il.com>, netdev@...r.kernel.org,
Jiri Pirko <jiri@...nulli.us>, Mingi Cho <mincho@...ori.io>
Subject: Re: Incomplete fix for recent bug in tc / hfsc
On Sat, Jun 28, 2025 at 5:43 PM Jamal Hadi Salim <jhs@...atatu.com> wrote:
>
> On Thu, Jun 26, 2025 at 4:08 AM Lion Ackermann <nnamrec@...il.com> wrote:
> >
> > Hi,
> >
> > On 6/25/25 4:22 PM, Jamal Hadi Salim wrote:
> > > On Tue, Jun 24, 2025 at 6:43 AM Lion Ackermann <nnamrec@...il.com> wrote:
> > >>
> > >> Hi,
> > >>
> > >> On 6/24/25 11:24 AM, Lion Ackermann wrote:
> > >>> Hi,
> > >>>
> > >>> On 6/24/25 6:41 AM, Cong Wang wrote:
> > >>>> On Mon, Jun 23, 2025 at 12:41:08PM +0200, Lion Ackermann wrote:
> > >>>>> Hello,
> > >>>>>
> > >>>>> I noticed the fix for a recent bug in sch_hfsc in the tc subsystem is
> > >>>>> incomplete:
> > >>>>> sch_hfsc: Fix qlen accounting bug when using peek in hfsc_enqueue()
> > >>>>> https://lore.kernel.org/all/20250518222038.58538-2-xiyou.wangcong@gmail.com/
> > >>>>>
> > >>>>> This patch also included a test which landed:
> > >>>>> selftests/tc-testing: Add an HFSC qlen accounting test
> > >>>>>
> > >>>>> Basically running the included test case on a sanitizer kernel or with
> > >>>>> slub_debug=P will directly reveal the UAF:
> > >>>>
> > >>>> Interesting, I have SLUB debugging enabled in my kernel config too:
> > >>>>
> > >>>> CONFIG_SLUB_DEBUG=y
> > >>>> CONFIG_SLUB_DEBUG_ON=y
> > >>>> CONFIG_SLUB_RCU_DEBUG=y
> > >>>>
> > >>>> But I didn't catch this bug.
> > >>>>
> > >>>
> > >>> Technically, the class deletion step which triggered the sanitizer was
> > >>> not present in your test case. The test case only left the stale
> > >>> pointer, which was never accessed, though.
> > >>>
> > >>>>> To be completely honest, I do not quite understand the rationale
> > >>>>> behind the original patch. The problem is that the backlog corruption
> > >>>>> propagates to the parent _before_ the parent is even expecting any
> > >>>>> backlog updates.
> > >>>>> Looking at e.g. DRR: the child is only made active _after_ the enqueue
> > >>>>> completes. Because HFSC is messing with the backlog before the enqueue
> > >>>>> has completed, DRR will simply make the class active even though it
> > >>>>> should have already removed the class from the active list due to
> > >>>>> qdisc_tree_reduce_backlog. This leaves the stale class in the active
> > >>>>> list and causes the UAF.
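> > >>>>>
> > >>>>> To make that concrete, here is a simplified sketch of the relevant
> > >>>>> part of drr_enqueue() (paraphrased from net/sched/sch_drr.c; error
> > >>>>> handling elided, details may differ per kernel version):
> > >>>>>
> > >>>>> 	first = !cl->qdisc->q.qlen;
> > >>>>> 	err = qdisc_enqueue(skb, cl->qdisc, to_free);
> > >>>>> 	if (unlikely(err != NET_XMIT_SUCCESS))
> > >>>>> 		return err;
> > >>>>> 	/* The child reported success, so DRR assumes it now holds at
> > >>>>> 	 * least one packet. If the child was drained back to zero
> > >>>>> 	 * while the enqueue was still in flight (HFSC's peek), the
> > >>>>> 	 * class is activated anyway and can be left stale on the
> > >>>>> 	 * active list.
> > >>>>> 	 */
> > >>>>> 	if (first) {
> > >>>>> 		list_add_tail(&cl->alist, &q->active);
> > >>>>> 		cl->deficit = cl->quantum;
> > >>>>> 	}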
> > >>>>>
> > >>>>> Looking at other qdiscs, the way DRR handles child enqueues seems to
> > >>>>> be the common case. HFSC calling dequeue in the enqueue handler
> > >>>>> violates expectations. To fix this, either HFSC has to stop using
> > >>>>> dequeue, or all classful qdiscs have to be updated to catch the corner
> > >>>>> case where the child qlen is zero even though the enqueue succeeded.
> > >>>>> Alternatively, HFSC could signal enqueue failure if it sees the
> > >>>>> child's dequeue dropping its queue to zero (rough sketch below)? I am
> > >>>>> not sure how this all plays out with the re-entrant case of netem,
> > >>>>> though.
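> > >>>>>
> > >>>>> For that last option, an untested sketch of the idea in
> > >>>>> hfsc_enqueue(), just to illustrate the shape:
> > >>>>>
> > >>>>> 	first = !cl->qdisc->q.qlen;
> > >>>>> 	err = qdisc_enqueue(skb, cl->qdisc, to_free);
> > >>>>> 	if (unlikely(err != NET_XMIT_SUCCESS))
> > >>>>> 		return err;
> > >>>>> 	if (first) {
> > >>>>> 		if (cl->cl_flags & HFSC_RSC)
> > >>>>> 			init_ed(cl, len);	/* peeks; netem may drop */
> > >>>>> 		if (cl->cl_flags & HFSC_FSC)
> > >>>>> 			init_vf(cl, len);
> > >>>>> 	}
> > >>>>> 	/* If the peek drained the child back to zero, the packet is
> > >>>>> 	 * already gone: report a drop instead of success so the
> > >>>>> 	 * parent never activates this class.
> > >>>>> 	 */
> > >>>>> 	if (first && !cl->qdisc->q.qlen)
> > >>>>> 		return NET_XMIT_DROP;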
> > >>>>
> > >>>> I think this may be the same bug report from Mingi in the security
> > >>>> mailing list. I will take a deep look after I get back from Open Source
> > >>>> Summit this week. (But you are still very welcome to work on it by
> > >>>> yourself, just let me know.)
> > >>>>
> > >>>> Thanks!
> > >>>
> > >>>> My suggestion is we go back to a proposal I made a few moons back (was
> > >>>> this in a discussion with you? I don't remember): create a mechanism to
> > >>>> disallow certain hierarchies of qdiscs based on certain attributes. For
> > >>>> example, in this case, disallow hfsc from being the ancestor of "qdiscs
> > >>>> that may drop during peek" (such as netem). Then we can just keep adding
> > >>>> more "disallowed configs" that will be rejected via netlink. A similar
> > >>>> idea is being added to netem to disallow double duplication, see:
> > >>>> https://lore.kernel.org/netdev/20250622190344.446090-1-will@willsroot.io/
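> > >>>>
> > >>>> Roughly, the shape this could take at grafting time (the flags and
> > >>>> the helper are invented here purely for illustration; nothing like
> > >>>> this exists today):
> > >>>>
> > >>>> 	/* Hypothetical check, called from qdisc_graft() for each
> > >>>> 	 * (ancestor, new child) pair in the hierarchy.
> > >>>> 	 */
> > >>>> 	static bool qdisc_hierarchy_allowed(const struct Qdisc *parent,
> > >>>> 					    const struct Qdisc *child)
> > >>>> 	{
> > >>>> 		/* hfsc would advertise "I peek my children from
> > >>>> 		 * enqueue"; netem would advertise "my peek may drop".
> > >>>> 		 */
> > >>>> 		if ((parent->flags & TCQ_F_PEEKS_CHILDREN) &&
> > >>>> 		    (child->flags & TCQ_F_PEEK_MAY_DROP))
> > >>>> 			return false;
> > >>>> 		return true;
> > >>>> 	}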
> > >>>>
> > >>>> cheers,
> > >>>> jamal
> > >>>
> > >>> I vaguely remember Jamal's proposal from a while back, and I believe
> > >>> there was already some example code for this approach?
> > >>> Since there is another report, you have a better overview, so it is
> > >>> probably best if you look at it first. In the meantime I can think about
> > >>> the solution a bit more and possibly draft something if you wish.
> > >>>
> > >>> Thanks,
> > >>> Lion
> > >>
> > >> Actually, I was intrigued: what do you think about addressing only the
> > >> root of the use-after-free and (kind of) ignoring the backlog corruption?
> > >> After the recent patches where qlen_notify may get called multiple times,
> > >> we could simply loosen qdisc_tree_reduce_backlog to always notify when
> > >> the qdisc is empty. Since deleting any qdisc will run qdisc_reset /
> > >> qdisc_purge_queue at one point or another, this should always catch
> > >> leftovers. And we need not care about all the complexities involved in
> > >> keeping the backlog right and / or preventing certain hierarchies, which
> > >> seems rather tedious.
> > >> This requires some more testing, but I was imagining something like this:
> > >>
> > >> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
> > >> --- a/net/sched/sch_api.c
> > >> +++ b/net/sched/sch_api.c
> > >> @@ -780,15 +780,12 @@ static u32 qdisc_alloc_handle(struct net_device *dev)
> > >>
> > >> void qdisc_tree_reduce_backlog(struct Qdisc *sch, int n, int len)
> > >> {
> > >> - bool qdisc_is_offloaded = sch->flags & TCQ_F_OFFLOADED;
> > >> const struct Qdisc_class_ops *cops;
> > >> unsigned long cl;
> > >> u32 parentid;
> > >> bool notify;
> > >> int drops;
> > >>
> > >> - if (n == 0 && len == 0)
> > >> - return;
> > >> drops = max_t(int, n, 0);
> > >> rcu_read_lock();
> > >> while ((parentid = sch->parent)) {
> > >> @@ -797,17 +794,8 @@ void qdisc_tree_reduce_backlog(struct Qdisc *sch, int n, int len)
> > >>
> > >> if (sch->flags & TCQ_F_NOPARENT)
> > >> break;
> > >> - /* Notify parent qdisc only if child qdisc becomes empty.
> > >> - *
> > >> - * If child was empty even before update then backlog
> > >> - * counter is screwed and we skip notification because
> > >> - * parent class is already passive.
> > >> - *
> > >> - * If the original child was offloaded then it is allowed
> > >> - * to be seem as empty, so the parent is notified anyway.
> > >> - */
> > >> - notify = !sch->q.qlen && !WARN_ON_ONCE(!n &&
> > >> - !qdisc_is_offloaded);
> > >> + /* Notify parent qdisc only if child qdisc becomes empty. */
> > >> + notify = !sch->q.qlen;
> > >> /* TODO: perform the search on a per txq basis */
> > >> sch = qdisc_lookup(qdisc_dev(sch), TC_H_MAJ(parentid));
> > >> if (sch == NULL) {
> > >> @@ -816,6 +804,9 @@ void qdisc_tree_reduce_backlog(struct Qdisc *sch, int n, int len)
> > >> }
> > >> cops = sch->ops->cl_ops;
> > >> if (notify && cops->qlen_notify) {
> > >> + /* Note that qlen_notify must be idempotent as it may get called
> > >> + * multiple times.
> > >> + */
> > >> cl = cops->find(sch, parentid);
> > >> cops->qlen_notify(sch, cl);
> > >> }
> > >>
> > >
> > > I believe this will fix the issue. My concern is that we are not solving
> > > the root cause. I also posted a bunch of fixes for related issues
> > > (something Mingi Cho, on Cc, found) - see attachments; I am not in favor
> > > of these either.
> > > Most of these setups are nonsensical. After seeing so many of them, my
> > > view is that we should start disallowing such hierarchies.
> > >
> > > cheers,
> > > jamal
> >
> > I would also disagree with the attached patches, for various reasons:
> > - The QFQ patch relies on the packet size backlog, which is not to be
> > trusted because several sources may make it unreliable
> > (netem, size tables, GSO, etc.).
> > - In the TBF variant, ret may get overwritten during the loop,
> > so it only reflects the final packet's status. I would not trust
> > this to always work either.
> > - The DRR fix seems fine, but it still requires all other qdiscs to
> > be correct (and something similar would need to be applied to all
> > classful qdiscs?).
> > - The changes to qdisc_tree_reduce_backlog do not really make sense
> > to me; I must be missing something here.
> >
> > What do you think the root cause is here? AFAIK, what all the issues
> > have in common is that eventually qlen_notify is _not_ called,
> > thus leaving stale class pointers. Naturally, the consequence
> > could be to simply always call qlen_notify on class deletion and
> > make classful qdiscs aware that it may get called on inactive
> > classes. And this is what I tried with my proposal.
> > This does not solve the backlog issues, though. But the pressing
> > issue seems to be the UAF and not the statistics counters?
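> >
> > For illustration: after the recent idempotency work, DRR's handler is
> > already safe to call on an inactive class (paraphrasing
> > net/sched/sch_drr.c):
> >
> > static void drr_qlen_notify(struct Qdisc *csh, unsigned long arg)
> > {
> > 	struct drr_class *cl = (struct drr_class *)arg;
> >
> > 	/* list_del_init() leaves the entry self-linked, so a repeat
> > 	 * notification for an already-passive class is a no-op.
> > 	 */
> > 	list_del_init(&cl->alist);
> > }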
> >
> > My concern with preventing certain hierarchies is that we would
> > hide the backlog issues and end up chasing bad hierarchies.
> > Still, it would also solve all the problems eventually, I guess.
> >
>
> On "What do you think the root cause is here?"
>
> I believe the root cause is that qdiscs like hfsc and qfq are dropping
> all packets in enqueue (mostly in relation to peek()) and that result
> is not being reflected in the return code returned to its parent
> qdisc.
> So, in the example you described in this thread, drr is oblivious to
> the fact that the child qdisc dropped its packet because the call to
> its child enqueue returned NET_XMIT_SUCCESS. This causes drr to
> activate a class that shouldn't have been activated at all.
>
> You can argue that drr (and other similar qdiscs) could detect this by
> checking whether qlen_notify was called (as the drr patch was doing),
> but that seems really counter-intuitive. Imagine writing a new qdisc
> and having to check for that every time you call a child's enqueue.
> Sure, your patch solves this, but it also seems like it's not fixing
> the underlying issue (which is drr activating the class in the first
> place). Your patch simply removes all the classes from their active
> lists when you delete them. And your patch may seem OK for now, but I
> am worried it might break something else in the future that we are not
> seeing.
>
> And do note: all of the hierarchy examples I have seen so far that put
> us in this situation are nonsensical.
>
At this point my thinking is to apply your patch and then we can discuss
a longer-term solution. Cong?
cheers,
jamal