netdev - Re: Incomplete fix for recent bug in tc / hfsc

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <d23fe619-240a-4790-9edd-bec7ab22a974@gmail.com>
Date: Thu, 26 Jun 2025 10:08:31 +0200
From: Lion Ackermann <nnamrec@...il.com>
To: Jamal Hadi Salim <jhs@...atatu.com>
Cc: Cong Wang <xiyou.wangcong@...il.com>, netdev@...r.kernel.org,
 Jiri Pirko <jiri@...nulli.us>, Mingi Cho <mincho@...ori.io>
Subject: Re: Incomplete fix for recent bug in tc / hfsc

Hi,

On 6/25/25 4:22 PM, Jamal Hadi Salim wrote:
> On Tue, Jun 24, 2025 at 6:43 AM Lion Ackermann <nnamrec@...il.com> wrote:
>>
>> Hi,
>>
>> On 6/24/25 11:24 AM, Lion Ackermann wrote:
>>> Hi,
>>>
>>> On 6/24/25 6:41 AM, Cong Wang wrote:
>>>> On Mon, Jun 23, 2025 at 12:41:08PM +0200, Lion Ackermann wrote:
>>>>> Hello,
>>>>>
>>>>> I noticed the fix for a recent bug in sch_hfsc in the tc subsystem is
>>>>> incomplete:
>>>>>     sch_hfsc: Fix qlen accounting bug when using peek in hfsc_enqueue()
>>>>>     https://lore.kernel.org/all/20250518222038.58538-2-xiyou.wangcong@gmail.com/
>>>>>
>>>>> This patch also included a test which landed:
>>>>>     selftests/tc-testing: Add an HFSC qlen accounting test
>>>>>
>>>>> Basically running the included test case on a sanitizer kernel or with
>>>>> slub_debug=P will directly reveal the UAF:
>>>>
>>>> Interesting, I have SLUB debugging enabled in my kernel config too:
>>>>
>>>> CONFIG_SLUB_DEBUG=y
>>>> CONFIG_SLUB_DEBUG_ON=y
>>>> CONFIG_SLUB_RCU_DEBUG=y
>>>>
>>>> But I didn't catch this bug.
>>>>
>>>
>>> Technically the class deletion step which triggered the sanitizer was not
>>> present in your testcase. The testcase only left the stale pointer which was
>>> never accessed though.
>>>
>>>>> To be completely honest I do not quite understand the rationale behind the
>>>>> original patch. The problem is that the backlog corruption propagates to
>>>>> the parent _before_ parent is even expecting any backlog updates.
>>>>> Looking at f.e. DRR: Child is only made active _after_ the enqueue completes.
>>>>> Because HFSC is messing with the backlog before the enqueue completed,
>>>>> DRR will simply make the class active even though it should have already
>>>>> removed the class from the active list due to qdisc_tree_backlog_flush.
>>>>> This leaves the stale class in the active list and causes the UAF.
>>>>>
>>>>> Looking at other qdiscs the way DRR handles child enqueues seems to resemble
>>>>> the common case. HFSC calling dequeue in the enqueue handler violates
>>>>> expectations. In order to fix this either HFSC has to stop using dequeue or
>>>>> all classful qdiscs have to be updated to catch this corner case where
>>>>> child qlen was zero even though the enqueue succeeded. Alternatively HFSC
>>>>> could signal enqueue failure if it sees child dequeue dropping packets to
>>>>> zero? I am not sure how this all plays out with the re-entrant case of
>>>>> netem though.
>>>>
>>>> I think this may be the same bug report from Mingi in the security
>>>> mailing list. I will take a deep look after I go back from Open Source
>>>> Summit this week. (But you are still very welcome to work on it by
>>>> yourself, just let me know.)
>>>>
>>>> Thanks!
>>>
>>>> My suggestion is we go back to a proposal i made a few moons back (was
>>>> this in a discussion with you? i dont remember): create a mechanism to
>>>> disallow certain hierarchies of qdiscs based on certain attributes,
>>>> example in this case disallow hfsc from being the ancestor of "qdiscs that may
>>>> drop during peek" (such as netem). Then we can just keep adding more
>>>> "disallowed configs" that will be rejected via netlink. Similar idea
>>>> is being added to netem to disallow double duplication, see:
>>>> https://lore.kernel.org/netdev/20250622190344.446090-1-will@willsroot.io/
>>>>
>>>> cheers,
>>>> jamal
>>>
>>> I vaguely remember Jamal's proposal from a while back, and I believe there was
>>> some example code for this approach already?
>>> Since there is another report you have a better overview, so it is probably
>>> best you look at it first. In the meantime I can think about the solution a
>>> bit more and possibly draft something if you wish.
>>>
>>> Thanks,
>>> Lion
>>
>> Actually I was intrigued, what do you think about addressing the root of the
>> use-after-free only and ignore the backlog corruption (kind of). After the
>> recent patches where qlen_notify may get called multiple times, we could simply
>> loosen qdisc_tree_reduce_backlog to always notify when the qdisc is empty.
>> Since deletion of all qdiscs will run qdisc_reset / qdisc_purge_queue at one
>> point or another, this should always catch left-overs. And we need not care
>> about all the complexities involved of keeping the backlog right and / or
>> prevent certain hierarchies which seems rather tedious.
>> This requires some more testing, but I was imagining something like this:
>>
>> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
>> --- a/net/sched/sch_api.c
>> +++ b/net/sched/sch_api.c
>> @@ -780,15 +780,12 @@ static u32 qdisc_alloc_handle(struct net_device *dev)
>>
>>  void qdisc_tree_reduce_backlog(struct Qdisc *sch, int n, int len)
>>  {
>> -       bool qdisc_is_offloaded = sch->flags & TCQ_F_OFFLOADED;
>>         const struct Qdisc_class_ops *cops;
>>         unsigned long cl;
>>         u32 parentid;
>>         bool notify;
>>         int drops;
>>
>> -       if (n == 0 && len == 0)
>> -               return;
>>         drops = max_t(int, n, 0);
>>         rcu_read_lock();
>>         while ((parentid = sch->parent)) {
>> @@ -797,17 +794,8 @@ void qdisc_tree_reduce_backlog(struct Qdisc *sch, int n, int len)
>>
>>                 if (sch->flags & TCQ_F_NOPARENT)
>>                         break;
>> -               /* Notify parent qdisc only if child qdisc becomes empty.
>> -                *
>> -                * If child was empty even before update then backlog
>> -                * counter is screwed and we skip notification because
>> -                * parent class is already passive.
>> -                *
>> -                * If the original child was offloaded then it is allowed
>> -                * to be seem as empty, so the parent is notified anyway.
>> -                */
>> -               notify = !sch->q.qlen && !WARN_ON_ONCE(!n &&
>> -                                                      !qdisc_is_offloaded);
>> +               /* Notify parent qdisc only if child qdisc becomes empty. */
>> +               notify = !sch->q.qlen;
>>                 /* TODO: perform the search on a per txq basis */
>>                 sch = qdisc_lookup(qdisc_dev(sch), TC_H_MAJ(parentid));
>>                 if (sch == NULL) {
>> @@ -816,6 +804,9 @@ void qdisc_tree_reduce_backlog(struct Qdisc *sch, int n, int len)
>>                 }
>>                 cops = sch->ops->cl_ops;
>>                 if (notify && cops->qlen_notify) {
>> +                       /* Note that qlen_notify must be idempotent as it may get called
>> +                        * multiple times.
>> +                        */
>>                         cl = cops->find(sch, parentid);
>>                         cops->qlen_notify(sch, cl);
>>                 }
>>
> 
> I believe this will fix the issue. My concern is we are not solving
> the root cause. I also posted a bunch of fixes on related issues for
> something Mingi Cho (on Cc) found - see attachments, i am not in favor
> of these either.
> Most of these setups are nonsensical. After seeing so many of these my
> view is we start disallowing such hierarchies.
> 
> cheers,
> jamal

I would also disagree with the attached patches for various reasons:
- The QFQ patch relies on packet size backlog, which is not to be
  trusted because of several sources that may make this unreliable
  (netem, size tables, GSO, etc.)
- In the TBF variant the ret may get overwritten during the loop,
  so it only relies on the final packet status. I would not trust
  this always working either.
- DRR fix seems fine, but it still requires all other qdiscs to 
  be correct (and something similar needs to be applied to all
  classfull qdiscs?)
- The changes to qdisc_tree_reduce_backlog do not really make sense
  to me I must be missing something here..

What do you think the root cause is here? AFAIK what all the issues 
have in common is that eventually qlen_notify is _not_ called, 
thus leaving stale class pointers. Naturally the consequence 
could be to simply always call qlen_notify on class deletion and 
make classfull qdiscs aware that it may get called on inactive 
classes. And this is what I tried with my proposal.
This does not solve the backlog issues though. But the pressing 
issue seems to be the uaf and not the statistic counters?

My concern with preventing certain hierarchies is that we would 
hide the backlog issues and we would be chasing bad hierarchies.
Still it would also solve all the problems eventually I guess.

Thanks,
Lion