netdev - Re: Question: Patch:("net: sched: cbq: dont intepret cls results when asked to drop") may be not bug for branch LTS 5.10

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAM0EoMmhHns_bY-JsXvrUkRhqu3xTDaRNk+cP-x=O_848R0W3Q@mail.gmail.com>
Date:   Wed, 18 Jan 2023 08:06:12 -0500
From:   Jamal Hadi Salim <jhs@...atatu.com>
To:     Davide Caratti <dcaratti@...hat.com>
Cc:     Kyle Zeng <zengyhkyle@...il.com>,
        shaozhengchao <shaozhengchao@...wei.com>,
        David Miller <davem@...emloft.net>,
        Sasha Levin <sashal@...nel.org>,
        Cong Wang <xiyou.wangcong@...il.com>,
        Jiri Pirko <jiri@...nulli.us>,
        Jakub Kicinski <kuba@...nel.org>,
        Eric Dumazet <edumazet@...gle.com>,
        Paolo Abeni <pabeni@...hat.com>,
        Marcelo Ricardo Leitner <marcelo.leitner@...il.com>,
        Linux Kernel Network Developers <netdev@...r.kernel.org>
Subject: Re: Question: Patch:("net: sched: cbq: dont intepret cls results when
 asked to drop") may be not bug for branch LTS 5.10

The reproducer the Kyle included back then was not useful - it seemed
like a cutnpaste
from some syzkaller dashboard (and didnt compile); however, for this one issue,
you can reproduce the problem by creating the infinite loop setup that
Davide describes.

The main issue is bigger than tcf_classify: It has to do with
interpretation of tcf_result
and the return codes.
I reviewed all consumers of tcf_results and only 3 (all happened to be qdiscs)
were fixed in that patch set. Note consumers include all objects in
the hierarchy
including classifiers and action.

Typically, the LinuxWay(tm) of cutting and pasting what other people before you
did works - but sometimes people forget environmental rules even when they are
documented. The main environmental rule that was at stake here is the return
(verdict) code said to drop the packet. The validity of tcf_result in
such a case is
questionable and setting it to 0 was irrelevant. So that is all the
fix had to do for -net.

The current return code is a "verdict" on what happened. Given that
there is potential
to misinterpret - as was seen here - a safer approach is to get the
return code to be either
an error/success code(eg ELOOP for the example being quoted) since
that is a more
common pattern and we store the "verdict code" in tcf_result (TC_ACT_SHOT).
I was planning to send an RFC patch for that.

I am still not clear on the correlation that Zhengchao Shao was making between
Davide's patch and this issue...

cheers,
jamal


On Wed, Jan 18, 2023 at 6:06 AM Davide Caratti <dcaratti@...hat.com> wrote:
>
> hello,
>
> On Tue, Jan 17, 2023 at 05:10:58PM -0700, Kyle Zeng wrote:
> > Hi Zhengchao,
> >
> > I'm the finder of the vulnerability. In my initial report, there was a
> > more detailed explanation of why this bug happened. But it got left
> > out in the commit message.
> > So, I'll explain it here and see whether people want to patch the
> > actual root cause of the crash.
> >
> > The underlying bug that this patch was trying to address is actually
> > in `__tcf_classify`. Notice that `struct tcf_result` is actually a
> > union type, so whenever the kernel sets res.goto_tp, it also sets
> > res.class.
>
> From what I see/remember, 'res' (struct tcf_result) is unassigned
> unless the packet is matched by a classifier (i.e. it does not return
> TC_ACT_UNSPEC).
>
> When this match happens (__tcf_classify returns non-negative) and the
> control action says TC_ACT_GOTO_CHAIN, res->goto_tp is written.
> Like you say, 'res.class' is written as well because it's a union.
>
> > And this can happen inside `tcf_action_goto_chain_exec`. In
> > other words, `tcf_action_goto_chain_exec` will set res.class. Notice
> > that goto_chain can point back to itself, which causes an infinite
> > loop. To avoid the infinite loop situation, `__tcf_classify` checks
> > how many times the loop has been executed
> > (https://elixir.bootlin.com/linux/v6.1/source/net/sched/cls_api.c#L1586),
> > if it is more than a specific number, it will mark the result as
> > TC_ACT_SHOT and then return:
> >
> > if (unlikely(limit++ >= max_reclassify_loop)) {
> >     ...
> >     return TC_ACT_SHOT;
> > }
>
> maybe there is an easier reproducer, something made of 2 TC actions.
> The first one goes to a valid chain, and then the second one (executed from
> within the chain) drops the packet. I think that unpatched CBQ scheduler
> will find 'res.class' with a value also there.
>
> > However, when it returns in the infinite loop handler, it forgets to
> > clear whatever is in the `res` variable, which still holds pointers in
> > `goto_tp`. As a result, cbq_classify will think it is a valid
> > `res.class` and causes type confusion.
> >
> > My initial proposed patch was to memset `res` before `return
> > TC_ACT_SHOT` in `__tcf_classify`, but it didn't get merged. But I
> > guess the merged patch is more generic.
>
> The merged patch looks good to me; however, I wonder if it's sufficient.
> If I well read the code, there is still the possibility of hitting the
> same problem on a patched kernel when TC_ACT_TRAP / TC_ACT_STOLEN is
> returned after a 'goto chain' when the qdisc is CBQ.
>
> I like Jamal's idea of sharing the reproducer :)


> thanks!
> --
> davide
>
>
>