linux-kernel - Re: [syzbot] [net?] INFO: rcu detected stall in unix

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAM0EoMnux5JjmqYM_ErBZD4x3xkgYOEyn3R4oX6uBW-+OkE_sQ@mail.gmail.com>
Date:   Thu, 17 Aug 2023 12:30:26 -0400
From:   Jamal Hadi Salim <jhs@...atatu.com>
To:     Vladimir Oltean <vladimir.oltean@....com>
Cc:     Jakub Kicinski <kuba@...nel.org>,
        Cong Wang <xiyou.wangcong@...il.com>,
        Pedro Tammela <pctammela@...atatu.com>,
        Victor Nogueira <victor@...atatu.com>,
        syzbot <syzbot+a3618a167af2021433cd@...kaller.appspotmail.com>,
        bpf@...r.kernel.org, brauner@...nel.org, davem@...emloft.net,
        edumazet@...gle.com, jiri@...dia.com, linux-kernel@...r.kernel.org,
        netdev@...r.kernel.org, pabeni@...hat.com,
        syzkaller-bugs@...glegroups.com,
        Vinicius Costa Gomes <vinicius.gomes@...el.com>
Subject: Re: [syzbot] [net?] INFO: rcu detected stall in unix_release

On Wed, Aug 16, 2023 at 6:58 PM Vladimir Oltean <vladimir.oltean@....com> wrote:
>
> Hi Jakub,
>
> On Tue, Aug 15, 2023 at 02:28:21PM +0300, Vladimir Oltean wrote:
> > On Mon, Aug 14, 2023 at 04:03:03PM -0700, Jakub Kicinski wrote:
> > > Hi Vladimir, any ideas for this one?
> > > The bisection looks pooped, FWIW, looks like a taprio inf loop.
> >
> > I'm looking into it.
>
> Here's what I've found out and what help I'll need going forward.
>
> Indeed there is an infinite loop in taprio_dequeue() -> taprio_dequeue_tc_priority(),
> leading to an RCU stall.
>
> Short description of taprio_dequeue_tc_priority(): it cycles
> q->cur_txq[tc] in the range between [ offset, offset + count ), where:
>
>         int offset = dev->tc_to_txq[tc].offset;
>         int count = dev->tc_to_txq[tc].count;
>
> with the initial q->cur_txq[tc], aka the "first_txq" variable, being set
> by the control path: taprio_change(), also called by taprio_init():
>
>         if (mqprio) {
>                 (...)
>                 for (i = 0; i < mqprio->num_tc; i++) {
>                         (...)
>                         q->cur_txq[i] = mqprio->offset[i];
>                 }
>         }
>
> In the buggy case that leads to the RCU stall, the line in taprio_change()
> which sets q->cur_txq[i] never gets executed. So first_txq will be 0
> (pre-initialized memory), and if that's outside of the [ offset, offset + count )
> range that taprio_dequeue_tc_priority() -> taprio_next_tc_txq() expects
> to cycle through, the kernel is toast.
>
> The nitty gritty of that is boring. What's not boring is how come the
> control path skips the q->cur_txq[i] assignment. It's because "mqprio"
> is NULL, and that's because taprio_change() (important: also tail-called
> from taprio_init()) has this logic to detect a change in the traffic
> class settings of the device, compared to the passed TCA_TAPRIO_ATTR_PRIOMAP
> netlink attribute:
>
>         /* no changes - no new mqprio settings */
>         if (!taprio_mqprio_cmp(q, dev, mqprio))
>                 mqprio = NULL;
>
> And what happens is that:
> - we go through taprio_init()
> - a TCA_TAPRIO_ATTR_PRIOMAP gets passed to us
> - taprio_mqprio_cmp() sees that there's no change compared to the
>   netdev's existing traffic class config
> - taprio_change() sets "mqprio" to NULL, ignoring the given
>   TCA_TAPRIO_ATTR_PRIOMAP
> - we skip modifying q->cur_txq[i], as if it was a taprio_change() call
>   that came straight from Qdisc_ops :: change(), rather than what it
>   really is: one from Qdisc_ops :: init()
>
> So the next question: why does taprio_mqprio_cmp() see that there's no
> change? Because there is no change. When Qdisc_ops :: init() is called,
> the netdev really has a non-zero dev->num_tc, prio_tc_map, tc_to_txq and
> all that.
>
> But why? A previous taprio, if that existed, will call taprio_destroy()
> -> netdev_reset_tc(), so it won't leave state behind that will hinder
> the current taprio. Checking for stuff in the netdev state is just so
> that taprio_change() can distinguish between a direct Qdisc_ops :: change()
> call vs one coming from init().
>
> Finally, here's where the syzbot repro becomes relevant. It crafts the
> RTM_NEWQDISC netlink message in such a way, that it makes tc_modify_qdisc()
> in sch_api.c call a Qdisc_ops sequence with which taprio wasn't written
> in mind.
>
> With "tc qdisc replace && tc qdisc replace", tc_modify_qdisc() is
> supposed to call init() the first time and replace() the second time.
> What the repro does is make the above sequence call two init() methods
> back to back.
>
> To create an iproute2-based reproducer rather than the C one provided by
> syzbot, we need this iproute2 change:
>
> diff --git a/tc/tc_qdisc.c b/tc/tc_qdisc.c
> index 56086c43b7fa..20d9622b6bf3 100644
> --- a/tc/tc_qdisc.c
> +++ b/tc/tc_qdisc.c
> @@ -448,6 +448,8 @@ int do_qdisc(int argc, char **argv)
>                 return tc_qdisc_modify(RTM_NEWQDISC, NLM_F_EXCL|NLM_F_CREATE, argc-1, argv+1);
>         if (matches(*argv, "change") == 0)
>                 return tc_qdisc_modify(RTM_NEWQDISC, 0, argc-1, argv+1);
> +       if (strcmp(*argv, "replace-exclusive") == 0)
> +               return tc_qdisc_modify(RTM_NEWQDISC, NLM_F_CREATE|NLM_F_REPLACE|NLM_F_EXCL, argc-1, argv+1);
>         if (matches(*argv, "replace") == 0)
>                 return tc_qdisc_modify(RTM_NEWQDISC, NLM_F_CREATE|NLM_F_REPLACE, argc-1, argv+1);
>         if (matches(*argv, "link") == 0)
>
> which basically implements a crafted alternative of "tc qdisc replace"
> which also sets the NLM_F_EXCL flag in n->nlmsg_flags.
>
> Then, the minimal repro script can simply be expressed as:
>
> #!/bin/bash
>
> ip link add veth0 numtxqueues 16 numrxqueues 16 type veth peer name veth1
> ip link set veth0 up && ip link set veth1 up
>
> for ((i = 0; i < 2; i++)); do
>         tc qdisc replace-exclusive dev veth0 root stab overhead 24 taprio \
>                 num_tc 2 map 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 \
>                 queues 8@0 4@8 \
>                 clockid REALTIME \
>                 base-time 0 \
>                 cycle-time 61679 \
>                 sched-entry S 0 54336 \
>                 sched-entry S 0x8a27 7343 \
>                 max-sdu 18343 18343 \
>                 flags 0
> done
>
> ip link del veth0
>
> Here's how things go sideways if sch_api.c goes through the Qdisc_ops :: init()
> code path instead of change() for the second Qdisc.
>
> The first taprio_attach() (i=0) will attach the root taprio Qdisc (aka itself)
> to all netdev TX queues, and qdisc_put() the existing pfifo default Qdiscs.
>
> When the taprio_init() method executes for i=1, taprio_destroy() hasn't
> been called yet. So neither has netdev_reset_tc() been called, and
> that's part of the problem (the one that causes the infinite loop in
> dequeue()).
>
> But, taprio_destroy() will finally get called for the initial taprio
> created at i=0. The call trace looks like this:
>
>  rtnetlink_rcv_msg()
>  -> tc_modify_qdisc()
>     -> qdisc_graft()
>        -> taprio_attach() for i=1
>           -> qdisc_put() for the old Qdiscs attached to the TX queues, aka the taprio from i=0
>              -> __qdisc_destroy()
>                 -> taprio_destroy()
>
> What's more interesting is that the late taprio_destroy() for i=0
> effectively destroys the netdev state - the netdev_reset_tc() call -
> done by taprio_init() -> taprio_change() for i=1, and that can't be
> too good, either. Even if there's no immediately observable hang, the
> traffic classes are reset even though the Qdisc thinks they aren't.
>
> Taprio isn't the only one affected by this. Mqprio also has the pattern
> of calling netdev_set_num_tc() from Qdisc_ops :: init() and destroy().
> But with the possibility of destroy(i=0) not being serialized with
> init(i=1), that's buggy.
>
> Sorry for the long message. This is where I'm at. For me, this is the
> bottom of where things are intuitive. I don't understand what is
> considered to be expected behavior from tc_modify_qdisc(), and what is
> considered to be sane Qdisc-facing API, and I need help.
>
> I've completely stopped debugging when I saw that the code enters
> through this path at i=1, so I really can't tell you more:
>
>                                 /* This magic test requires explanation.
>                                  *
>                                  *   We know, that some child q is already
>                                  *   attached to this parent and have choice:
>                                  *   either to change it or to create/graft new one.
>                                  *
>                                  *   1. We are allowed to create/graft only
>                                  *   if CREATE and REPLACE flags are set.
>                                  *
>                                  *   2. If EXCL is set, requestor wanted to say,
>                                  *   that qdisc tcm_handle is not expected
>                                  *   to exist, so that we choose create/graft too.
>                                  *
>                                  *   3. The last case is when no flags are set.
>                                  *   Alas, it is sort of hole in API, we
>                                  *   cannot decide what to do unambiguously.
>                                  *   For now we select create/graft, if
>                                  *   user gave KIND, which does not match existing.
>                                  */
>                                 if ((n->nlmsg_flags & NLM_F_CREATE) &&
>                                     (n->nlmsg_flags & NLM_F_REPLACE) &&
>                                     ((n->nlmsg_flags & NLM_F_EXCL) ||
>                                      (tca[TCA_KIND] &&
>                                       nla_strcmp(tca[TCA_KIND], q->ops->id)))) {
>                                         netdev_err(dev, "magic test\n");
>                                         goto create_n_graft;
>                                 }
>
> I've added more Qdisc people to the discussion. The problem description
> is pretty much self-contained in this email, and going to the original
> syzbot report won't bring much else.
>

I will take a look tommorow.

cheers,
jamal

> There are multiple workarounds that can be done in taprio (and mqprio)
> depending on what is considered as being sane API. Though I don't want
> to get ahead of myself. Maybe there is a way to fast-forward the
> qdisc_destroy() of the previous taprio so it doesn't overlap with the
> new one's qdisc_create().