linux-kernel - Re: WARNING in event_function

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAGXu5j+Z=ns37DdNU-mYtfio1cW4c25jyAGwFx74YFXKM5SD1A@mail.gmail.com>
Date:   Tue, 12 Feb 2019 19:40:12 -0800
From:   Kees Cook <keescook@...omium.org>
To:     Steven Rostedt <rostedt@...dmis.org>
Cc:     Dmitry Vyukov <dvyukov@...gle.com>,
        syzbot <syzbot+352bd10e338d9a90e5e0@...kaller.appspotmail.com>,
        Abderrahmane Benbachir <abderrahmane.benbachir@...ymtl.ca>,
        Arnaldo Carvalho de Melo <acme@...nel.org>,
        Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
        Jiri Olsa <jolsa@...hat.com>,
        LKML <linux-kernel@...r.kernel.org>,
        Ingo Molnar <mingo@...hat.com>,
        Namhyung Kim <namhyung@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>,
        syzkaller-bugs <syzkaller-bugs@...glegroups.com>
Subject: Re: WARNING in event_function_local

On Tue, Feb 12, 2019 at 6:14 PM Steven Rostedt <rostedt@...dmis.org> wrote:
> On Tue, 12 Feb 2019 17:27:56 -0800
> Kees Cook <keescook@...omium.org> wrote:
>
> > On Wed, May 9, 2018 at 10:18 PM Dmitry Vyukov <dvyukov@...gle.com> wrote:
> > >
> > > On Sat, Jan 27, 2018 at 8:58 PM, syzbot
> > > <syzbot+352bd10e338d9a90e5e0@...kaller.appspotmail.com> wrote:
> > > > Hello,
> > > >
> > > > syzbot hit the following crash on bpf-next commit
> > > > 8223967fe0b8eb2448cca5cfe3c64a0838e6f60d (Sat Jan 27 01:06:23 2018 +0000)
> > > > Merge branch 'fix-lpm-map'
> > > >
> > > > So far this crash happened 2 times on bpf-next.
> > > > C reproducer is attached.
> > > > syzkaller reproducer is attached.
> > > > Raw console output is attached.
> > > > compiler: gcc (GCC) 7.1.1 20170620
> > > > .config is attached.
> > > >
> > > > IMPORTANT: if you fix the bug, please add the following tag to the commit:
> > > > Reported-by: syzbot+352bd10e338d9a90e5e0@...kaller.appspotmail.com
> > > > It will help syzbot understand when the bug is fixed. See footer for
> > > > details.
> > > > If you forward the report, please keep this part and the footer.
> > >
> > >
> > > This was bisected to:
> > >
> > > commit 4ee7c60de83ac01fa4c33c55937357601631e8ad
> > > Author: Steven Rostedt (VMware) <rostedt@...dmis.org>
> > > Date:   Fri Mar 23 10:18:03 2018 -0400
> > >
> > >     init, tracing: Add initcall trace events
> > >
> > >
> > >
> > > > WARNING: CPU: 1 PID: 5105 at kernel/events/core.c:331
> > > > event_function_local.constprop.102+0x494/0x560 kernel/events/core.c:331
> > > > Kernel panic - not syncing: panic_on_warn set ...
> > > >
> > > > CPU: 1 PID: 5105 Comm: syzkaller318382 Not tainted 4.15.0-rc8+ #8
> > > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> > > > Google 01/01/2011
> > > > Call Trace:
> > > >  <IRQ>
> > > >  __dump_stack lib/dump_stack.c:17 [inline]
> > > >  dump_stack+0x194/0x257 lib/dump_stack.c:53
> > > >  panic+0x1e4/0x41c kernel/panic.c:183
> > > >  __warn+0x1dc/0x200 kernel/panic.c:547
> > > >  report_bug+0x211/0x2d0 lib/bug.c:184
> > > >  fixup_bug.part.11+0x37/0x80 arch/x86/kernel/traps.c:178
> > > >  fixup_bug arch/x86/kernel/traps.c:247 [inline]
> > > >  do_error_trap+0x2d7/0x3e0 arch/x86/kernel/traps.c:296
> > > >  do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
> > > >  invalid_op+0x22/0x40 arch/x86/entry/entry_64.S:1096
> > > > RIP: 0010:event_function_local.constprop.102+0x494/0x560
> > > > kernel/events/core.c:331
> > > > RSP: 0018:ffff8801db307d00 EFLAGS: 00010006
> > > > RAX: ffff8801d902a180 RBX: ffff8801d60beac0 RCX: ffffffff81862a54
> > > > RDX: 0000000000010000 RSI: ffff8801db3079f0 RDI: ffff8801d60bebc8
> > > > RBP: ffff8801db307da0 R08: 1ffff1003b660edf R09: 000000000000000c
> > > > R10: ffff8801db307bb8 R11: 0000000000000044 R12: ffff8801db330e00
> > > > R13: 1ffff1003b660fa3 R14: ffff8801d90841c0 R15: ffff8801d979ce40
> > > >  perf_event_disable_local kernel/events/core.c:1979 [inline]
> > > >  perf_pending_event+0x1e5/0x240 kernel/events/core.c:5446
> > > >  irq_work_run_list+0x184/0x240 kernel/irq_work.c:155
> > > >  irq_work_run+0x1d/0x50 kernel/irq_work.c:170
> > > >  smp_irq_work_interrupt+0x103/0x640 arch/x86/kernel/irq_work.c:21
> > > >  irq_work_interrupt+0xa9/0xb0 arch/x86/entry/entry_64.S:968
> > > >  </IRQ>
> >
> > I got curious why this report went silent, and bisected the "fix" to:
>
> It's still tagged as "TODO" in my INBOX, along with 1000 other emails
> tagged with TODO :-p

Totally understood. I went to go try to tackle this because it's one
of the weird "super easy to reproduce for half a year then suddenly
went silent" bugs that didn't have an obvious fix that went in.

>
> > Steve, what could possibly be happening here? Just adding more
> > tracepoints causes some kind of race where the task vs current test
> > trips in event_function_local():
> >
> >                         if (WARN_ON_ONCE(task != current))
> >                                 goto unlock;
>
> That's perf code. How are you getting there? What's special about this
> run? You have perf running?

Yes, the reproducer is in an 8-way parallel tight loop, doing:

  fd = syscall(__NR_perf_event_open, 0x20000140, 0, 0, -1, 0);
  syscall(__NR_ioctl, fd, 0x2402, 0x100000001);

I haven't decoded the structures that are passed in, but I'm at a loss
for how changing how many trace entries there are could impact the
race timing... O_o

> > Is this maybe just an unlucky condition with the event loop running in
> > an IRQ? Should the WARN be expected, or is running under an IRQ
> > unexpected?

Is perf expected to fire during an IRQ? The task == current test seems
suspicious if so...

-- 
Kees Cook