linux-kernel - Re: Getting empty callchain from perf_callchain

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <BB5859B6-6686-474A-8CB2-FAC183C92829@fb.com>
Date:   Mon, 20 May 2019 17:19:22 +0000
From:   Song Liu <songliubraving@...com>
To:     Kairui Song <kasong@...hat.com>
CC:     Peter Zijlstra <peterz@...radead.org>,
        lkml <linux-kernel@...r.kernel.org>,
        Kernel Team <Kernel-team@...com>,
        "Josh Poimboeuf" <jpoimboe@...hat.com>,
        Alexei Starovoitov <ast@...nel.org>,
        "Daniel Borkmann" <daniel@...earbox.net>,
        "bpf@...r.kernel.org" <bpf@...r.kernel.org>
Subject: Re: Getting empty callchain from perf_callchain_kernel() 

Sorry for previous empty email.. Clicked send by accident. 

> On May 19, 2019, at 11:06 AM, Kairui Song <kasong@...hat.com> wrote:
> 
> On Fri, May 17, 2019 at 5:10 PM Peter Zijlstra <peterz@...radead.org> wrote:
>> 
>> On Fri, May 17, 2019 at 04:15:39PM +0800, Kairui Song wrote:
>>> Hi, I think the actual problem is that bpf_get_stackid_tp (and maybe
>>> some other bfp functions) is now broken, or, strating an unwind
>>> directly inside a bpf program will end up strangely. It have following
>>> kernel message:
>> 
>> Urgh, what is that bpf_get_stackid_tp() doing to get the regs? I can't
>> follow.
> 
> bpf_get_stackid_tp will just use the regs passed to it from the trace
> point. And then it will eventually call perf_get_callchain to get the
> call chain.
> With a tracepoint we have the fake regs, so unwinder will start from
> where it is called, and use the fake regs as the indicator of the
> target frame it want, and keep unwinding until reached the actually
> callsite.
> 
> But if the stack trace is started withing a bpf func call then it's broken...
> 
> If the unwinder could trace back through the bpf func call then there
> will be no such problem.
> 
> For frame pointer unwinder, set the indicator flag (X86_EFLAGS_FIXED)
> before bpf call, and ensure bp is also dumped could fix it (so it will
> start using the regs for bpf calls, like before the commit
> d15d356887e7). But for ORC I don't see a clear way to fix the problem.
> First though is maybe dump some callee's regs for ORC (IP, BP, SP, DI,
> DX, R10, R13, else?) in the trace point handler, then use the flag to
> indicate ORC to do one more unwind (because can't get caller's regs,
> so get callee's regs instaed) before actually giving output?
> 
> I had a try, for framepointer unwinder, mark the indicator flag before
> calling bpf functions, and dump bp as well in the trace point. Then
> with frame pointer, it works, test passed:
> 
> diff --git a/arch/x86/include/asm/perf_event.h
> b/arch/x86/include/asm/perf_event.h
> index 1392d5e6e8d6..6f1192e9776b 100644
> --- a/arch/x86/include/asm/perf_event.h
> +++ b/arch/x86/include/asm/perf_event.h
> @@ -302,12 +302,25 @@ extern unsigned long perf_misc_flags(struct
> pt_regs *regs);
> 
> #include <asm/stacktrace.h>
> 
> +#ifdef CONFIG_FRAME_POINTER
> +static inline unsigned long caller_frame_pointer(void)
> +{
> +       return (unsigned long)__builtin_frame_address(1);
> +}
> +#else
> +static inline unsigned long caller_frame_pointer(void)
> +{
> +       return 0;
> +}
> +#endif
> +
> /*
>  * We abuse bit 3 from flags to pass exact information, see perf_misc_flags
>  * and the comment with PERF_EFLAGS_EXACT.
>  */
> #define perf_arch_fetch_caller_regs(regs, __ip)                {       \
>        (regs)->ip = (__ip);                                    \
> +       (regs)->bp = caller_frame_pointer();                    \
>        (regs)->sp = (unsigned long)__builtin_frame_address(0); \
>        (regs)->cs = __KERNEL_CS;                               \
>        regs->flags = 0;                                        \
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index abbd4b3b96c2..ca7b95ee74f0 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -8549,6 +8549,7 @@ void perf_trace_run_bpf_submit(void *raw_data,
> int size, int rctx,
>                               struct task_struct *task)
> {
>        if (bpf_prog_array_valid(call)) {
> +               regs->flags |= X86_EFLAGS_FIXED;
>                *(struct pt_regs **)raw_data = regs;
>                if (!trace_call_bpf(call, raw_data) || hlist_empty(head)) {
>                        perf_swevent_put_recursion_context(rctx);
> @@ -8822,6 +8823,8 @@ static void bpf_overflow_handler(struct perf_event *event,
>        int ret = 0;
> 
>        ctx.regs = perf_arch_bpf_user_pt_regs(regs);
> +       ctx.regs->flags |= X86_EFLAGS_FIXED;
> +
>        preempt_disable();
>        if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1))
>                goto out;
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index f92d6ad5e080..e1fa656677dc 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -497,6 +497,8 @@ u64 bpf_event_output(struct bpf_map *map, u64
> flags, void *meta, u64 meta_size,
>        };
> 
>        perf_fetch_caller_regs(regs);
> +       regs->flags |= X86_EFLAGS_FIXED;
> +
>        perf_sample_data_init(sd, 0, 0);
>        sd->raw = &raw;
> 
> @@ -831,6 +833,8 @@ BPF_CALL_5(bpf_perf_event_output_raw_tp, struct
> bpf_raw_tracepoint_args *, args,
>        struct pt_regs *regs = this_cpu_ptr(&bpf_raw_tp_regs);
> 
>        perf_fetch_caller_regs(regs);
> +       regs->flags |= X86_EFLAGS_FIXED;
> +
>        return ____bpf_perf_event_output(regs, map, flags, data, size);
> }
> 
> @@ -851,6 +855,8 @@ BPF_CALL_3(bpf_get_stackid_raw_tp, struct
> bpf_raw_tracepoint_args *, args,
>        struct pt_regs *regs = this_cpu_ptr(&bpf_raw_tp_regs);
> 
>        perf_fetch_caller_regs(regs);
> +       regs->flags |= X86_EFLAGS_FIXED;
> +
>        /* similar to bpf_perf_event_output_tp, but pt_regs fetched
> differently */
>        return bpf_get_stackid((unsigned long) regs, (unsigned long) map,
>                               flags, 0, 0);
> @@ -871,6 +877,8 @@ BPF_CALL_4(bpf_get_stack_raw_tp, struct
> bpf_raw_tracepoint_args *, args,
>        struct pt_regs *regs = this_cpu_ptr(&bpf_raw_tp_regs);
> 
>        perf_fetch_caller_regs(regs);
> +       regs->flags |= X86_EFLAGS_FIXED;
> +
>        return bpf_get_stack((unsigned long) regs, (unsigned long) buf,
>                             (unsigned long) size, flags, 0);
> }
> 

I think we really cannot do something above, as it leaks x86 specific
code into kernel/events and kernel/trace. 

> And *_raw_tp functions will fetch the regs by themselves so a bit
> trouble some...
> 
> ----------
> 
> And another approach is to make unwinder direct unwinding works when
> called by bpf (if possible and reasonable). I also had a nasty hacky
> experiment (posted below) to just force frame pointer unwinder's
> get_stack_info pass for bpf, then problem is fixed without any other
> workaround:
> 
> diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c
> index 753b8cfe8b8a..c0cfdf25f5ed 100644
> --- a/arch/x86/kernel/dumpstack_64.c
> +++ b/arch/x86/kernel/dumpstack_64.c
> @@ -166,7 +166,8 @@ int get_stack_info(unsigned long *stack, struct
> task_struct *task,
>        if (in_entry_stack(stack, info))
>                goto recursion_check;
> 
> -       goto unknown;
> +       goto recursion_check;
> 
> recursion_check:
>        /*

I think this one doesn't work for ORC either? 

Thanks,
Song

> Don't know how to do it the right way, or is it even possible for all
> unwinders yet...
> 
> -- 
> Best Regards,
> Kairui Song