linux-kernel - Re: WARNING with LBR + precise_ip=2 + bpf_get

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20200618183302.GN576905@hirez.programming.kicks-ass.net>
Date:   Thu, 18 Jun 2020 20:33:02 +0200
From:   Peter Zijlstra <peterz@...radead.org>
To:     Song Liu <songliubraving@...com>
Cc:     linux-kernel <linux-kernel@...r.kernel.org>,
        Kan Liang <kan.liang@...ux.intel.com>,
        Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
        Andi Kleen <ak@...ux.intel.com>,
        Josh Poimboeuf <jpoimboe@...hat.com>,
        Tejun Heo <tj@...nel.org>, Andrii Nakryiko <andriin@...com>,
        Francis Ricci <fjricci@...com>
Subject: Re: WARNING with LBR + precise_ip=2 + bpf_get_stackid()

On Thu, Jun 18, 2020 at 05:25:24PM +0000, Song Liu wrote:

> PID: 80430  TASK: ffff888d92c62a80  CPU: 24  COMMAND: "25_scheduler"
> #0 [fffffe00004cfd88] machine_kexec at ffffffff8104a646
> #1 [fffffe00004cfdd8] __crash_kexec at ffffffff8114a82f
> #2 [fffffe00004cfea0] panic at ffffffff810ba99c
> #3 [fffffe00004cff20] df_debug at ffffffff8104e21d
> #4 [fffffe00004cff30] do_double_fault at ffffffff8101c9b4
> #5 [fffffe00004cff50] double_fault at ffffffff81c00b3e
> [exception RIP: vsnprintf+14]
> RIP: ffffffff81a3ee6e  RSP: fffffe00004d0ff8  RFLAGS: 00010082
> RAX: fffffe00004d1070  RBX: fffffe00004d1101  RCX: fffffe00004d1050
> RDX: ffffffff822c42f5  RSI: 000000007fffffff  RDI: fffffe00004d111a
> RBP: fffffe00004d10a0   R8: 0000000000000067   R9: ffffffff82209d05
> R10: ffffffff822a5fd0  R11: ffffffff822a6358  R12: 0000000000000019
> R13: 0000000000000001  R14: 0000000000000019  R15: ffffffff822b5fd6
> ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
> --- <DOUBLEFAULT exception stack> ---
> #6 [fffffe00004d0ff8] vsnprintf at ffffffff81a3ee6e
> bt: cannot transition from exception stack to current process stack:
> exception stack pointer: fffffe00004cfd88
> process stack pointer: fffffe00004d1048
> current stack base: ffffc900241a0000
> 
> 0xfffffe00004d1040 kallsyms_token_index
> 0xfffffe00004d1048 sprintf
> 0xfffffe00004d1078 kallsyms_lookup
> 0xfffffe00004d1098 kallsyms_names
> 0xfffffe00004d10a8 __sprint_symbol
> 0xfffffe00004d10d8 textbuf.47672
> 0xfffffe00004d10e0 always_kmsg_dump
> 0xfffffe00004d10f8 symbol_string
> 0xfffffe00004d11c0 textbuf.47672
> 0xfffffe00004d11e8 textbuf.47672
> 0xfffffe00004d11f8 textbuf.47672
> 0xfffffe00004d1200 always_kmsg_dump
> 0xfffffe00004d1210 kallsyms_token_index
> 0xfffffe00004d1218 vsnprintf
> 0xfffffe00004d1220 textbuf.47672
> 0xfffffe00004d1258 kallsyms_token_index
> 0xfffffe00004d1270 vscnprintf
> 0xfffffe00004d1280 vprintk_store
> 0xfffffe00004d1290 wake_up_klogd
> 0xfffffe00004d12b0 kallsyms_token_index
> 0xfffffe00004d12c8 vprintk_emit
> 0xfffffe00004d1300 entry_SYSCALL_64
> 0xfffffe00004d1318 vprintk_deferred
> 0xfffffe00004d1328 printk_deferred
> 0xfffffe00004d1360 entry_SYSCALL_64
> 0xfffffe00004d1380 __start_orc_unwind
> 0xfffffe00004d1388 unwind_next_frame.cold.7
> 0xfffffe00004d13c8 perf_callchain_kernel
> 0xfffffe00004d1418 entry_SYSCALL_64
> 0xfffffe00004d1450 get_perf_callchain
> 0xfffffe00004d14b0 bpf_get_stack

This, this is where it gets buggerd. This is a BPF problem, not a perf
problem.

You'll note that if you ask perf for a PERF_SAMPLE_CALLCHAIN we'll do
that early in setup_pebs_fixed_sample_data().

> 0xfffffe00004d1730 bpf_overflow_handler
> 0xfffffe00004d1788 __perf_event_overflow
> 0xfffffe00004d17a0 x86_pmu
> 0xfffffe00004d17b8 __intel_pmu_pebs_event
> 0xfffffe00004d17e0 setup_pebs_fixed_sample_data
> 0xfffffe00004d1890 entry_SYSCALL_64
> 0xfffffe00004d1ab8 intel_pmu_drain_pebs_nhm
> 0xfffffe00004d1ac0 setup_pebs_fixed_sample_data
> 0xfffffe00004d1bb8 handle_pmi_common
> 0xfffffe00004d1d00 insn_get_sib.part.5
> 0xfffffe00004d1d10 insn_get_displacement.part.6
> 0xfffffe00004d1d20 insn_get_immediate.part.7
> 0xfffffe00004d1d38 __set_pte_vaddr
> 0xfffffe00004d1d78 __native_set_fixmap
> 0xfffffe00004d1d80 ghes_copy_tofrom_phys
> 0xfffffe00004d1e08 intel_pmu_handle_irq
> 0xfffffe00004d1e48 perf_event_nmi_handler
> 0xfffffe00004d1e50 perf_event_nmi_handler_na.47078
> 0xfffffe00004d1e58 nmi_reason_lock
> 0xfffffe00004d1e60 nmi_handle
> 0xfffffe00004d1e70 nmi_desc
> 0xfffffe00004d1e90 nmi_reason_lock
> 0xfffffe00004d1eb8 default_do_nmi
> 0xfffffe00004d1ed8 do_nmi
> 0xfffffe00004d1ef0 nmi
> 0xfffffe00004d1f78 entry_SYSCALL_64
> 0xfffffe00004d1fa0 entry_SYSCALL_64
> 0xfffffe00004d1fd8 entry_SYSCALL_64
> 
> The kernel is running the normal NMI path and triggers the above mentioned warnings, 
> which calls into string formatting path for the warning messages. 

> However, NMIs run 
> off of its dedicated exception stack which is 4k in size, which is too small for the 
> full string formatting path to run,

Say what ?! Has printk() grown into such a freakish monster that it
cannot run on 4k stacks anymore?

I suppose that KSYM_SYMBOL_LEN on-stack buffer in symbol_string()
doesn't help.

> We noticed that this only happens with precise_ip >= 2. This is caused by 
> setup_pebs_fixed_sample_data() using pens->real_ip:
> 
> 	set_linear_ip(regs, pebs->real_ip);
> 
> For our use case, we do need precise_ip=2. So we would like suggestions to fix the
> warning and/or to avoid double fault. 

It all works as advertised. The BPF program is broken for calling
bpf_get_stack() on a PEBS event.