[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <ED7B9430-6489-4260-B3C5-9CFA2E3AA87A@fb.com>
Date: Thu, 18 Jun 2020 17:25:24 +0000
From: Song Liu <songliubraving@...com>
To: linux-kernel <linux-kernel@...r.kernel.org>,
Kan Liang <kan.liang@...ux.intel.com>,
Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
Andi Kleen <ak@...ux.intel.com>,
"Josh Poimboeuf" <jpoimboe@...hat.com>,
Peter Zijlstra <peterz@...radead.org>
CC: Tejun Heo <tj@...nel.org>, Andrii Nakryiko <andriin@...com>,
Francis Ricci <fjricci@...com>
Subject: WARNING with LBR + precise_ip=2 + bpf_get_stackid()
Hi,
We are debugging some WARNING with LBR, precise_ip=2 and bpf_get_stackid(), like:
[36000.334284] WARNING: stack recursion on stack type 1
[36000.334288] WARNING: can't access registers at syscall_return_via_sysret+0x12/0x7f
This happens when we attach BPF program to perf_event with:
struct perf_event_attr attr = {
.type = PERF_TYPE_HARDWARE,
.config = PERF_COUNT_HW_CPU_CYCLES,
.precise_ip = 2,
.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_BRANCH_STACK,
.branch_sample_type = PERF_SAMPLE_BRANCH_USER |
PERF_SAMPLE_BRANCH_NO_FLAGS |
PERF_SAMPLE_BRANCH_NO_CYCLES |
PERF_SAMPLE_BRANCH_CALL_STACK,
.sample_period = 500000,
.size = sizeof(struct perf_event_attr),
};
and calls bpf_get_stackid() from the BPF program. I pushed a reproducer to
https://git.kernel.org/pub/scm/linux/kernel/git/song/linux.git/log/?h=lbr_issue
under tools/bpf/ministrobe/. It requires latest CLANG to build. I also included
the binary that should just run on CentOS 7.
This warning is usually harmless, but it may also cause double fault #DF. Here are
some analysis from Tejun:
"""
The exact pathway is still under investigation but it triggered a sanity warning in
the kernel backtrace acquisition code looking like the following:
[ 2051.040745] WARNING: stack recursion on stack type 1
This is from `arch/x86/kernel/dumpstack_64.c::get_stack_info()` and is warning that
while walking the stack to obtain backtrace it encountered the same type of stack
more than once. The warning in itself isn't critical. It's a sanity check and when
the check fails, it just stops walking the stack and it is likely that the warning
is being triggered spuriously given that more machines which trigger these warnings
continue to run fine than not, which is unlikely with actual stack corruptions. A
possibility is that stack acquisition is happening from a context that the code
can't handle.
However, on some machines, this caused #DF (double fault) and thus immediate crash.
Backtracing a DF'd kernel is a bit cumbersome, so the following contains spurious
entries, but it does show what happened:
PID: 80430 TASK: ffff888d92c62a80 CPU: 24 COMMAND: "25_scheduler"
#0 [fffffe00004cfd88] machine_kexec at ffffffff8104a646
#1 [fffffe00004cfdd8] __crash_kexec at ffffffff8114a82f
#2 [fffffe00004cfea0] panic at ffffffff810ba99c
#3 [fffffe00004cff20] df_debug at ffffffff8104e21d
#4 [fffffe00004cff30] do_double_fault at ffffffff8101c9b4
#5 [fffffe00004cff50] double_fault at ffffffff81c00b3e
[exception RIP: vsnprintf+14]
RIP: ffffffff81a3ee6e RSP: fffffe00004d0ff8 RFLAGS: 00010082
RAX: fffffe00004d1070 RBX: fffffe00004d1101 RCX: fffffe00004d1050
RDX: ffffffff822c42f5 RSI: 000000007fffffff RDI: fffffe00004d111a
RBP: fffffe00004d10a0 R8: 0000000000000067 R9: ffffffff82209d05
R10: ffffffff822a5fd0 R11: ffffffff822a6358 R12: 0000000000000019
R13: 0000000000000001 R14: 0000000000000019 R15: ffffffff822b5fd6
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <DOUBLEFAULT exception stack> ---
#6 [fffffe00004d0ff8] vsnprintf at ffffffff81a3ee6e
bt: cannot transition from exception stack to current process stack:
exception stack pointer: fffffe00004cfd88
process stack pointer: fffffe00004d1048
current stack base: ffffc900241a0000
0xfffffe00004d1040 kallsyms_token_index
0xfffffe00004d1048 sprintf
0xfffffe00004d1078 kallsyms_lookup
0xfffffe00004d1098 kallsyms_names
0xfffffe00004d10a8 __sprint_symbol
0xfffffe00004d10d8 textbuf.47672
0xfffffe00004d10e0 always_kmsg_dump
0xfffffe00004d10f8 symbol_string
0xfffffe00004d11c0 textbuf.47672
0xfffffe00004d11e8 textbuf.47672
0xfffffe00004d11f8 textbuf.47672
0xfffffe00004d1200 always_kmsg_dump
0xfffffe00004d1210 kallsyms_token_index
0xfffffe00004d1218 vsnprintf
0xfffffe00004d1220 textbuf.47672
0xfffffe00004d1258 kallsyms_token_index
0xfffffe00004d1270 vscnprintf
0xfffffe00004d1280 vprintk_store
0xfffffe00004d1290 wake_up_klogd
0xfffffe00004d12b0 kallsyms_token_index
0xfffffe00004d12c8 vprintk_emit
0xfffffe00004d1300 entry_SYSCALL_64
0xfffffe00004d1318 vprintk_deferred
0xfffffe00004d1328 printk_deferred
0xfffffe00004d1360 entry_SYSCALL_64
0xfffffe00004d1380 __start_orc_unwind
0xfffffe00004d1388 unwind_next_frame.cold.7
0xfffffe00004d13c8 perf_callchain_kernel
0xfffffe00004d1418 entry_SYSCALL_64
0xfffffe00004d1450 get_perf_callchain
0xfffffe00004d14b0 bpf_get_stack
0xfffffe00004d1730 bpf_overflow_handler
0xfffffe00004d1788 __perf_event_overflow
0xfffffe00004d17a0 x86_pmu
0xfffffe00004d17b8 __intel_pmu_pebs_event
0xfffffe00004d17e0 setup_pebs_fixed_sample_data
0xfffffe00004d1890 entry_SYSCALL_64
0xfffffe00004d1ab8 intel_pmu_drain_pebs_nhm
0xfffffe00004d1ac0 setup_pebs_fixed_sample_data
0xfffffe00004d1bb8 handle_pmi_common
0xfffffe00004d1d00 insn_get_sib.part.5
0xfffffe00004d1d10 insn_get_displacement.part.6
0xfffffe00004d1d20 insn_get_immediate.part.7
0xfffffe00004d1d38 __set_pte_vaddr
0xfffffe00004d1d78 __native_set_fixmap
0xfffffe00004d1d80 ghes_copy_tofrom_phys
0xfffffe00004d1e08 intel_pmu_handle_irq
0xfffffe00004d1e48 perf_event_nmi_handler
0xfffffe00004d1e50 perf_event_nmi_handler_na.47078
0xfffffe00004d1e58 nmi_reason_lock
0xfffffe00004d1e60 nmi_handle
0xfffffe00004d1e70 nmi_desc
0xfffffe00004d1e90 nmi_reason_lock
0xfffffe00004d1eb8 default_do_nmi
0xfffffe00004d1ed8 do_nmi
0xfffffe00004d1ef0 nmi
0xfffffe00004d1f78 entry_SYSCALL_64
0xfffffe00004d1fa0 entry_SYSCALL_64
0xfffffe00004d1fd8 entry_SYSCALL_64
The kernel is running the normal NMI path and triggers the above mentioned warnings,
which calls into string formatting path for the warning messages. However, NMIs run
off of its dedicated exception stack which is 4k in size, which is too small for the
full string formatting path to run, so it ends up overflowing the stack while running
NMI and the machine enters an unrecoverable state.
"""
We found the issue in multiple versions of kernel, from 4.16 to 5.8-rc1. The exact
WARNING message varies among different versions. Latest kernel put these warnings
under WARN_ONCE(), so they are not flushing dmesg.
We noticed that this only happens with precise_ip >= 2. This is caused by
setup_pebs_fixed_sample_data() using pens->real_ip:
set_linear_ip(regs, pebs->real_ip);
For our use case, we do need precise_ip=2. So we would like suggestions to fix the
warning and/or to avoid double fault.
Please share your thoughts on this.
Thanks in advance!
Song
Powered by blists - more mailing lists