linux-kernel - WARNING with LBR + precise_ip=2 + bpf_get

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <ED7B9430-6489-4260-B3C5-9CFA2E3AA87A@fb.com>
Date:   Thu, 18 Jun 2020 17:25:24 +0000
From:   Song Liu <songliubraving@...com>
To:     linux-kernel <linux-kernel@...r.kernel.org>,
        Kan Liang <kan.liang@...ux.intel.com>,
        Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
        Andi Kleen <ak@...ux.intel.com>,
        "Josh Poimboeuf" <jpoimboe@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>
CC:     Tejun Heo <tj@...nel.org>, Andrii Nakryiko <andriin@...com>,
        Francis Ricci <fjricci@...com>
Subject: WARNING with LBR + precise_ip=2 + bpf_get_stackid()

Hi, 

We are debugging some WARNING with LBR, precise_ip=2 and bpf_get_stackid(), like:

[36000.334284] WARNING: stack recursion on stack type 1
[36000.334288] WARNING: can't access registers at syscall_return_via_sysret+0x12/0x7f


This happens when we attach BPF program to perf_event with:

        struct perf_event_attr attr = {
                .type = PERF_TYPE_HARDWARE,
                .config = PERF_COUNT_HW_CPU_CYCLES,
                .precise_ip = 2,
                .sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_BRANCH_STACK,
                .branch_sample_type = PERF_SAMPLE_BRANCH_USER |
                        PERF_SAMPLE_BRANCH_NO_FLAGS |
                        PERF_SAMPLE_BRANCH_NO_CYCLES |
                        PERF_SAMPLE_BRANCH_CALL_STACK,
                .sample_period = 500000,
                .size = sizeof(struct perf_event_attr),
        };

and calls bpf_get_stackid() from the BPF program. I pushed a reproducer to 

  https://git.kernel.org/pub/scm/linux/kernel/git/song/linux.git/log/?h=lbr_issue

under tools/bpf/ministrobe/. It requires latest CLANG to build. I also included 
the binary that should just run on CentOS 7. 


This warning is usually harmless, but it may also cause double fault #DF. Here are 
some analysis from Tejun:

"""
The exact pathway is still under investigation but it triggered a sanity warning in 
the kernel backtrace acquisition code looking like the following:
[ 2051.040745] WARNING: stack recursion on stack type 1
This is from `arch/x86/kernel/dumpstack_64.c::get_stack_info()` and is warning that
 while walking the stack to obtain backtrace it encountered the same type of stack 
more than once. The warning in itself isn't critical. It's a sanity check and when 
the check fails, it just stops walking the stack and it is likely that the warning 
is being triggered spuriously given that more machines which trigger these warnings 
continue to run fine than not, which is unlikely with actual stack corruptions. A 
possibility is that stack acquisition is happening from a context that the code 
can't handle.
However, on some machines, this caused #DF (double fault) and thus immediate crash. 
Backtracing a DF'd kernel is a bit cumbersome, so the following contains spurious 
entries, but it does show what happened:

PID: 80430  TASK: ffff888d92c62a80  CPU: 24  COMMAND: "25_scheduler"
#0 [fffffe00004cfd88] machine_kexec at ffffffff8104a646
#1 [fffffe00004cfdd8] __crash_kexec at ffffffff8114a82f
#2 [fffffe00004cfea0] panic at ffffffff810ba99c
#3 [fffffe00004cff20] df_debug at ffffffff8104e21d
#4 [fffffe00004cff30] do_double_fault at ffffffff8101c9b4
#5 [fffffe00004cff50] double_fault at ffffffff81c00b3e
[exception RIP: vsnprintf+14]
RIP: ffffffff81a3ee6e  RSP: fffffe00004d0ff8  RFLAGS: 00010082
RAX: fffffe00004d1070  RBX: fffffe00004d1101  RCX: fffffe00004d1050
RDX: ffffffff822c42f5  RSI: 000000007fffffff  RDI: fffffe00004d111a
RBP: fffffe00004d10a0   R8: 0000000000000067   R9: ffffffff82209d05
R10: ffffffff822a5fd0  R11: ffffffff822a6358  R12: 0000000000000019
R13: 0000000000000001  R14: 0000000000000019  R15: ffffffff822b5fd6
ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <DOUBLEFAULT exception stack> ---
#6 [fffffe00004d0ff8] vsnprintf at ffffffff81a3ee6e
bt: cannot transition from exception stack to current process stack:
exception stack pointer: fffffe00004cfd88
process stack pointer: fffffe00004d1048
current stack base: ffffc900241a0000

0xfffffe00004d1040 kallsyms_token_index
0xfffffe00004d1048 sprintf
0xfffffe00004d1078 kallsyms_lookup
0xfffffe00004d1098 kallsyms_names
0xfffffe00004d10a8 __sprint_symbol
0xfffffe00004d10d8 textbuf.47672
0xfffffe00004d10e0 always_kmsg_dump
0xfffffe00004d10f8 symbol_string
0xfffffe00004d11c0 textbuf.47672
0xfffffe00004d11e8 textbuf.47672
0xfffffe00004d11f8 textbuf.47672
0xfffffe00004d1200 always_kmsg_dump
0xfffffe00004d1210 kallsyms_token_index
0xfffffe00004d1218 vsnprintf
0xfffffe00004d1220 textbuf.47672
0xfffffe00004d1258 kallsyms_token_index
0xfffffe00004d1270 vscnprintf
0xfffffe00004d1280 vprintk_store
0xfffffe00004d1290 wake_up_klogd
0xfffffe00004d12b0 kallsyms_token_index
0xfffffe00004d12c8 vprintk_emit
0xfffffe00004d1300 entry_SYSCALL_64
0xfffffe00004d1318 vprintk_deferred
0xfffffe00004d1328 printk_deferred
0xfffffe00004d1360 entry_SYSCALL_64
0xfffffe00004d1380 __start_orc_unwind
0xfffffe00004d1388 unwind_next_frame.cold.7
0xfffffe00004d13c8 perf_callchain_kernel
0xfffffe00004d1418 entry_SYSCALL_64
0xfffffe00004d1450 get_perf_callchain
0xfffffe00004d14b0 bpf_get_stack
0xfffffe00004d1730 bpf_overflow_handler
0xfffffe00004d1788 __perf_event_overflow
0xfffffe00004d17a0 x86_pmu
0xfffffe00004d17b8 __intel_pmu_pebs_event
0xfffffe00004d17e0 setup_pebs_fixed_sample_data
0xfffffe00004d1890 entry_SYSCALL_64
0xfffffe00004d1ab8 intel_pmu_drain_pebs_nhm
0xfffffe00004d1ac0 setup_pebs_fixed_sample_data
0xfffffe00004d1bb8 handle_pmi_common
0xfffffe00004d1d00 insn_get_sib.part.5
0xfffffe00004d1d10 insn_get_displacement.part.6
0xfffffe00004d1d20 insn_get_immediate.part.7
0xfffffe00004d1d38 __set_pte_vaddr
0xfffffe00004d1d78 __native_set_fixmap
0xfffffe00004d1d80 ghes_copy_tofrom_phys
0xfffffe00004d1e08 intel_pmu_handle_irq
0xfffffe00004d1e48 perf_event_nmi_handler
0xfffffe00004d1e50 perf_event_nmi_handler_na.47078
0xfffffe00004d1e58 nmi_reason_lock
0xfffffe00004d1e60 nmi_handle
0xfffffe00004d1e70 nmi_desc
0xfffffe00004d1e90 nmi_reason_lock
0xfffffe00004d1eb8 default_do_nmi
0xfffffe00004d1ed8 do_nmi
0xfffffe00004d1ef0 nmi
0xfffffe00004d1f78 entry_SYSCALL_64
0xfffffe00004d1fa0 entry_SYSCALL_64
0xfffffe00004d1fd8 entry_SYSCALL_64

The kernel is running the normal NMI path and triggers the above mentioned warnings, 
which calls into string formatting path for the warning messages. However, NMIs run 
off of its dedicated exception stack which is 4k in size, which is too small for the 
full string formatting path to run, so it ends up overflowing the stack while running 
NMI and the machine enters an unrecoverable state.
"""

We found the issue in multiple versions of kernel, from 4.16 to 5.8-rc1. The exact 
WARNING message varies among different versions. Latest kernel put these warnings 
under WARN_ONCE(), so they are not flushing dmesg. 

We noticed that this only happens with precise_ip >= 2. This is caused by 
setup_pebs_fixed_sample_data() using pens->real_ip:

	set_linear_ip(regs, pebs->real_ip);

For our use case, we do need precise_ip=2. So we would like suggestions to fix the
warning and/or to avoid double fault. 

Please share your thoughts on this. 

Thanks in advance!
Song