linux-kernel - [BUG] Core2 cpu triggers hard lockup with perf test

Message-ID: <20160227123636.GB30858@krava.redhat.com>
Date:	Sat, 27 Feb 2016 13:37:01 +0100
From:	Jiri Olsa <jolsa@...hat.com>
To:	Arnaldo Carvalho de Melo <acme@...nel.org>,
	Ingo Molnar <mingo@...nel.org>,
	Peter Zijlstra <peterz@...radead.org>,
	Andi Kleen <andi@...stfloor.org>,
	Stephane Eranian <eranian@...gle.com>,
	Wang Nan <wangnan0@...wei.com>, zheng.z.yan@...el.com,
	Kan Liang <kan.liang@...el.com>
Cc:	LKML <linux-kernel@...r.kernel.org>
Subject: [BUG] Core2 cpu triggers hard lockup with perf test

hi,
we are getting hard lockups on Core2 CPUs (model 23)
just by running 'perf test'.

PID: 10425  TASK: ffff880068562e00  CPU: 3   COMMAND: "perf"
 #0 [ffff88007d985a08] machine_kexec at ffffffff8105521b
 #1 [ffff88007d985a68] crash_kexec at ffffffff810f7412
 #2 [ffff88007d985b38] panic at ffffffff8163c031
 #3 [ffff88007d985bb8] watchdog_overflow_callback at ffffffff81120472
 #4 [ffff88007d985bc8] __perf_event_overflow at ffffffff81164e0e
 #5 [ffff88007d985c00] perf_event_overflow at ffffffff81165a44
 #6 [ffff88007d985c10] intel_pmu_handle_irq at ffffffff81033198
 #7 [ffff88007d985e60] perf_event_nmi_handler at ffffffff8164be8b
 #8 [ffff88007d985e80] nmi_handle at ffffffff8164b5d9
 #9 [ffff88007d985ec8] do_nmi at ffffffff8164b789
#10 [ffff88007d985ef0] end_repeat_nmi at ffffffff8164aa13
    [exception RIP: intel_pmu_enable_all+17]
    RIP: ffffffff81032301  RSP: ffff88005e917c98  RFLAGS: 00000046
    RAX: ffff88007d98cd20  RBX: ffff88005e991000  RCX: 000000000000038f
    RDX: 0000000000000007  RSI: 0000000000000003  RDI: 0000000000000000
    RBP: ffff88005e917cd8   R8: ffffffffffffff85   R9: 000000ffffffffff
    R10: ffff88007d98c100  R11: ffff88005e9179e0  R12: ffff88007d98bd10
    R13: ffff88007d98b9e0  R14: ffff88007d98bc08  R15: 0000000000000002
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
#11 [ffff88005e917c98] intel_pmu_enable_all at ffffffff81032301
#12 [ffff88005e917c98] x86_pmu_enable at ffffffff8102ba24
#13 [ffff88005e917ce0] perf_pmu_enable at ffffffff81160457
#14 [ffff88005e917cf0] perf_event_context_sched_in at ffffffff81161930
#15 [ffff88005e917d20] perf_event_exec at ffffffff811621db
#16 [ffff88005e917d68] setup_new_exec at ffffffff811edffd
#17 [ffff88005e917d88] load_elf_binary at ffffffff81240ed9
#18 [ffff88005e917e58] search_binary_handler at ffffffff811ec89d
#19 [ffff88005e917ea0] do_execve_common at ffffffff811ede04
#20 [ffff88005e917f30] sys_execve at ffffffff811ee199
#21 [ffff88005e917f50] stub_execve at ffffffff816531a9

The reproducer seems to be a hw event with a very small
period, for example (thanks Arnaldo ;-):
  perf record -e cycles -c 123 kill
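
For anyone who wants to poke at this without the perf tool, here is a
minimal sketch of roughly what that command line asks the kernel for: a
hardware cycles event with a fixed sample period of 123 that is enabled
at exec time. The helper names make_cycles_attr() and
open_cycles_event() are hypothetical, and the attr only mirrors the
command-line bits (perf record sets more fields than this), so treat it
as an assumption about the reproducer rather than what perf itself does.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: build an attr for a fixed-period cycles event,
 * approximating "-e cycles -c <period>".
 */
struct perf_event_attr make_cycles_attr(uint64_t period)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_CPU_CYCLES;
	attr.sample_period = period;	/* -c 123: overflow every 123 cycles */
	attr.sample_type = PERF_SAMPLE_IP;
	attr.disabled = 1;		/* start disabled ... */
	attr.enable_on_exec = 1;	/* ... and get enabled on exec */
	return attr;
}

/*
 * Hypothetical helper: glibc provides no perf_event_open() wrapper, so
 * go through syscall(2). Returns an event fd, or -1 with errno set.
 */
int open_cycles_event(pid_t pid, uint64_t period)
{
	struct perf_event_attr attr = make_cycles_attr(period);

	return (int)syscall(SYS_perf_event_open, &attr, pid,
			    -1 /* any cpu */, -1 /* no group */, 0UL);
}
```

Calling open_cycles_event(pid, 123) on a task that is about to exec
should exercise the same enable-on-exec path that shows up as
perf_event_exec in the trace above. I have not checked that this alone
is enough to reproduce the lockup, and on an affected machine it may
hang the box.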

I bisected it down to:
  156174999dd1 perf/intel/x86: Enlarge the PEBS buffer

It looks like the bigger PEBS buffer, together with the event being
marked as PERF_X86_EVENT_FREERUNNING, stalls the CPU right after the
event is enabled, before it can reach local_irq_enable, and that
triggers the NMI watchdog.

I can't find what's special about the Core2 CPU PEBS setup;
it seems that other CPUs are ok (tried on ivb/snb/hsw).

Reverting 156174999dd1 fixed the issue for me.

ideas? thanks,
jirka