linux-kernel - [PATCH V3 1/1] perf, x86: fix pebs warning by only restore active pmu in pmi

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-Id: <1457046448-6184-1-git-send-email-kan.liang@intel.com>
Date:	Thu,  3 Mar 2016 18:07:28 -0500
From:	kan.liang@...el.com
To:	peterz@...radead.org, linux-kernel@...r.kernel.org
Cc:	ak@...ux.intel.com, eranian@...gle.com,
	alexander.shishkin@...ux.intel.com, vincent.weaver@...ne.edu,
	Kan Liang <kan.liang@...el.com>
Subject: [PATCH V3 1/1] perf, x86: fix pebs warning by only restore active pmu in pmi

From: Kan Liang <kan.liang@...el.com>

This patch tries to fix a pebs warning found in my stress test.
The following perf command can easily trigger the pebs warning or
spurious NMI error on Skylake/Broadwell/Haswell platforms.

sudo perf record -e
'cpu/umask=0x04,event=0xc4/pp,cycles,branches,ref-cycles,cache-misses,
cache-references' --call-graph fp -b -c1000 -a
Also NMI watchdog must be enabled.

For this case, the events number is larger than counter number. So
perf has to do multiplexing. In perf_mux_hrtimer_handler, it does
perf_pmu_disable, schedule out old events, rotate_ctx, schedule in new
events and finally perf_pmu_enable. If the old events include precise
event, the MSR_IA32_PEBS_ENABLE should be cleared when perf_pmu_disable.
The MSR_IA32_PEBS_ENABLE should keep 0 until the perf_pmu_enable is
called and the new event is precise event.
However, there is a corner case which could restore PEBS_ENABLE to stale
value during the above period. In perf_pmu_disable, GLOBAL_CTRL will be
set to 0 to stop overflow and followed PMI. But there may be pending PMI
from an earlier overflow, which cannot be stopped. So even GLOBAL_CTRL
is cleared, the kernel still be possible to get PMI. At the end of the
PMI handler, __intel_pmu_enable_all will be called, which will restore
the stale values if old events haven't scheduled out.
Once the stale pebs value is set, it's impossible to be corrected if the
new events are non-precise. Because the pebs_enabled will be set to 0.
x86_pmu.enable_all will ignore the MSR_IA32_PEBS_ENABLE setting.
As a result, the following NMI with stale PEBS_ENABLE trigger pebs
warning.

The pending PMI after enabled=0 will become harmless if the NMI handler
does not change the state. This patch checks cpuc->enabled in pmi and
only restore the state when pmu is active.

Here is the dump.

[  113.452176] Call Trace:
[  113.452178]  <NMI>  [<ffffffff813c3a2e>] dump_stack+0x63/0x85
[  113.452188]  [<ffffffff810a46f2>] warn_slowpath_common+0x82/0xc0
[  113.452190]  [<ffffffff810a483a>] warn_slowpath_null+0x1a/0x20
[  113.452193]  [<ffffffff8100fe2e>]
intel_pmu_drain_pebs_nhm+0x2be/0x320
[  113.452197]  [<ffffffff8100caa9>] intel_pmu_handle_irq+0x279/0x460
[  113.452204]  [<ffffffff810639b6>] ? native_write_msr_safe+0x6/0x40
[  113.452208]  [<ffffffff811f290d>] ? vunmap_page_range+0x20d/0x330
[  113.452211]  [<ffffffff811f2f11>] ?
unmap_kernel_range_noflush+0x11/0x20
[  113.452216]  [<ffffffff8148379f>] ? ghes_copy_tofrom_phys+0x10f/0x2a0
[  113.452218]  [<ffffffff814839c8>] ? ghes_read_estatus+0x98/0x170
[  113.452224]  [<ffffffff81005a7d>] perf_event_nmi_handler+0x2d/0x50
[  113.452230]  [<ffffffff810310b9>] nmi_handle+0x69/0x120
[  113.452233]  [<ffffffff810316f6>] default_do_nmi+0xe6/0x100
[  113.452236]  [<ffffffff810317f2>] do_nmi+0xe2/0x130
[  113.452240]  [<ffffffff817aea71>] end_repeat_nmi+0x1a/0x1e
[  113.452243]  [<ffffffff810639b6>] ? native_write_msr_safe+0x6/0x40
[  113.452246]  [<ffffffff810639b6>] ? native_write_msr_safe+0x6/0x40
[  113.452249]  [<ffffffff810639b6>] ? native_write_msr_safe+0x6/0x40
[  113.452250]  <<EOE>>  <IRQ>  [<ffffffff81006df8>] ?
x86_perf_event_set_period+0xd8/0x180
[  113.452255]  [<ffffffff81006eec>] x86_pmu_start+0x4c/0x100
[  113.452258]  [<ffffffff8100722d>] x86_pmu_enable+0x28d/0x300
[  113.452263]  [<ffffffff811994d7>] perf_pmu_enable.part.81+0x7/0x10
[  113.452267]  [<ffffffff8119cb70>]
perf_mux_hrtimer_handler+0x200/0x280
[  113.452270]  [<ffffffff8119c970>] ?
__perf_install_in_context+0xc0/0xc0
[  113.452273]  [<ffffffff8110f92d>] __hrtimer_run_queues+0xfd/0x280
[  113.452276]  [<ffffffff811100d8>] hrtimer_interrupt+0xa8/0x190
[  113.452278]  [<ffffffff81199080>] ?
__perf_read_group_add.part.61+0x1a0/0x1a0
[  113.452283]  [<ffffffff81051bd8>]
local_apic_timer_interrupt+0x38/0x60
[  113.452286]  [<ffffffff817af01d>] smp_apic_timer_interrupt+0x3d/0x50
[  113.452290]  [<ffffffff817ad15c>] apic_timer_interrupt+0x8c/0xa0
[  113.452291]  <EOI>  [<ffffffff81199080>] ?
__perf_read_group_add.part.61+0x1a0/0x1a0
[  113.452298]  [<ffffffff81123de5>] ?
smp_call_function_single+0xd5/0x130
[  113.452300]  [<ffffffff81123ddb>] ?
smp_call_function_single+0xcb/0x130
[  113.452303]  [<ffffffff81199080>] ?
__perf_read_group_add.part.61+0x1a0/0x1a0
[  113.452306]  [<ffffffff8119765a>] event_function_call+0x10a/0x120
[  113.452308]  [<ffffffff8119c660>] ? ctx_resched+0x90/0x90
[  113.452311]  [<ffffffff811971e0>] ? cpu_clock_event_read+0x30/0x30
[  113.452313]  [<ffffffff811976d0>] ? _perf_event_disable+0x60/0x60
[  113.452315]  [<ffffffff8119772b>] _perf_event_enable+0x5b/0x70
[  113.452318]  [<ffffffff81197388>] perf_event_for_each_child+0x38/0xa0
[  113.452320]  [<ffffffff811976d0>] ? _perf_event_disable+0x60/0x60
[  113.452322]  [<ffffffff811a0ffd>] perf_ioctl+0x12d/0x3c0
[  113.452326]  [<ffffffff8134d855>] ? selinux_file_ioctl+0x95/0x1e0
[  113.452330]  [<ffffffff8124a3a1>] do_vfs_ioctl+0xa1/0x5a0
[  113.452334]  [<ffffffff81036d29>] ? sched_clock+0x9/0x10
[  113.452336]  [<ffffffff8124a919>] SyS_ioctl+0x79/0x90
[  113.452338]  [<ffffffff817ac4b2>] entry_SYSCALL_64_fastpath+0x1a/0xa4
[  113.452340] ---[ end trace aef202839fe9a71d ]---
[  113.452611] Uhhuh. NMI received for unknown reason 2d on CPU 2.
[  113.453880] Do you have a strange power saving mode enabled?

Signed-off-by: Kan Liang <kan.liang@...el.com>
---

Changes since V2:
 - Add more comments for __intel_pmu_disable_all and x86_pmu_disable
 - DO the same fix for KNC driver (Not Test)

Changes since V1:
 - Update the corner case in changelog
 - Drop the way of always update msr when disabling pebs event

 arch/x86/events/core.c       | 12 ++++++++++++
 arch/x86/events/intel/core.c | 11 ++++++++++-
 arch/x86/events/intel/knc.c  |  4 +++-
 3 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 7402c818..af9a224 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -597,6 +597,18 @@ void x86_pmu_disable_all(void)
 	}
 }
 
+/*
+ * There may be PMI landing after enabled=0. The PMI hitting
+ * could be before or after disable_all.
+ * If PMI hits before disable_all, the pmu will be disabled
+ * in NMI handler. It will not be re-enabled in NMI handler
+ * again, because enabled=0. After handling the NMI, disable_all
+ * will be called, which will not change the state either.
+ * If PMI hits after disable_all, the pmu is already disabled
+ * before entering NMI handler. The NMI handler will not change
+ * the state either.
+ * So either situation is harmless.
+ */
 static void x86_pmu_disable(struct pmu *pmu)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index a7ec685..44e1c01 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -1503,6 +1503,13 @@ static __initconst const u64 knl_hw_cache_extra_regs
 
 /*
  * Use from PMIs where the LBRs are already disabled.
+ *
+ * This function could be called consecutively. It is required to
+ * remain in disable state if called consecutively.
+ * During consecutive calls, the same disable value will be written
+ * to related registers, so the pmu state remains unchanged.
+ * hw.state in intel_bts_disable_local will remain PERF_HES_STOPPED
+ * either in consecutive calls.
  */
 static void __intel_pmu_disable_all(void)
 {
@@ -1929,7 +1936,9 @@ again:
 		goto again;
 
 done:
-	__intel_pmu_enable_all(0, true);
+	/* Only restore pmu state when it's active. */
+	if (cpuc->enabled != 0)
+		__intel_pmu_enable_all(0, true);
 	/*
 	 * Only unmask the NMI after the overflow counters
 	 * have been reset. This avoids spurious NMIs on
diff --git a/arch/x86/events/intel/knc.c b/arch/x86/events/intel/knc.c
index 206226e..a74eb15 100644
--- a/arch/x86/events/intel/knc.c
+++ b/arch/x86/events/intel/knc.c
@@ -263,7 +263,9 @@ again:
 		goto again;
 
 done:
-	knc_pmu_enable_all(0);
+	/* Only restore pmu state when it's active. */
+	if (cpuc->enabled != 0)
+		knc_pmu_enable_all(0);
 
 	return handled;
 }
-- 
2.4.3