Message-ID: <20250625120823.60600-1-gmonaco@redhat.com>
Date: Wed, 25 Jun 2025 14:08:22 +0200
From: Gabriele Monaco <gmonaco@...hat.com>
To: linux-kernel@...r.kernel.org,
	Catalin Marinas <catalin.marinas@....com>,
	Will Deacon <will@...nel.org>,
	Thomas Gleixner <tglx@...utronix.de>,
	Peter Zijlstra <peterz@...radead.org>,
	Andy Lutomirski <luto@...nel.org>,
	Steven Rostedt <rostedt@...dmis.org>,
	Masami Hiramatsu <mhiramat@...nel.org>,
	Ingo Molnar <mingo@...nel.org>,
	Mark Rutland <mark.rutland@....com>,
	linux-arm-kernel@...ts.infradead.org,
	linux-trace-kernel@...r.kernel.org
Cc: Gabriele Monaco <gmonaco@...hat.com>
Subject: [PATCH v2] tracing: Fix inconsistency in irq tracking on NMIs

The irq_enable/irq_disable tracepoints fire only when there's an actual
transition (enabled->disabled and vice versa); this needs special care
in NMIs, as they can potentially start with interrupts already disabled.
The current implementation takes care of this by tracking the lockdep
state on nmi_entry as well as using the variable tracing_irq_cpu to
synchronise with other calls (e.g. local_irq_disable/enable).

This can be racy in case of NMIs when lockdep is enabled, and can lead
to missing events when lockdep is disabled.
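For instance, the old read-then-write update of tracing_irq_cpu leaves
a window for an NMI between the check and the write (one plausible
interleaving, reconstructed from the code touched below):

  task: trace_hardirqs_off()
          lockdep_hardirqs_off()
          this_cpu_read(tracing_irq_cpu) -> 0
        <NMI>
          trace_hardirqs_off_finish()
            this_cpu_read(tracing_irq_cpu) -> 0
            this_cpu_write(tracing_irq_cpu, 1)
            fires irq_disable
          NMI exit: lockdep already saw irqs off -> no irq_enable
        </NMI>
          this_cpu_write(tracing_irq_cpu, 1)
          fires irq_disable again -> two irq_disable, no irq_enable between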

Remove dependency on the lockdep status in the NMI common entry/exit
code and adapt the tracing code to make sure that:

- The first call disabling interrupts fires the tracepoint
- The first non-NMI call enabling interrupts fires the tracepoint
- The last NMI call enabling interrupts fires the tracepoint unless
  interrupts were disabled before the NMI
- All other calls don't fire (see the sketch below)
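
For illustration, a rough userspace model of these four rules (a
sketch, not the kernel code; the plain counter and the nmi_context
flag stand in for the per-CPU tracing_irq_cpu and for in_nmi()):

  #include <stdio.h>
  #include <stdbool.h>

  static int tracing_irq_cpu;	/* models the per-CPU tracing_irq_cpu */
  static bool nmi_context;	/* models in_nmi() */

  static void hardirqs_off(void)
  {
  	/* Only the first disable (0 -> 1) fires the tracepoint. */
  	if (++tracing_irq_cpu == 1)
  		printf("irq_disable\n");
  }

  static void hardirqs_on(void)
  {
  	/*
  	 * A nested enable in NMI context only undoes its own increment;
  	 * otherwise this is the last enable and the tracepoint fires.
  	 */
  	if (nmi_context && tracing_irq_cpu > 1)
  		tracing_irq_cpu--;
  	else if (tracing_irq_cpu) {
  		printf("irq_enable\n");
  		tracing_irq_cpu = 0;
  	}
  }

  int main(void)
  {
  	/* NMI interrupting a section that already disabled interrupts: */
  	hardirqs_off();		/* task: fires irq_disable   */
  	nmi_context = true;
  	hardirqs_off();		/* NMI entry: nested, silent */
  	hardirqs_on();		/* NMI exit: nested, silent  */
  	nmi_context = false;
  	hardirqs_on();		/* task: fires irq_enable    */
  	return 0;
  }

Running the model over the NMI-inside-disabled-section case prints
exactly one irq_disable/irq_enable pair, as required.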

Fixes: ba1f2b2eaa2a ("x86/entry: Fix NMI vs IRQ state tracking")
Fixes: f0cd5ac1e4c5 ("arm64: entry: fix NMI {user, kernel}->kernel transitions")
Signed-off-by: Gabriele Monaco <gmonaco@...hat.com>
---

The inconsistency is visible with the sncid RV monitor and particularly
likely on machines with the following setup:
- x86 bare-metal with 40+ CPUs
- tuned throughput-performance (activating regular perf NMIs)
- workload: stress-ng --cpu-sched 21 --timer 11 --signal 11

The RV monitor makes the error easy to observe, but it is not
necessary to trigger it.

Changes since V1:
* Reword confusing changelog
* Remove dependency on lockdep counters for tracepoints
* Ensure we don't drop valid tracepoints
* Extend change to arm64 code

 arch/arm64/kernel/entry-common.c |  5 ++---
 kernel/entry/common.c            |  5 ++---
 kernel/trace/trace_preemptirq.c  | 12 +++++++-----
 3 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/arch/arm64/kernel/entry-common.c b/arch/arm64/kernel/entry-common.c
index 7c1970b341b8c..7f1844123642e 100644
--- a/arch/arm64/kernel/entry-common.c
+++ b/arch/arm64/kernel/entry-common.c
@@ -213,10 +213,9 @@ static void noinstr arm64_exit_nmi(struct pt_regs *regs)
 	bool restore = regs->lockdep_hardirqs;
 
 	ftrace_nmi_exit();
-	if (restore) {
-		trace_hardirqs_on_prepare();
+	trace_hardirqs_on_prepare();
+	if (restore)
 		lockdep_hardirqs_on_prepare();
-	}
 
 	ct_nmi_exit();
 	lockdep_hardirq_exit();
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index a8dd1f27417cf..e234f264fb495 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -343,10 +343,9 @@ void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t irq_state)
 {
 	instrumentation_begin();
 	ftrace_nmi_exit();
-	if (irq_state.lockdep) {
-		trace_hardirqs_on_prepare();
+	trace_hardirqs_on_prepare();
+	if (irq_state.lockdep)
 		lockdep_hardirqs_on_prepare();
-	}
 	instrumentation_end();
 
 	ct_nmi_exit();
diff --git a/kernel/trace/trace_preemptirq.c b/kernel/trace/trace_preemptirq.c
index 0c42b15c38004..fa45474fc54f1 100644
--- a/kernel/trace/trace_preemptirq.c
+++ b/kernel/trace/trace_preemptirq.c
@@ -58,7 +58,11 @@ static DEFINE_PER_CPU(int, tracing_irq_cpu);
  */
 void trace_hardirqs_on_prepare(void)
 {
-	if (this_cpu_read(tracing_irq_cpu)) {
+	int tracing_count = this_cpu_read(tracing_irq_cpu);
+
+	if (in_nmi() && tracing_count > 1)
+		this_cpu_dec(tracing_irq_cpu);
+	else if (tracing_count) {
 		trace(irq_enable, TP_ARGS(CALLER_ADDR0, CALLER_ADDR1));
 		tracer_hardirqs_on(CALLER_ADDR0, CALLER_ADDR1);
 		this_cpu_write(tracing_irq_cpu, 0);
@@ -89,8 +93,7 @@ NOKPROBE_SYMBOL(trace_hardirqs_on);
  */
 void trace_hardirqs_off_finish(void)
 {
-	if (!this_cpu_read(tracing_irq_cpu)) {
-		this_cpu_write(tracing_irq_cpu, 1);
+	if (this_cpu_inc_return(tracing_irq_cpu) == 1) {
 		tracer_hardirqs_off(CALLER_ADDR0, CALLER_ADDR1);
 		trace(irq_disable, TP_ARGS(CALLER_ADDR0, CALLER_ADDR1));
 	}
@@ -103,8 +106,7 @@ void trace_hardirqs_off(void)
 {
 	lockdep_hardirqs_off(CALLER_ADDR0);
 
-	if (!this_cpu_read(tracing_irq_cpu)) {
-		this_cpu_write(tracing_irq_cpu, 1);
+	if (this_cpu_inc_return(tracing_irq_cpu) == 1) {
 		tracer_hardirqs_off(CALLER_ADDR0, CALLER_ADDR1);
 		trace(irq_disable, TP_ARGS(CALLER_ADDR0, CALLER_ADDR1));
 	}

base-commit: 78f4e737a53e1163ded2687a922fce138aee73f5
-- 
2.49.0

