linux-kernel - [PATCH] lockdep: Fix inconsistency in irq tracking on NMIs

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [thread-next>] [day] [month] [year] [list]

Message-ID: <20250620125112.33978-2-gmonaco@redhat.com>
Date: Fri, 20 Jun 2025 14:51:13 +0200
From: Gabriele Monaco <gmonaco@...hat.com>
To: linux-kernel@...r.kernel.org,
	Thomas Gleixner <tglx@...utronix.de>,
	Peter Zijlstra <peterz@...radead.org>,
	Andy Lutomirski <luto@...nel.org>,
	Ingo Molnar <mingo@...nel.org>
Cc: Gabriele Monaco <gmonaco@...hat.com>,
	Steven Rostedt <rostedt@...dmis.org>,
	Masami Hiramatsu <mhiramat@...nel.org>,
	linux-trace-kernel@...r.kernel.org
Subject: [PATCH] lockdep: Fix inconsistency in irq tracking on NMIs

The irq_enable/irq_disable tracepoints fire only when there's an actual
transition (enabled->disabled and vice versa), this needs special care
in NMIs, as they could potentially start with IRQs already disabled.
The current implementation takes care of this by tracking the lockdep
state before the NMI (on nmi_entry) and not tracing on nmi_exit in case
IRQs were already disabled, we don't trace on nmi_entry following the
tracing_irq_cpu variable, which can be racy:

 local_irq_enable()
   void trace_hardirqs_on(void)
   {
   	if (tracing_irq_cpu) {
   		trace(irq_enable);
   		tracing_irq_cpu = 0;
   	}

                /*
                 * NMI here
                 * tracing_irq_cpu == 0 (done tracing)
                 * lockdep_hardirqs_enabled == 0 (IRQs still disabled)
                 */

                                   irqentry_nmi_enter()
                                       irq_state.lockdep = 0
                                       trace(irq_disable);

                                   irqentry_nmi_exit()
                                       // irq_state.lockdep == 0
                                       // do not trace(irq_enable)
   	lockdep_hardirqs_on();
   }

The error is visible with the sncid RV monitor and particularly likely
on machines with the following setup:
- x86 bare-metal with 40+ CPUs
- tuned throughput-performance (activating regular perf NMIs)
- workload: stress-ng --cpu-sched 21 --timer 11 --signal 11

The presence of the RV monitor is useful to see the error but it is not
necessary to trigger it.

Prevent this scenario by checking lockdep_hardirqs_enabled to trace also
on nmi_entry.

Fixes: ba1f2b2eaa2a ("x86/entry: Fix NMI vs IRQ state tracking")
Cc: Steven Rostedt <rostedt@...dmis.org>
Cc: Masami Hiramatsu <mhiramat@...nel.org>
Cc: linux-trace-kernel@...r.kernel.org
Signed-off-by: Gabriele Monaco <gmonaco@...hat.com>
---
 kernel/entry/common.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index a8dd1f27417cf..7369132c00ba4 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -326,13 +326,15 @@ irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs)
 	irq_state.lockdep = lockdep_hardirqs_enabled();
 
 	__nmi_enter();
-	lockdep_hardirqs_off(CALLER_ADDR0);
+	if (irq_state.lockdep)
+		lockdep_hardirqs_off(CALLER_ADDR0);
 	lockdep_hardirq_enter();
 	ct_nmi_enter();
 
 	instrumentation_begin();
 	kmsan_unpoison_entry_regs(regs);
-	trace_hardirqs_off_finish();
+	if (irq_state.lockdep)
+		trace_hardirqs_off_finish();
 	ftrace_nmi_enter();
 	instrumentation_end();
 

base-commit: 75f5f23f8787c5e184fcb2fbcd02d8e9317dc5e7
-- 
2.49.0