linux-kernel - Re: [PATCH] lockdep: Fix inconsistency in irq tracking on NMIs

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <87sejup1fe.ffs@tglx>
Date: Fri, 20 Jun 2025 23:07:17 +0200
From: Thomas Gleixner <tglx@...utronix.de>
To: Gabriele Monaco <gmonaco@...hat.com>, linux-kernel@...r.kernel.org,
 Peter Zijlstra <peterz@...radead.org>, Andy Lutomirski <luto@...nel.org>,
 Ingo Molnar <mingo@...nel.org>
Cc: Gabriele Monaco <gmonaco@...hat.com>, Steven Rostedt
 <rostedt@...dmis.org>, Masami Hiramatsu <mhiramat@...nel.org>,
 linux-trace-kernel@...r.kernel.org
Subject: Re: [PATCH] lockdep: Fix inconsistency in irq tracking on NMIs

On Fri, Jun 20 2025 at 14:51, Gabriele Monaco wrote:

The subject prefix is misleading. The problem is not in lockdep, the
problem is in the generic entry NMI code, no?

> The irq_enable/irq_disable tracepoints fire only when there's an actual

Now it gets truly confusing. Above you say lockdep, now it's trace
points...

> transition (enabled->disabled and vice versa), this needs special care
> in NMIs, as they could potentially start with IRQs already disabled.

s/IRQs/interrupts/ please. This is not twatter.

> The current implementation takes care of this by tracking the lockdep
> state before the NMI (on nmi_entry) and not tracing on nmi_exit in case
> IRQs were already disabled, we don't trace on nmi_entry following the
> tracing_irq_cpu variable, which can be racy:

This sentence does not parse, especially the subordinate clause starting
with 'we' does not make sense.

> The error is visible with the sncid RV monitor and particularly likely
> on machines with the following setup:
> - x86 bare-metal with 40+ CPUs
> - tuned throughput-performance (activating regular perf NMIs)
> - workload: stress-ng --cpu-sched 21 --timer 11 --signal 11
>
> The presence of the RV monitor is useful to see the error but it is not
> necessary to trigger it.

Not sure whether this information is useful in the change log itself. It
can go into the notes section after the '---' seperator.

> @@ -326,13 +326,15 @@ irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs)
>  	irq_state.lockdep = lockdep_hardirqs_enabled();
>  
>  	__nmi_enter();
> -	lockdep_hardirqs_off(CALLER_ADDR0);
> +	if (irq_state.lockdep)
> +		lockdep_hardirqs_off(CALLER_ADDR0);

This avoids the lockdep call, which has nothing to do with your tracing
problem. lockdep already handles that case. So making this conditional
is a cosmetic noop, nothing else. It's actually questionable whether
this conditional makes sense performance wise. It only makes sense when
the amount of lockdep state == disabled instances is significant.
Otherwise it introduces an extra conditional into the hot path.

>  	lockdep_hardirq_enter();
>  	ct_nmi_enter();
>  
>  	instrumentation_begin();
>  	kmsan_unpoison_entry_regs(regs);
> -	trace_hardirqs_off_finish();
> +	if (irq_state.lockdep)
> +		trace_hardirqs_off_finish();

Now this is the real thing you are interested in, because otherwise you
end up with a trace_irqsoff() event without the corresponding
trace_irqson() event, right?

So in short what you are trying to explain in the change log is:

irqentry_nmi_enter() tracks the lockdep interrupt state on entry to
prevent a lockdep_hardirqs_on() invocation in the NMI exit path, when
lockdep was interrupted by the NMI before it could mark interrupts
enabled in the lockdep tracking. This works correctly, but a similar
problem exists for the local_irq_* tracepoints.

local_irq_enable() invokes trace_hardirqs_on(), which invokes
lockdep_hardirqs_on() after establishing the tracer state.

If the NMI hits before lockdep_hardirqs_on() has established the lockdep
enabled state, then unconditional trace_hardirqs_off_finish() invocation
in irqentry_nmi_enter() flips the tracer state to disabled.

In irqentry_nmi_exit() the counterpart trace_hardirqs_on_prepare() is
only invoked when the lockdep interrupt state on entry was 'enabled',
which means the trace lacks the corresponding interrupt enable entry.

Fix this by making the invocation of trace_hardirqs_off_finish() in
irqentry_nmi_enter() conditional on the lockdep state.

Right?

But you missed something completely here. This problem exists also when
lockdep is disabled and then it is way simpler to trigger. Why?

If lockdep is off, then lockdep_hardirqs_enabled() always returns
false. So any entry where the tracer state is 'enabled' will record a
off, but the exit will not record the on.

As you correctly described, the two states are asynchronous in the
context of NMIs, so the obvious thing is to treat them as seperate
entities so that the trace really contains the off/on events independent
of lockdep enablement and lockdep's internal state. No?

The only other sensible option is to remove the trace points from the
NMI code all together because with the lockdep conditional they are
inconsistent no matter what.

Aside of that you missed to fix the corresponding problem in the arm64
entry code, which still has it's own version of the generic entry code.

Thanks,

        tglx