Message-ID: <5097b17cc506af60ca718aba5a0a10e0fda01440.camel@redhat.com>
Date: Wed, 02 Jul 2025 09:18:32 +0200
From: Gabriele Monaco <gmonaco@...hat.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: linux-kernel@...r.kernel.org, Catalin Marinas <catalin.marinas@....com>,
 Will Deacon <will@...nel.org>, Thomas Gleixner <tglx@...utronix.de>,
 Andy Lutomirski <luto@...nel.org>, Steven Rostedt <rostedt@...dmis.org>,
 Masami Hiramatsu <mhiramat@...nel.org>, Ingo Molnar <mingo@...nel.org>,
 Mark Rutland <mark.rutland@....com>, linux-arm-kernel@...ts.infradead.org,
 linux-trace-kernel@...r.kernel.org
Subject: Re: [PATCH v2] tracing: Fix inconsistency in irq tracking on NMIs



On Tue, 2025-07-01 at 14:54 +0200, Peter Zijlstra wrote:
> On Wed, Jun 25, 2025 at 02:08:22PM +0200, Gabriele Monaco wrote:
> > The irq_enable/irq_disable tracepoints fire only when there's an
> > actual transition (enabled->disabled and vice versa); this needs
> > special care in NMIs, as they can potentially start with interrupts
> > already disabled. The current implementation takes care of this by
> > tracking the lockdep state on nmi_entry as well as using the
> > variable tracing_irq_cpu to synchronise with other calls (e.g.
> > local_irq_disable/enable).
> > 
> > This can be racy in the case of NMIs when lockdep is enabled, and
> > can lead to missing events when lockdep is disabled.
> > 
> > Remove the dependency on the lockdep status in the NMI common
> > entry/exit code and adapt the tracing code to make sure that:
> > 
> > - The first call disabling interrupts fires the tracepoint
> > - The first non-NMI call enabling interrupts fires the tracepoint
> > - The last NMI call enabling interrupts fires the tracepoint unless
> >   interrupts were disabled before the NMI
> > - All other calls don't fire
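
[To make these rules concrete: an NMI hitting while interrupts are
enabled traces irq_disable on entry and irq_enable on exit, while an
NMI hitting with interrupts already disabled traces nothing at all:

	local_irq_disable -> fires irq_disable  (first call disabling)
	  nmi_enter       -> no trace
	  nmi_exit        -> no trace           (irqs were off before the NMI)
	local_irq_enable  -> fires irq_enable   (first non-NMI call enabling)
]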
> 
> I'm not at all convinced this is correct. Nor can I understand
> anything much about what you wrote above.
> 
> 
> >  arch/arm64/kernel/entry-common.c |  5 ++---
> >  kernel/entry/common.c            |  5 ++---
> >  kernel/trace/trace_preemptirq.c  | 12 +++++++-----
> >  3 files changed, 11 insertions(+), 11 deletions(-)
> > 
> > diff --git a/arch/arm64/kernel/entry-common.c b/arch/arm64/kernel/entry-common.c
> > index 7c1970b341b8c..7f1844123642e 100644
> > --- a/arch/arm64/kernel/entry-common.c
> > +++ b/arch/arm64/kernel/entry-common.c
> > @@ -213,10 +213,9 @@ static void noinstr arm64_exit_nmi(struct pt_regs *regs)
> >  	bool restore = regs->lockdep_hardirqs;
> >  
> >  	ftrace_nmi_exit();
> > -	if (restore) {
> > -		trace_hardirqs_on_prepare();
> > +	trace_hardirqs_on_prepare();
> > +	if (restore)
> >  		lockdep_hardirqs_on_prepare();
> > -	}
> >  
> >  	ct_nmi_exit();
> >  	lockdep_hardirq_exit();
> > diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> > index a8dd1f27417cf..e234f264fb495 100644
> > --- a/kernel/entry/common.c
> > +++ b/kernel/entry/common.c
> > @@ -343,10 +343,9 @@ void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t irq_state)
> >  {
> >  	instrumentation_begin();
> >  	ftrace_nmi_exit();
> > -	if (irq_state.lockdep) {
> > -		trace_hardirqs_on_prepare();
> > +	trace_hardirqs_on_prepare();
> > +	if (irq_state.lockdep)
> >  		lockdep_hardirqs_on_prepare();
> > -	}
> >  	instrumentation_end();
> >  
> >  	ct_nmi_exit();
> > diff --git a/kernel/trace/trace_preemptirq.c b/kernel/trace/trace_preemptirq.c
> > index 0c42b15c38004..fa45474fc54f1 100644
> > --- a/kernel/trace/trace_preemptirq.c
> > +++ b/kernel/trace/trace_preemptirq.c
> > @@ -58,7 +58,11 @@ static DEFINE_PER_CPU(int, tracing_irq_cpu);
> >   */
> >  void trace_hardirqs_on_prepare(void)
> >  {
> > -	if (this_cpu_read(tracing_irq_cpu)) {
> > +	int tracing_count = this_cpu_read(tracing_irq_cpu);
> > +
> > +	if (in_nmi() && tracing_count > 1)
> > +		this_cpu_dec(tracing_irq_cpu);
> > +	else if (tracing_count) {
> >  		trace(irq_enable, TP_ARGS(CALLER_ADDR0, CALLER_ADDR1));
> >  		tracer_hardirqs_on(CALLER_ADDR0, CALLER_ADDR1);
> >  		this_cpu_write(tracing_irq_cpu, 0);
> > @@ -89,8 +93,7 @@ NOKPROBE_SYMBOL(trace_hardirqs_on);
> >   */
> >  void trace_hardirqs_off_finish(void)
> >  {
> > -	if (!this_cpu_read(tracing_irq_cpu)) {
> > -		this_cpu_write(tracing_irq_cpu, 1);
> > +	if (this_cpu_inc_return(tracing_irq_cpu) == 1) {
> >  		tracer_hardirqs_off(CALLER_ADDR0, CALLER_ADDR1);
> >  		trace(irq_disable, TP_ARGS(CALLER_ADDR0, CALLER_ADDR1));
> >  	}
> > @@ -103,8 +106,7 @@ void trace_hardirqs_off(void)
> >  {
> >  	lockdep_hardirqs_off(CALLER_ADDR0);
> >  
> > -	if (!this_cpu_read(tracing_irq_cpu)) {
> > -		this_cpu_write(tracing_irq_cpu, 1);
> > +	if (this_cpu_inc_return(tracing_irq_cpu) == 1) {
> >  		tracer_hardirqs_off(CALLER_ADDR0, CALLER_ADDR1);
> >  		trace(irq_disable, TP_ARGS(CALLER_ADDR0, CALLER_ADDR1));
> >  	}
> 
> So what about lovely things like:
> 
> trace_hardirqs_on_prepare()
>   if (tracing_irq_cpu) {
>     tracer_hardirqs_on();
>     <NMI>
>       trace_hardirqs_off_finish();
>         if (this_cpu_inc_return() == 1) // will be > 1
> 
> 
> So now we've traced IRQs are on, start an NMI, and lose the IRQs off
> event. Well done.
> 
> 
> This was all safe in that it would occasionally emit a duplicate
> state, but no state was wrong. Both your attempts have broken things.
> 
> How about you fix your tool to accept duplicate state in the face of
> NMI instead?

Alright, I get this is yet another broken solution. Thanks for
pointing it out.

The problem here is that, from the model's perspective, we not only
get duplicated events but also missing ones:

 d..1. ..932: irq_enable:   caller=finish_task_switch
 d.h3. ..933: event_sncid:  can_sched x irq_disable -> cant_sched
 d.h2. ..933: irq_disable:  caller=irqentry_nmi_enter+0x53 
 d.h2. ..936: nmi_handler:  perf_event_nmi_handler()
< missing irq_enable from NMI >
 ...2. ..942: error_sncid:  event schedule_exit not expected in the state cant_sched

[This monitor verifies that exiting and entering __schedule occur only
with interrupts enabled.]
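
For context, the constraint boils down to a two-state machine. A
minimal user-space model of it (illustrative only, not the actual
monitor code):

	#include <stdio.h>

	/* schedule events are only legal while interrupts are
	 * believed to be enabled */
	static enum { CAN_SCHED, CANT_SCHED } state = CAN_SCHED;

	static void on_irq_disable(void) { state = CANT_SCHED; }
	static void on_irq_enable(void)  { state = CAN_SCHED; }

	static void on_schedule(const char *ev)
	{
		if (state == CANT_SCHED)
			printf("error: event %s not expected in state cant_sched\n", ev);
	}

A missed irq_enable leaves this model stuck in cant_sched, which is
exactly the error above.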

I first thought this was only possible with racing NMIs, but as Thomas
pointed out [1], it can happen much more easily when irqsoff is
enabled but lockdep is not.

I could probably fix this without even considering NMIs that run while
interrupts are already disabled, but I believe the tracepoints would
then report something wrong, since using tracing_irq_cpu alone gives:

local_irq_disable -> trace_irq_off
  nmi_enter -> no trace
  nmi_exit -> trace_irq_on
  // here interrupts are still off, aren't they?
local_irq_enable -> no trace

The idea, which I described poorly, was to use tracing_irq_cpu in such
a way that the first context disabling interrupts fires the tracepoint
(the current behaviour), but when it's time to enable interrupts, an
NMI that didn't itself disable interrupts doesn't trace and instead
lets the next context trace.
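
Concretely, with tracing_irq_cpu turned into a counter as in the patch
above, the broken sequence would become (counter values are my
annotation):

	local_irq_disable -> trace_irq_off  (0->1, fires)
	  nmi_enter       -> no trace       (1->2)
	  nmi_exit        -> no trace       (2->1, the NMI defers to the
	                                     interrupted context)
	local_irq_enable  -> trace_irq_on   (1->0, fires)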

Thanks,
Gabriele

[1] - https://lore.kernel.org/lkml/87sejup1fe.ffs@tglx

