linux-kernel - Re: [PATCH] arm64/trap: fix broken ct->nmi_nesting when die() is called in a kthread

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aD3kfTx8sF8/Yar/@e129823.arm.com>
Date: Mon, 2 Jun 2025 18:50:53 +0100
From: Yeoreum Yun <yeoreum.yun@....com>
To: Mark Rutland <mark.rutland@....com>
Cc: Will Deacon <will@...nel.org>, catalin.marinas@....com,
	geert@...ux-m68k.org, broonie@...nel.org, mcgrof@...nel.org,
	joey.gouly@....com, kristina.martsenko@....com, rppt@...nel.org,
	pcc@...gle.com, bigeasy@...utronix.de, ptosi@...gle.com,
	james.morse@....com, linux-arm-kernel@...ts.infradead.org,
	linux-kernel@...r.kernel.org, ada.coupriediaz@....com
Subject: Re: [PATCH] arm64/trap: fix broken ct->nmi_nesting when die() is
 called in a kthread

Hi Mark,

> > Hi Will,
> >
> > > [+Ada]
> > >
> > > On Fri, May 30, 2025 at 10:27:23AM +0100, Yeoreum Yun wrote:
> > > > When a kernel thread hits BUG() outside of an interrupt handler and
> > > > panic_on_oops is not set, it exits via make_task_dead(), which is called by die().
> > > > In this case, the nmi_nesting value in context_tracking becomes
> > > > inconsistent because the proper context tracking exit functions are not reached.
> > > >
> > > > Here’s an example scenario on arm64:
> > > >   1. A kernel thread hits the BUG() macro outside an interrupt handler,
> > > >      and panic_on_oops is not set (ct->nmi_nesting == CT_NESTING_IRQ_NONIDLE).
> > > >
> > > >   2. The exception handler jumps to el1_dbg() and calls arm64_enter_el1_dbg(),
> > > >      which invokes ct_nmi_enter(). (ct->nmi_nesting == CT_NESTING_IRQ_NONIDLE + 2)
> > > >
> > > >   3. bug_handler() runs, and if the bug type is BUG_TRAP_TYPE_BUG, it calls die().
> > > >
> > > >   4. die() then calls make_task_dead(), which terminates the kernel thread and
> > > >      schedules another task—assume this is the idle_task.
> > >
> > > This sounds like there is a genuine imbalance, then, as we're scheduling
> > > in the context of an exception taken from EL1.
> >
> > TBH, this "scheduling" is called in do_exit() to kill BUG()
> > happend task:
> >
> >  el1_dbg()
> >     -> arm64_enter_el1_dbg()
> >       -> do_debug_exception()
> >         -> die()
> >          -> make_task_dead
> >            -> do_exit()
> >              -> schedule()
> >     // unreachable
> >     -> arm64_exit_el1_dbg()
> >
> > Because arm64_enter_el1_dbg() always call ct_nmi_enter(),
> > If do_debug_exception determined to call die(), there is no point to
> > call ct_nmi_exit().
>
> One of the reasons we treak BRK as an NMI is that exception entry for
> BRK will leave all DAIF bits set, whereas schedule() should be called
> with debug and SError unmasked (but IRQ+FIQ masked). Generally, calling
> ct_nmi_enter() prevents preemption (and hence calls to schedule()).

I think ct_nmi_enter() doesn't prevents preemption but
debug_exception_enter() disables preemption.


> Another is that we may have a BUG() or WARN() in entry code where the
> task could be in an inconsistent state, and we need to treat the
> exception like an NMI to avoid consuming that inconsistent state.

So, let's think the "inconsistent" state like:
  -> el0_enter()
	  -> enter_from_user_mode()
		  -> before update ct_state (context_tracking.state), call BUG()/WARN()
			  -> el1_dbg()

It need to call ct_nmi_enter() in el1_dbg() right?

> To handle that properly, we need to:
>
> (a) Figure out what to do with entry code. Last I looked I was under the
>     impression that x86 either didn't have a problem here, or simply
>     ignored it.

TBH, in above case, x86 seems context_traking.state will be broken...
>
> (b) Handle BUG/WARN traps separately from other BRKs, such that we can
>     use local_daif_inherit(), and treat this as a special function call
>     rather than an NMI.
>
> (c) Somehow teach make_task_dead() to handle the case where DAIF.D
>     and/or DAIF.A are set. Most likely we simply have to panic() here,
>     as with BUG() in interrupt context.

Right... It should handle for DAIF.D and DAIF.A bits...

>
> > > >   5. The idle_task attempts to enter the idle state by calling ct_idle_enter().
> > > >      However, since the current ct->nmi_nesting value is CT_NESTING_IRQ_NONIDLE + 2,
> > > >      ct_kernel_exit() triggers a WARN_ON_ONCE() warning.
> > > >
> > > > Because the kernel thread couldn’t call the appropriate context tracking exit
> > > > function in step 3, the ct->nmi_nesting value remains incorrect.
> > > > This leads to warnings like the following:
> > > >
> > > > [    7.133093] ------------[ cut here ]------------
> > > > [    7.133129] WARNING: CPU: 2 PID: 0 at kernel/context_tracking.c:127 ct_kernel
> > > > [    7.134157] Modules linked in:
> > > > [    7.134158]     not ok 62 kasan_strings
> > > > [    7.134280]
> > > > [    7.134506] CPU: 2 UID: 0 PID: 0 Comm: swapper/2 Tainted: G    B D W        N
> > > > [    7.134930] Tainted: [B]=BAD_PAGE, [D]=DIE, [W]=WARN, [N]=TEST
> > > > [    7.135150] pstate: 204003c5 (nzCv DAIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > > > [    7.135441] pc : ct_kernel_exit+0xa4/0xb0
> > > > [    7.135656] lr : ct_kernel_exit+0x1c/0xb0
> > > > [    7.135866] sp : ffff8000829bbd90
> > > > [    7.136011] x29: ffff8000829bbd90 x28: ffff80008224ecf0 x27: 0000000000000004
> > > > [    7.136379] x26: ffff80008228ed30 x25: ffff80008228e000 x24: 0000000000000000
> > > > [    7.137016] x23: f3ff000800a52280 x22: 0000000000000000 x21: ffff00087b6c7408
> > > > [    7.137380] x20: ffff80008224b408 x19: 0000000000000005 x18: 0000000000000000
> > > > [    7.137741] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
> > > > [    7.311316] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
> > > > [    7.311673] x11: 0000000000000000 x10: 0000000000000000 x9 : 4000000000000000
> > > > [    7.312031] x8 : 4000000000000002 x7 : 0000000000000000 x6 : 0000000000000000
> > > > [    7.312387] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000
> > > > [    7.312740] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff8007f947c000
> > > > [    7.313096] Call trace:
> > > > [    7.313210]  ct_kernel_exit+0xa4/0xb0 (P)
> > > > [    7.313445]  ct_idle_enter+0x14/0x28
> > > > [    7.313666]  default_idle_call+0x2c/0x60
> > > > [    7.313902]  do_idle+0xec/0x320
> > > > [    7.314104]  cpu_startup_entry+0x40/0x50
> > > > [    7.314331]  secondary_start_kernel+0x120/0x1a0
> > > >
> > > > This behavior is specific to the arm64 architecture, where ct_nmi_enter()
> > > > is called when handling a debug exception.
> > > > In contrast, on other architectures, ct_nmi_enter() is not called when
> > > > handling BUG().
> > > > (i.e) when handling X86_TRAP_UD via handle_invalid_op(), it doesn't call
> > > > ct_nmi_enter(). so it doesn’t cause any issues
> > > > (While irq_entry_enter() does call ct_nmi_enter() for idle tasks,
> > > > that doesn’t apply to debug exception handling).
> > >
> > > It sounds like you're suggesting that we don't update the
> > > context-tracking NMI state for BRK exceptions from EL1, to align
> > > with x86.
> >
> > If el1_dbg() doesn't be called in idle_task(),
> > I think it doesn't need to call ct_nmi_enter() in arm64_enter_el1_debug()
> > since its nmi_nesting is always >= CT_NESTING_IRQ_NONIDLE and RCU wathcing this cpu.
> >
> > But, it seems el1_dbg() could be called ct_idle_enter() and ct_idle_exit().
> > actually this situation seems be possible in theory when
> > some idle code have BUG() -- i.e) cpuidle driver's enter callback have BUG().
> > However, this case triggers another quetions. what happen if idle_task was
> > killed (I think it seems panic() case...)
> >
> > So, If arm64_enter_el1_debug() doesn't need to call the ct_nmi_enter()
> > instead, __nmi_enter() should be called only for idle_task().
> >
> > Am I wrong?
>
> As above, I do not think that this is sufficient.
>
> > > I think Ada's pending series might make that easier, but then
> > > the patch you propose:
> > >
> > > > diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
> > > > index 529cff825531..9cf03b9ce691 100644
> > > > --- a/arch/arm64/kernel/traps.c
> > > > +++ b/arch/arm64/kernel/traps.c
> > > > @@ -227,8 +227,14 @@ void die(const char *str, struct pt_regs *regs, long err)
> > > >
> > > >  	raw_spin_unlock_irqrestore(&die_lock, flags);
> > > >
> > > > -	if (ret != NOTIFY_STOP)
> > > > +	if (ret != NOTIFY_STOP) {
> > > > +#ifdef CONFIG_CONTEXT_TRACKING_IDLE
> > > > +		long nmi_nesting = ct_nmi_nesting();
> > > > +		if (nmi_nesting && nmi_nesting != CT_NESTING_IRQ_NONIDLE)
> > > > +			ct_nmi_exit();
> > > > +#endif
> > >
> > > tries to undo the context-tracking when we realise we're going to kill
> > > the task, which feels like a hack.
> >
> > Although her patches is applied,
> > I think this problem still exist if arm64_enter_el1_dbg() calls ct_nmi_enter().
>
> The idea is that Ada's series will make it *possible* to handle this
> correctly.
>
> > I agree it's a hacky way for handling kernel task die() in debug
> > exception since in case of user task will be killed via signal.
> > However, unless arm64_enter_el1_dbg() calls ct_nmi_enter(),
> > In my narrow view, it seems the best...
>
> As-is, I think an extra warning in the case of a BUG() is fine given
> the larger functional issues.
>
> I do not think this patch is correct as-is.

So, what I think:
  1. arm64_enter_el1_dbg() should ct_nmi_enter() as it is.
  2. in bug_handler() while handling BUG_TYPE, add above ct_nmi_exit()
     conditional call.
  3. DAIF.D and DAIF.A handling.

Is there any missing?

Thanks!

--
Sincerely,
Yeoreum Yun