Message-ID: <CAFA6WYNayyOpfu_o8NaoBGvCVSpdfd+ozSUEaqGB=uyX+nAuGg@mail.gmail.com>
Date:   Thu, 31 Aug 2023 18:46:41 +0530
From:   Sumit Garg <sumit.garg@...aro.org>
To:     Mark Rutland <mark.rutland@....com>
Cc:     linux-kernel@...r.kernel.org, dianders@...omium.org,
        keescook@...omium.org, swboyd@...omium.org
Subject: Re: [PATCH] lkdtm/bugs: add test for panic() with stuck secondary CPUs

On Thu, 31 Aug 2023 at 18:38, Mark Rutland <mark.rutland@....com> wrote:
>
> On Thu, Aug 31, 2023 at 06:15:29PM +0530, Sumit Garg wrote:
> > Hi Mark,
> >
> > Thanks for putting up a test case for this.
> >
> > On Thu, 31 Aug 2023 at 15:40, Mark Rutland <mark.rutland@....com> wrote:
> > >
> > > Upon a panic() the kernel will use either smp_send_stop() or
> > > crash_smp_send_stop() to attempt to stop secondary CPUs via an IPI,
> > > which may or may not be an NMI. Generally it's preferable that this is an
> > > NMI so that CPUs can be stopped in as many situations as possible, but
> > > it's not always possible to provide an NMI, and there are cases where
> > > CPUs may be unable to handle the NMI regardless.
> > >
> > > This patch adds a test for panic() where all other CPUs are stuck with
> > > interrupts disabled, which can be used to check whether the kernel
> > > gracefully handles CPUs failing to respond to a stop, and whe NMIs stops
> >
> > s/whe/when/
> >
> > > work.
> > >
> > > For example, on arm64 *without* an NMI, this results in:
> > >
> > > | # echo PANIC_STOP_IRQOFF > /sys/kernel/debug/provoke-crash/DIRECT
> > > | lkdtm: Performing direct entry PANIC_STOP_IRQOFF
> > > | Kernel panic - not syncing: panic stop irqoff test
> > > | CPU: 2 PID: 24 Comm: migration/2 Not tainted 6.5.0-rc3-00077-ge6c782389895-dirty #4
> > > | Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
> > > | Stopper: multi_cpu_stop+0x0/0x1a0 <- stop_machine_cpuslocked+0x158/0x1a4
> > > | Call trace:
> > > |  dump_backtrace+0x94/0xec
> > > |  show_stack+0x18/0x24
> > > |  dump_stack_lvl+0x74/0xc0
> > > |  dump_stack+0x18/0x24
> > > |  panic+0x358/0x3e8
> > > |  lkdtm_PANIC+0x0/0x18
> > > |  multi_cpu_stop+0x9c/0x1a0
> > > |  cpu_stopper_thread+0x84/0x118
> > > |  smpboot_thread_fn+0x224/0x248
> > > |  kthread+0x114/0x118
> > > |  ret_from_fork+0x10/0x20
> > > | SMP: stopping secondary CPUs
> > > | SMP: failed to stop secondary CPUs 0-3
> > > | Kernel Offset: 0x401cf3490000 from 0xffff800080000000
> > > | PHYS_OFFSET: 0x40000000
> > > | CPU features: 0x00000000,68c167a1,cce6773f
> > > | Memory Limit: none
> > > | ---[ end Kernel panic - not syncing: panic stop irqoff test ]---
> > >
> > > On arm64 *with* an NMI, this results in:
> >
> > I suppose a more interesting test scenario to show the difference
> > between an NMI stop IPI and a regular stop IPI would be:
> >
> > - First put any CPU into hard lockup state via:
> >    $ echo HARDLOCKUP > /sys/kernel/debug/provoke-crash/DIRECT
> >
> > - And then provoke following from other CPU:
> >    $ echo PANIC_STOP_IRQOFF > /sys/kernel/debug/provoke-crash/DIRECT
>
> I don't follow. IIUC that's only going to test whether a HW watchdog can fire
> and reset the system?
>
> The PANIC_STOP_IRQOFF test has each CPU run panic_stop_irqoff_fn() with IRQs
> disabled, and if one CPU is stuck in the HARDLOCKUP test, we'll never get all
> CPUs into panic_stop_irqoff_fn(), and so all CPUs will be stuck with IRQs
> disabled, spinning.
>
> The PANIC_STOP_IRQOFF test itself tests the difference between an NMI stop IPI
> and regular stop IPI, as the results in the commit message show. Look for the
> line above that says:
>
> | SMP: failed to stop secondary CPUs 0-3
>
> ... which is *not* present in the NMI case (though we don't have an explicit
> "stopped all CPUs" message).

Ah, I see your point. I missed that difference when I first looked at
the panic() logs. So it's the post-panic() CPU stop behaviour that we
are testing here. Thanks for the explanation.

-Sumit
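
For anyone comparing the two cases by hand, the check described above
(look for the "failed to stop secondary CPUs" line) can be sketched as a
small shell helper. This is only an illustration: the function name
check_panic_stop and the idea of a saved log file are assumptions, not
part of the patch. Since the machine is dead after panic(), the log would
have to be captured externally, for example from a serial console.

```shell
# Hypothetical helper (not part of the patch): classify a captured
# console log from a PANIC_STOP_IRQOFF run.
#
# Prints one of:
#   no-panic     the panic message from the test was not found
#   stop-failed  the kernel reported it could not stop the secondaries
#   stop-ok      panic seen and no failure line reported
check_panic_stop() {
    log_file="$1"

    # Without the panic line there is nothing to classify.
    if ! grep -q 'Kernel panic - not syncing: panic stop irqoff test' "$log_file"; then
        echo "no-panic"
        return 1
    fi

    # This is the line quoted in the thread for the non-NMI case.
    if grep -q 'SMP: failed to stop secondary CPUs' "$log_file"; then
        echo "stop-failed"
    else
        # There is no explicit success message, so absence of the
        # failure line is the only signal available.
        echo "stop-ok"
    fi
}
```

Usage would be something like "check_panic_stop console.log", where
console.log is whatever file the console output was saved to. Note that
"stop-ok" only means the failure line is absent, which matches the
observation above that there is no explicit success message.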
