Message-ID: <20230603193439.502645149@linutronix.de>
Date:   Sat,  3 Jun 2023 22:06:54 +0200 (CEST)
From:   Thomas Gleixner <tglx@...utronix.de>
To:     LKML <linux-kernel@...r.kernel.org>
Cc:     x86@...nel.org, Ashok Raj <ashok.raj@...ux.intel.com>,
        Dave Hansen <dave.hansen@...ux.intel.com>,
        Tony Luck <tony.luck@...el.com>,
        Arjan van de Veen <arjan@...ux.intel.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Eric Biederman <ebiederm@...ssion.com>
Subject: [patch 0/6] Cure kexec() vs. mwait_play_dead() troubles

Hi!

Ashok observed triple faults when executing kexec() on a kernel which has
'nosmt' on the kernel commandline and HT enabled in the BIOS.

'nosmt' brings up the HT siblings to the point where they initialized the
CPU and then rolls the bringup back which parks them in mwait_play_dead().
The reason is that all CPUs should have CR4.MCE set. Otherwise a broadcast
MCE will immediately shut down the machine.

Some detective work revealed that:

  1) The kexec kernel can overwrite text, pagetables, stack and data of the
     previous kernel.

  2) If the kexec kernel writes to the memory which is monitored by an
     "offline" CPU, that CPU resumes execution. That's obviously doomed
     when the kexec kernel overwrote text, pagetables, data or stack.

While on my test machine the first kexec() after reset always "worked", the
second one reliably ended up in a triple fault.

The following series cures this by:

  1) Bringing offline CPUs which are stuck in mwait_play_dead() out of
     mwait by writing to the monitored cacheline

  2) Letting the woken-up CPUs check the written control word and drop
     into a HLT loop if the control word requests it.
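
As an illustration only (this is not code from the series), the wake and
check handshake in steps 1) and 2) can be modelled in user space. The sketch
below assumes a thread standing in for the "offline" CPU, a polled atomic
standing in for the MONITOR/MWAIT cacheline, and thread exit standing in for
the HLT loop. The names parked_cpu, CTRL_IDLE, CTRL_KEXEC_HLT, play_dead()
and kexec_park_offline_cpu() are hypothetical and chosen for this example.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

/* Hypothetical control values; the series' real encoding is not shown here. */
#define CTRL_IDLE      0u
#define CTRL_KEXEC_HLT 1u

enum cpu_state { CPU_IN_MWAIT, CPU_IN_HLT };

struct parked_cpu {
	atomic_uint control;	/* stands in for the monitored cacheline */
	atomic_int  state;	/* where the simulated CPU currently sits */
};

/* Simulated "offline" CPU: polls the control word instead of using MWAIT. */
static void *play_dead(void *arg)
{
	struct parked_cpu *cpu = arg;

	/* Every wakeup re-checks the control word; only a request leaves the wait. */
	while (atomic_load(&cpu->control) != CTRL_KEXEC_HLT)
		sched_yield();

	/* A real CPU would now sit in a HLT loop; the thread records that and exits. */
	atomic_store(&cpu->state, CPU_IN_HLT);
	return NULL;
}

/* Simulated kexec path: write the control word, then wait for the CPU to move. */
static int kexec_park_offline_cpu(void)
{
	struct parked_cpu cpu;
	pthread_t thread;
	int ret;

	atomic_init(&cpu.control, CTRL_IDLE);
	atomic_init(&cpu.state, CPU_IN_MWAIT);

	ret = pthread_create(&thread, NULL, play_dead, &cpu);
	assert(ret == 0);
	(void)ret;

	/* The "write to the monitored cacheline" which kicks the CPU out of the wait. */
	atomic_store(&cpu.control, CTRL_KEXEC_HLT);
	pthread_join(thread, NULL);

	return atomic_load(&cpu.state);
}
```

Calling kexec_park_offline_cpu() returns CPU_IN_HLT once the simulated CPU
has seen the request. The model only covers the handshake; what can still
wake a CPU out of the HLT loop is outside of it.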

This is only half safe because HLT can resume execution due to NMI, SMI and
MCE. Unfortunately there is no really safe mechanism to "park" a CPU reliably,
but there is at least one which rules out NMI and SMI as wakeup causes: INIT.

  3) If the system uses the regular INIT/STARTUP sequence to wake up
     secondary CPUs, then "park" all CPUs including the "offline" ones
     by sending them INIT IPIs.

The INIT IPI brings the CPU into a wait for wakeup state which is not
affected by NMI and SMI, but INIT also clears CR4.MCE, so the broadcast MCE
problem comes back.

But that's not really any different from a CPU sitting in the HLT loop on
the previous kernel. If a broadcast MCE arrives, HLT resumes execution and
the CPU tries to handle the MCE on overwritten text, pagetables etc.

So parking them via INIT does not completely solve the problem, but it
takes at least NMI and SMI out of the picture.
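
The trade-off between the two parking methods can be written down as a small
lookup. This is a sketch which encodes only the statements made in this mail,
not kernel code; the enum names and the helpers park_outcome() and
count_unsafe_events() are hypothetical.

```c
enum park_state { PARK_HLT, PARK_INIT };
enum event      { EV_NMI, EV_SMI, EV_BROADCAST_MCE, EV_COUNT };
enum outcome    { STAYS_PARKED, RESUMES_EXECUTION, MACHINE_SHUTDOWN };

/* What happens to a parked CPU of the previous kernel when an event arrives. */
static enum outcome park_outcome(enum park_state state, enum event ev)
{
	/* HLT: NMI, SMI and MCE all resume execution, on possibly overwritten memory. */
	if (state == PARK_HLT)
		return RESUMES_EXECUTION;

	/*
	 * INIT: the wait for wakeup state is not affected by NMI and SMI,
	 * but CR4.MCE is cleared, so a broadcast MCE shuts the machine down.
	 */
	return ev == EV_BROADCAST_MCE ? MACHINE_SHUTDOWN : STAYS_PARKED;
}

/* Number of events which do not leave the CPU safely parked. */
static int count_unsafe_events(enum park_state state)
{
	int n = 0;

	for (int ev = 0; ev < EV_COUNT; ev++)
		if (park_outcome(state, (enum event)ev) != STAYS_PARKED)
			n++;
	return n;
}
```

With this model count_unsafe_events(PARK_HLT) is 3 and
count_unsafe_events(PARK_INIT) is 1: only the broadcast MCE case remains.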

The series is also available from git:

    git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git x86/kexec

Thanks,

	tglx
---
 include/asm/smp.h |    4 +
 kernel/smp.c      |   62 +++++++++++++---------
 kernel/smpboot.c  |  151 ++++++++++++++++++++++++++++++++++++++++--------------
 3 files changed, 156 insertions(+), 61 deletions(-)