linux-kernel - Re: [patch v3 5/7] x86/smp: Cure kexec() vs. mwait_play

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20230620092324.GWZJFwDDVo3TRUI0Ck@fat_crate.local>
Date:   Tue, 20 Jun 2023 11:23:24 +0200
From:   Borislav Petkov <bp@...en8.de>
To:     Thomas Gleixner <tglx@...utronix.de>
Cc:     LKML <linux-kernel@...r.kernel.org>, x86@...nel.org,
        Mario Limonciello <mario.limonciello@....com>,
        Tom Lendacky <thomas.lendacky@....com>,
        Tony Battersby <tonyb@...ernetics.com>,
        Ashok Raj <ashok.raj@...ux.intel.com>,
        Tony Luck <tony.luck@...el.com>,
        Arjan van de Veen <arjan@...ux.intel.com>,
        Eric Biederman <ebiederm@...ssion.com>,
        Ashok Raj <ashok.raj@...el.com>
Subject: Re: [patch v3 5/7] x86/smp: Cure kexec() vs. mwait_play_dead()
 breakage

On Thu, Jun 15, 2023 at 10:33:57PM +0200, Thomas Gleixner wrote:
> TLDR: It's a mess.

LOL. You don't need the rest of the commit message - this says it all.
:-P

> When kexec() is executed on a system with "offline" CPUs, which are parked

I'd say simply "with offlined CPUs"

> in mwait_play_dead() it can end up in a triple fault during the bootup of
> the kexec kernel or cause hard to diagnose data corruption.
> 
> The reason is that kexec() eventually overwrites the previous kernels text,

kernel's

> page tables, data and stack, If it writes to the cache line which is

... and stack. If it ...

> monitored by an previously offlined CPU, MWAIT resumes execution and ends

... by a previously...

> up executing the wrong text, dereferencing overwritten page tables or
> corrupting the kexec kernels data.

Lovely.

> Cure this by bringing the offline CPUs out of MWAIT into HLT.

... offlined CPUs... :

> Write to the monitored cache line of each offline CPU, which makes MWAIT

ditto.

> resume execution. The written control word tells the offline CPUs to issue

ditto and so on.

> HLT, which does not have the MWAIT problem.
> 
> That does not help, if a stray NMI, MCE or SMI hits the offline CPUs as
> those make it come out of HLT.
> 
> A follow up change will put them into INIT, which protects at least against
> NMI and SMI.

...

> @@ -1807,6 +1815,10 @@ static inline void mwait_play_dead(void)
>  			(highest_subcstate - 1);
>  	}
>  
> +	/* Set up state for the kexec() hack below */
> +	md->status = CPUDEAD_MWAIT_WAIT;
> +	md->control = CPUDEAD_MWAIT_WAIT;
> +
>  	wbinvd();
>  
>  	while (1) {
> @@ -1824,10 +1836,57 @@ static inline void mwait_play_dead(void)

JFYI: that last hunk has some conflicts applying to latest tip/master.
Might need merge resolving...

>  		mb();
>  		__mwait(eax, 0);
>  
> +		if (READ_ONCE(md->control) == CPUDEAD_MWAIT_KEXEC_HLT) {
> +			/*
> +			 * Kexec is about to happen. Don't go back into mwait() as
> +			 * the kexec kernel might overwrite text and data including
> +			 * page tables and stack. So mwait() would resume when the
> +			 * monitor cache line is written to and then the CPU goes
> +			 * south due to overwritten text, page tables and stack.
> +			 *
> +			 * Note: This does _NOT_ protect against a stray MCE, NMI,
> +			 * SMI. They will resume execution at the instruction
> +			 * following the HLT instruction and run into the problem
> +			 * which this is trying to prevent.
> +			 */
> +			WRITE_ONCE(md->status, CPUDEAD_MWAIT_KEXEC_HLT);
> +			while(1)
> +				native_halt();
> +		}
> +
>  		cond_wakeup_cpu0();
>  	}
>  }
>  
> +/*
> + * Kick all "offline" CPUs out of mwait on kexec(). See comment in
> + * mwait_play_dead().
> + */
> +void smp_kick_mwait_play_dead(void)
> +{
> +	u32 newstate = CPUDEAD_MWAIT_KEXEC_HLT;

Do you even need this newstate thing?

> +	struct mwait_cpu_dead *md;
> +	unsigned int cpu, i;
> +
> +	for_each_cpu_andnot(cpu, cpu_present_mask, cpu_online_mask) {
> +		md = per_cpu_ptr(&mwait_cpu_dead, cpu);
> +
> +		/* Does it sit in mwait_play_dead() ? */
> +		if (READ_ONCE(md->status) != CPUDEAD_MWAIT_WAIT)
> +			continue;
> +
> +		/* Wait maximal 5ms */

		/* Wait 5ms maximum */

> +		for (i = 0; READ_ONCE(md->status) != newstate && i < 1000; i++) {
> +			/* Bring it out of mwait */
> +			WRITE_ONCE(md->control, newstate);
> +			udelay(5);
> +		}
> +
> +		if (READ_ONCE(md->status) != newstate)
> +			pr_err("CPU%u is stuck in mwait_play_dead()\n", cpu);

Shouldn't this be a pr_err_once thing so that it doesn't flood the
console unnecessarily?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette