linux-kernel - Re: [PATCH 3/4] x86/kexec(): Reset TDX private memory on platforms with TDX erratum

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <12a927df-e437-40ff-ba4d-ceca5446c5b1@intel.com>
Date: Wed, 31 Jan 2024 13:21:39 -0800
From: Dave Hansen <dave.hansen@...el.com>
To: "Huang, Kai" <kai.huang@...el.com>, linux-kernel@...r.kernel.org
Cc: x86@...nel.org, kirill.shutemov@...ux.intel.com, tglx@...utronix.de,
 bp@...en8.de, mingo@...hat.com, hpa@...or.com, luto@...nel.org,
 peterz@...radead.org, thomas.lendacky@....com, chao.gao@...el.com,
 bhe@...hat.com, nik.borisov@...e.com, pbonzini@...hat.com
Subject: Re: [PATCH 3/4] x86/kexec(): Reset TDX private memory on platforms
 with TDX erratum

On 1/31/24 03:31, Huang, Kai wrote:
> --- a/arch/x86/kernel/machine_kexec_64.c
> +++ b/arch/x86/kernel/machine_kexec_64.c
> @@ -28,6 +28,7 @@
>  #include <asm/setup.h>
>  #include <asm/set_memory.h>
>  #include <asm/cpu.h>
> +#include <asm/tdx.h>
>  
>  #ifdef CONFIG_ACPI
>  /*
> @@ -298,9 +299,24 @@ void machine_kexec(struct kimage *image)
>  	void *control_page;
>  	int save_ftrace_enabled;
>  
> +	/*
> +	 * For platforms with TDX "partial write machine check" erratum,
> +	 * all TDX private pages need to be converted back to normal
> +	 * before booting to the new kernel, otherwise the new kernel
> +	 * may get unexpected machine check.
> +	 *
> +	 * But skip this when preserve_context is on.  The second kernel
> +	 * shouldn't write to the first kernel's memory anyway.  Skipping
> +	 * this also avoids killing TDX in the first kernel, which would
> +	 * require more complicated handling.
> +	 */

This is waaaaaaaaaaaaaaay too much information to stick in a generic
function.  Just call tdx_reset_memory() and make the argument about

>  #ifdef CONFIG_KEXEC_JUMP
>  	if (image->preserve_context)
>  		save_processor_state();
> +	else
> +		tdx_reset_memory();
> +#else
> +	tdx_reset_memory();
>  #endif

Wow, that's awfully hard to read.  I really wish folks' gag reflex would
kick in when they see stuff like this to get them to spend an additional
15 seconds to turn this into:

	if (IS_ENABLED(CONFIG_KEXEC_JUMP) && image->preserve_context)
		save_processor_state();
	else
		tdx_reset_memory();

>  	save_ftrace_enabled = __ftrace_enabled_save();
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 9f1fed458a32..0537b1b76c2b 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -28,6 +28,7 @@
>  #include <linux/acpi.h>
>  #include <linux/suspend.h>
>  #include <linux/acpi.h>
> +#include <linux/reboot.h>
>  #include <asm/page.h>
>  #include <asm/special_insns.h>
>  #include <asm/msr-index.h>
> @@ -54,6 +55,8 @@ static DEFINE_MUTEX(tdx_module_lock);
>  /* All TDX-usable memory regions.  Protected by mem_hotplug_lock. */
>  static LIST_HEAD(tdx_memlist);
>  
> +static bool tdx_rebooting;
> +
>  typedef void (*sc_err_func_t)(u64 fn, u64 err, struct tdx_module_args *args);
>  
>  static inline void seamcall_err(u64 fn, u64 err, struct tdx_module_args *args)
> @@ -1187,6 +1190,9 @@ static int __tdx_enable(void)
>  {
>  	int ret;
>  
> +	if (tdx_rebooting)
> +		return -EAGAIN;

This whole 'tdx_rebooting' deserves to at least be in its own patch.
Also -EAGAIN seems to be a really odd return code for a permanent failure.

>  	ret = init_tdx_module();
>  	if (ret) {
>  		pr_err("module initialization failed (%d)\n", ret);
> @@ -1420,6 +1426,90 @@ static struct notifier_block tdx_memory_nb = {
>  	.notifier_call = tdx_memory_notifier,
>  };
>  
> +/*
> + * Convert TDX private pages back to normal on platforms with
> + * "partial write machine check" erratum.
> + *
> + * Called from machine_kexec() before booting to the new kernel.
> + */
> +void tdx_reset_memory(void)
> +{
> +	if (!boot_cpu_has(X86_FEATURE_TDX_HOST_PLATFORM))
> +		return;
> +
> +	/*
> +	 * Kernel read/write to TDX private memory doesn't
> +	 * cause machine check on hardware w/o this erratum.
> +	 */
> +	if (!boot_cpu_has_bug(X86_BUG_TDX_PW_MCE))
> +		return;
> +
> +	/* Called from kexec() when only rebooting cpu is alive */
> +	WARN_ON_ONCE(num_online_cpus() != 1);
> +
> +	/*
> +	 * tdx_reboot_notifier() waits until ongoing TDX module
> +	 * initialization to finish, and module initialization is
> +	 * rejected after that.  Therefore @tdx_module_status is
> +	 * stable here and can be read w/o holding lock.
> +	 */
> +	if (tdx_module_status != TDX_MODULE_INITIALIZED)
> +		return;

kexec() can't happen until after reboot notifiers are called.
tdx_reboot_notifier() makes 'tdx_module_status' stable, so no lock is
needed.

Right?

> +	/*
> +	 * Flush cache of all TDX private memory _before_ converting
> +	 * them back to avoid silent memory corruption.
> +	 */
> +	native_wbinvd();

Since this is single-threaded, it also needs to mention where all the
other CPU caches got flushed.

> +	/*
> +	 * Convert PAMTs back to normal.  All other cpus are already
> +	 * dead and TDMRs/PAMTs are stable.
> +	 *
> +	 * Ideally it's better to cover all types of TDX private pages
> +	 * here, but it's impractical:
> +	 *
> +	 *  - There's no existing infrastructure to tell whether a page
> +	 *    is TDX private memory or not.
> +	 *
> +	 *  - Using SEAMCALL to query TDX module isn't feasible either:
> +	 *    - VMX has been turned off by reaching here so SEAMCALL
> +	 *      cannot be made;
> +	 *    - Even SEAMCALL can be made the result from TDX module may

		    ^ if     ^ a          ^,

> +	 *      not be accurate (e.g., remote CPU can be stopped while
> +	 *      the kernel is in the middle of reclaiming TDX private
> +	 *      page and doing MOVDIR64B).
> +	 *
> +	 * One temporary solution could be just converting all memory
> +	 * pages, but it's problematic too, because not all pages are
> +	 * mapped as writable in direct mapping.  It can be done by
> +	 * switching to the identical mapping for kexec() or a new page
> +	 * table which maps all pages as writable, but the complexity is
> +	 * overkill.
> +	 *
> +	 * Thus instead of doing something dramatic to convert all pages,
> +	 * only convert PAMTs here.  Other kernel components which use
> +	 * TDX need to do the conversion on their own by intercepting the
> +	 * rebooting/shutdown notifier (KVM already does that).
> +	 */

I'd leave the extended alternatives discussion in the changelog, not here.

Focus on what _this_ is doing and why it is imperfect:

 1. Just reset the PAMDs
 2. This leaves A, B, and C undealt with
 3. The risk of leaving those is ...


> +	tdmrs_reset_pamt_all(&tdx_tdmr_list);
> +}
> +
> +static int tdx_reboot_notifier(struct notifier_block *nb, unsigned long mode,
> +			       void *unused)
> +{
> +	/* Wait ongoing TDX initialization to finish */
> +	mutex_lock(&tdx_module_lock);
> +	tdx_rebooting = true;
> +	mutex_unlock(&tdx_module_lock);
> +
> +	return NOTIFY_OK;
> +}

Why is 'tdx_rebooting' a new variable instead of a new state in
tdx_module_status?