linux-kernel - Re: [PATCH 3/4] x86/kexec(): Reset TDX private memory on platforms with TDX erratum

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <a139b1e0-7737-4c20-842c-64457dba17ca@intel.com>
Date: Thu, 1 Feb 2024 22:35:06 +0800
From: "Huang, Kai" <kai.huang@...el.com>
To: Dave Hansen <dave.hansen@...el.com>, <linux-kernel@...r.kernel.org>
CC: <x86@...nel.org>, <kirill.shutemov@...ux.intel.com>, <tglx@...utronix.de>,
	<bp@...en8.de>, <mingo@...hat.com>, <hpa@...or.com>, <luto@...nel.org>,
	<peterz@...radead.org>, <thomas.lendacky@....com>, <chao.gao@...el.com>,
	<bhe@...hat.com>, <nik.borisov@...e.com>, <pbonzini@...hat.com>
Subject: Re: [PATCH 3/4] x86/kexec(): Reset TDX private memory on platforms
 with TDX erratum



On 1/02/2024 5:21 am, Dave Hansen wrote:
> On 1/31/24 03:31, Huang, Kai wrote:
>> --- a/arch/x86/kernel/machine_kexec_64.c
>> +++ b/arch/x86/kernel/machine_kexec_64.c
>> @@ -28,6 +28,7 @@
>>   #include <asm/setup.h>
>>   #include <asm/set_memory.h>
>>   #include <asm/cpu.h>
>> +#include <asm/tdx.h>
>>   
>>   #ifdef CONFIG_ACPI
>>   /*
>> @@ -298,9 +299,24 @@ void machine_kexec(struct kimage *image)
>>   	void *control_page;
>>   	int save_ftrace_enabled;
>>   
>> +	/*
>> +	 * For platforms with TDX "partial write machine check" erratum,
>> +	 * all TDX private pages need to be converted back to normal
>> +	 * before booting to the new kernel, otherwise the new kernel
>> +	 * may get unexpected machine check.
>> +	 *
>> +	 * But skip this when preserve_context is on.  The second kernel
>> +	 * shouldn't write to the first kernel's memory anyway.  Skipping
>> +	 * this also avoids killing TDX in the first kernel, which would
>> +	 * require more complicated handling.
>> +	 */
> 
> This is waaaaaaaaaaaaaaay too much information to stick in a generic
> function.  Just call tdx_reset_memory() and make the argument about

Hi Dave,

Thanks for review.

Agreed too much info here.  But I don't quite understand your second 
sentence here.  Can I have your suggestion again?

> 
>>   #ifdef CONFIG_KEXEC_JUMP
>>   	if (image->preserve_context)
>>   		save_processor_state();
>> +	else
>> +		tdx_reset_memory();
>> +#else
>> +	tdx_reset_memory();
>>   #endif
> 
> Wow, that's awfully hard to read.  I really wish folks' gag reflex would
> kick in when they see stuff like this to get them to spend an additional
> 15 seconds to turn this into:
> 
> 	if (IS_ENABLED(CONFIG_KEXEC_JUMP) && image->preserve_context)
> 		save_processor_state();
> 	else
> 		tdx_reset_memory();
> 

Yeah this is way better, but as Kirill pointed out we will have build 
error when both SUSPEND and HIBERNATION is off.

Maybe below?

#ifdef CONFIG_KEXEC_JUMP
	if (image->preserve_context)
		save_processor_state();
	else
#endif
	tdx_reset_memory();


>>   	save_ftrace_enabled = __ftrace_enabled_save();
>> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
>> index 9f1fed458a32..0537b1b76c2b 100644
>> --- a/arch/x86/virt/vmx/tdx/tdx.c
>> +++ b/arch/x86/virt/vmx/tdx/tdx.c
>> @@ -28,6 +28,7 @@
>>   #include <linux/acpi.h>
>>   #include <linux/suspend.h>
>>   #include <linux/acpi.h>
>> +#include <linux/reboot.h>
>>   #include <asm/page.h>
>>   #include <asm/special_insns.h>
>>   #include <asm/msr-index.h>
>> @@ -54,6 +55,8 @@ static DEFINE_MUTEX(tdx_module_lock);
>>   /* All TDX-usable memory regions.  Protected by mem_hotplug_lock. */
>>   static LIST_HEAD(tdx_memlist);
>>   
>> +static bool tdx_rebooting;
>> +
>>   typedef void (*sc_err_func_t)(u64 fn, u64 err, struct tdx_module_args *args);
>>   
>>   static inline void seamcall_err(u64 fn, u64 err, struct tdx_module_args *args)
>> @@ -1187,6 +1190,9 @@ static int __tdx_enable(void)
>>   {
>>   	int ret;
>>   
>> +	if (tdx_rebooting)
>> +		return -EAGAIN;
> 
> This whole 'tdx_rebooting' deserves to at least be in its own patch.
> Also -EAGAIN seems to be a really odd return code for a permanent failure.

Will move to a separate patch.

How about just -EINVAL?

> 
>>   	ret = init_tdx_module();
>>   	if (ret) {
>>   		pr_err("module initialization failed (%d)\n", ret);
>> @@ -1420,6 +1426,90 @@ static struct notifier_block tdx_memory_nb = {
>>   	.notifier_call = tdx_memory_notifier,
>>   };
>>   
>> +/*
>> + * Convert TDX private pages back to normal on platforms with
>> + * "partial write machine check" erratum.
>> + *
>> + * Called from machine_kexec() before booting to the new kernel.
>> + */
>> +void tdx_reset_memory(void)
>> +{
>> +	if (!boot_cpu_has(X86_FEATURE_TDX_HOST_PLATFORM))
>> +		return;
>> +
>> +	/*
>> +	 * Kernel read/write to TDX private memory doesn't
>> +	 * cause machine check on hardware w/o this erratum.
>> +	 */
>> +	if (!boot_cpu_has_bug(X86_BUG_TDX_PW_MCE))
>> +		return;
>> +
>> +	/* Called from kexec() when only rebooting cpu is alive */
>> +	WARN_ON_ONCE(num_online_cpus() != 1);
>> +
>> +	/*
>> +	 * tdx_reboot_notifier() waits until ongoing TDX module
>> +	 * initialization to finish, and module initialization is
>> +	 * rejected after that.  Therefore @tdx_module_status is
>> +	 * stable here and can be read w/o holding lock.
>> +	 */
>> +	if (tdx_module_status != TDX_MODULE_INITIALIZED)
>> +		return;
> 
> kexec() can't happen until after reboot notifiers are called.
> tdx_reboot_notifier() makes 'tdx_module_status' stable, so no lock is
> needed.
> 
> Right?

Yes.  I can replace the comment with your words.

> 
>> +	/*
>> +	 * Flush cache of all TDX private memory _before_ converting
>> +	 * them back to avoid silent memory corruption.
>> +	 */
>> +	native_wbinvd();
> 
> Since this is single-threaded, it also needs to mention where all the
> other CPU caches got flushed.

OK will point out.

> 
>> +	/*
>> +	 * Convert PAMTs back to normal.  All other cpus are already
>> +	 * dead and TDMRs/PAMTs are stable.
>> +	 *
>> +	 * Ideally it's better to cover all types of TDX private pages
>> +	 * here, but it's impractical:
>> +	 *
>> +	 *  - There's no existing infrastructure to tell whether a page
>> +	 *    is TDX private memory or not.
>> +	 *
>> +	 *  - Using SEAMCALL to query TDX module isn't feasible either:
>> +	 *    - VMX has been turned off by reaching here so SEAMCALL
>> +	 *      cannot be made;
>> +	 *    - Even SEAMCALL can be made the result from TDX module may
> 
> 		    ^ if     ^ a          ^,
> 

Thanks.  Will add.

>> +	 *      not be accurate (e.g., remote CPU can be stopped while
>> +	 *      the kernel is in the middle of reclaiming TDX private
>> +	 *      page and doing MOVDIR64B).
>> +	 *
>> +	 * One temporary solution could be just converting all memory
>> +	 * pages, but it's problematic too, because not all pages are
>> +	 * mapped as writable in direct mapping.  It can be done by
>> +	 * switching to the identical mapping for kexec() or a new page
>> +	 * table which maps all pages as writable, but the complexity is
>> +	 * overkill.
>> +	 *
>> +	 * Thus instead of doing something dramatic to convert all pages,
>> +	 * only convert PAMTs here.  Other kernel components which use
>> +	 * TDX need to do the conversion on their own by intercepting the
>> +	 * rebooting/shutdown notifier (KVM already does that).
>> +	 */
> 
> I'd leave the extended alternatives discussion in the changelog, not here.
> 
> Focus on what _this_ is doing and why it is imperfect:
> 
>   1. Just reset the PAMDs
>   2. This leaves A, B, and C undealt with
>   3. The risk of leaving those is ...
> 

Agreed. Will do.

> 
>> +	tdmrs_reset_pamt_all(&tdx_tdmr_list);
>> +}
>> +
>> +static int tdx_reboot_notifier(struct notifier_block *nb, unsigned long mode,
>> +			       void *unused)
>> +{
>> +	/* Wait ongoing TDX initialization to finish */
>> +	mutex_lock(&tdx_module_lock);
>> +	tdx_rebooting = true;
>> +	mutex_unlock(&tdx_module_lock);
>> +
>> +	return NOTIFY_OK;
>> +}
> 
> Why is 'tdx_rebooting' a new variable instead of a new state in
> tdx_module_status?
> 

I think we can but I believe using tdx_rebooting is more clear.

For instance, if we want to add a new state TDX_MODULE_ABORT for this 
case, when tdx_reboot_notifier() is called after module initialization, 
then the tdx_module_status (which is already TDX_MODULE_INITIALIZED) 
will be overwritten and we will lose the information that module 
initialization has been done successfully.

We may be able to avoid these using more complicated logic but I think 
using a separate tdx_rebooting is straightforward.