[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <a139b1e0-7737-4c20-842c-64457dba17ca@intel.com>
Date: Thu, 1 Feb 2024 22:35:06 +0800
From: "Huang, Kai" <kai.huang@...el.com>
To: Dave Hansen <dave.hansen@...el.com>, <linux-kernel@...r.kernel.org>
CC: <x86@...nel.org>, <kirill.shutemov@...ux.intel.com>, <tglx@...utronix.de>,
<bp@...en8.de>, <mingo@...hat.com>, <hpa@...or.com>, <luto@...nel.org>,
<peterz@...radead.org>, <thomas.lendacky@....com>, <chao.gao@...el.com>,
<bhe@...hat.com>, <nik.borisov@...e.com>, <pbonzini@...hat.com>
Subject: Re: [PATCH 3/4] x86/kexec(): Reset TDX private memory on platforms
with TDX erratum
On 1/02/2024 5:21 am, Dave Hansen wrote:
> On 1/31/24 03:31, Huang, Kai wrote:
>> --- a/arch/x86/kernel/machine_kexec_64.c
>> +++ b/arch/x86/kernel/machine_kexec_64.c
>> @@ -28,6 +28,7 @@
>> #include <asm/setup.h>
>> #include <asm/set_memory.h>
>> #include <asm/cpu.h>
>> +#include <asm/tdx.h>
>>
>> #ifdef CONFIG_ACPI
>> /*
>> @@ -298,9 +299,24 @@ void machine_kexec(struct kimage *image)
>> void *control_page;
>> int save_ftrace_enabled;
>>
>> + /*
>> + * For platforms with TDX "partial write machine check" erratum,
>> + * all TDX private pages need to be converted back to normal
>> + * before booting to the new kernel, otherwise the new kernel
>> + * may get unexpected machine check.
>> + *
>> + * But skip this when preserve_context is on. The second kernel
>> + * shouldn't write to the first kernel's memory anyway. Skipping
>> + * this also avoids killing TDX in the first kernel, which would
>> + * require more complicated handling.
>> + */
>
> This is waaaaaaaaaaaaaaay too much information to stick in a generic
> function. Just call tdx_reset_memory() and make the argument about
Hi Dave,
Thanks for review.
Agreed too much info here. But I don't quite understand your second
sentence here. Can I have your suggestion again?
>
>> #ifdef CONFIG_KEXEC_JUMP
>> if (image->preserve_context)
>> save_processor_state();
>> + else
>> + tdx_reset_memory();
>> +#else
>> + tdx_reset_memory();
>> #endif
>
> Wow, that's awfully hard to read. I really wish folks' gag reflex would
> kick in when they see stuff like this to get them to spend an additional
> 15 seconds to turn this into:
>
> if (IS_ENABLED(CONFIG_KEXEC_JUMP) && image->preserve_context)
> save_processor_state();
> else
> tdx_reset_memory();
>
Yeah this is way better, but as Kirill pointed out we will have build
error when both SUSPEND and HIBERNATION is off.
Maybe below?
#ifdef CONFIG_KEXEC_JUMP
if (image->preserve_context)
save_processor_state();
else
#endif
tdx_reset_memory();
>> save_ftrace_enabled = __ftrace_enabled_save();
>> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
>> index 9f1fed458a32..0537b1b76c2b 100644
>> --- a/arch/x86/virt/vmx/tdx/tdx.c
>> +++ b/arch/x86/virt/vmx/tdx/tdx.c
>> @@ -28,6 +28,7 @@
>> #include <linux/acpi.h>
>> #include <linux/suspend.h>
>> #include <linux/acpi.h>
>> +#include <linux/reboot.h>
>> #include <asm/page.h>
>> #include <asm/special_insns.h>
>> #include <asm/msr-index.h>
>> @@ -54,6 +55,8 @@ static DEFINE_MUTEX(tdx_module_lock);
>> /* All TDX-usable memory regions. Protected by mem_hotplug_lock. */
>> static LIST_HEAD(tdx_memlist);
>>
>> +static bool tdx_rebooting;
>> +
>> typedef void (*sc_err_func_t)(u64 fn, u64 err, struct tdx_module_args *args);
>>
>> static inline void seamcall_err(u64 fn, u64 err, struct tdx_module_args *args)
>> @@ -1187,6 +1190,9 @@ static int __tdx_enable(void)
>> {
>> int ret;
>>
>> + if (tdx_rebooting)
>> + return -EAGAIN;
>
> This whole 'tdx_rebooting' deserves to at least be in its own patch.
> Also -EAGAIN seems to be a really odd return code for a permanent failure.
Will move to a separate patch.
How about just -EINVAL?
>
>> ret = init_tdx_module();
>> if (ret) {
>> pr_err("module initialization failed (%d)\n", ret);
>> @@ -1420,6 +1426,90 @@ static struct notifier_block tdx_memory_nb = {
>> .notifier_call = tdx_memory_notifier,
>> };
>>
>> +/*
>> + * Convert TDX private pages back to normal on platforms with
>> + * "partial write machine check" erratum.
>> + *
>> + * Called from machine_kexec() before booting to the new kernel.
>> + */
>> +void tdx_reset_memory(void)
>> +{
>> + if (!boot_cpu_has(X86_FEATURE_TDX_HOST_PLATFORM))
>> + return;
>> +
>> + /*
>> + * Kernel read/write to TDX private memory doesn't
>> + * cause machine check on hardware w/o this erratum.
>> + */
>> + if (!boot_cpu_has_bug(X86_BUG_TDX_PW_MCE))
>> + return;
>> +
>> + /* Called from kexec() when only rebooting cpu is alive */
>> + WARN_ON_ONCE(num_online_cpus() != 1);
>> +
>> + /*
>> + * tdx_reboot_notifier() waits until ongoing TDX module
>> + * initialization to finish, and module initialization is
>> + * rejected after that. Therefore @tdx_module_status is
>> + * stable here and can be read w/o holding lock.
>> + */
>> + if (tdx_module_status != TDX_MODULE_INITIALIZED)
>> + return;
>
> kexec() can't happen until after reboot notifiers are called.
> tdx_reboot_notifier() makes 'tdx_module_status' stable, so no lock is
> needed.
>
> Right?
Yes. I can replace the comment with your words.
>
>> + /*
>> + * Flush cache of all TDX private memory _before_ converting
>> + * them back to avoid silent memory corruption.
>> + */
>> + native_wbinvd();
>
> Since this is single-threaded, it also needs to mention where all the
> other CPU caches got flushed.
OK will point out.
>
>> + /*
>> + * Convert PAMTs back to normal. All other cpus are already
>> + * dead and TDMRs/PAMTs are stable.
>> + *
>> + * Ideally it's better to cover all types of TDX private pages
>> + * here, but it's impractical:
>> + *
>> + * - There's no existing infrastructure to tell whether a page
>> + * is TDX private memory or not.
>> + *
>> + * - Using SEAMCALL to query TDX module isn't feasible either:
>> + * - VMX has been turned off by reaching here so SEAMCALL
>> + * cannot be made;
>> + * - Even SEAMCALL can be made the result from TDX module may
>
> ^ if ^ a ^,
>
Thanks. Will add.
>> + * not be accurate (e.g., remote CPU can be stopped while
>> + * the kernel is in the middle of reclaiming TDX private
>> + * page and doing MOVDIR64B).
>> + *
>> + * One temporary solution could be just converting all memory
>> + * pages, but it's problematic too, because not all pages are
>> + * mapped as writable in direct mapping. It can be done by
>> + * switching to the identical mapping for kexec() or a new page
>> + * table which maps all pages as writable, but the complexity is
>> + * overkill.
>> + *
>> + * Thus instead of doing something dramatic to convert all pages,
>> + * only convert PAMTs here. Other kernel components which use
>> + * TDX need to do the conversion on their own by intercepting the
>> + * rebooting/shutdown notifier (KVM already does that).
>> + */
>
> I'd leave the extended alternatives discussion in the changelog, not here.
>
> Focus on what _this_ is doing and why it is imperfect:
>
> 1. Just reset the PAMDs
> 2. This leaves A, B, and C undealt with
> 3. The risk of leaving those is ...
>
Agreed. Will do.
>
>> + tdmrs_reset_pamt_all(&tdx_tdmr_list);
>> +}
>> +
>> +static int tdx_reboot_notifier(struct notifier_block *nb, unsigned long mode,
>> + void *unused)
>> +{
>> + /* Wait ongoing TDX initialization to finish */
>> + mutex_lock(&tdx_module_lock);
>> + tdx_rebooting = true;
>> + mutex_unlock(&tdx_module_lock);
>> +
>> + return NOTIFY_OK;
>> +}
>
> Why is 'tdx_rebooting' a new variable instead of a new state in
> tdx_module_status?
>
I think we can but I believe using tdx_rebooting is more clear.
For instance, if we want to add a new state TDX_MODULE_ABORT for this
case, when tdx_reboot_notifier() is called after module initialization,
then the tdx_module_status (which is already TDX_MODULE_INITIALIZED)
will be overwritten and we will lose the information that module
initialization has been done successfully.
We may be able to avoid these using more complicated logic but I think
using a separate tdx_rebooting is straightforward.
Powered by blists - more mailing lists