Message-ID: <a5948fe2-afbb-4b8c-a83c-0be55a823c53@intel.com>
Date: Thu, 11 Apr 2024 10:55:41 +1200
From: "Huang, Kai" <kai.huang@...el.com>
To: Tom Lendacky <thomas.lendacky@....com>, <linux-kernel@...r.kernel.org>
CC: <x86@...nel.org>, <dave.hansen@...el.com>, <bp@...en8.de>,
<kirill.shutemov@...ux.intel.com>, <tglx@...utronix.de>, <mingo@...hat.com>,
<hpa@...or.com>, <luto@...nel.org>, <peterz@...radead.org>,
<rick.p.edgecombe@...el.com>, <ashish.kalra@....com>, <chao.gao@...el.com>,
<bhe@...hat.com>, <nik.borisov@...e.com>, <pbonzini@...hat.com>,
<seanjc@...gle.com>
Subject: Re: [PATCH v3 2/5] x86/kexec: do unconditional WBINVD for bare-metal
in relocate_kernel()
On 11/04/2024 4:21 am, Tom Lendacky wrote:
> On 4/7/24 07:44, Kai Huang wrote:
>> Both SME and TDX can leave caches in an incoherent state due to memory
>> encryption. During kexec, the caches must be flushed before jumping to
>> the second kernel to avoid silent memory corruption of the second kernel.
>>
>> During kexec, the WBINVD in stop_this_cpu() flushes caches for all
>> remote cpus when they are being stopped. For SME, the WBINVD in
>> relocate_kernel() flushes the cache for the last running cpu (which is
>> executing the kexec).
>>
>> Similarly, to support kexec for the TDX host, after stopping all remote
>> cpus and flushing their caches, the kernel needs to flush the cache of
>> the last running cpu.
>>
>> Use the existing WBINVD in relocate_kernel() to cover TDX host as well.
>>
>> However, instead of sprinkling vendor-specific checks around, just do an
>> unconditional WBINVD to cover both SME and TDX. Kexec is not a fast path,
>> so one additional WBINVD on platforms without SME/TDX is acceptable.
>>
>> But only do WBINVD for bare-metal because TDX guests and SEV-ES/SEV-SNP
>> guests will get an unexpected (and unnecessary) #VE which the kernel is
>
> s/#VE/#VE or #VC/
Will do "exception (#VE or #VC)".
>
>> unable to handle at this stage.
>>
>> Signed-off-by: Kai Huang <kai.huang@...el.com>
>> Cc: Tom Lendacky <thomas.lendacky@....com>
>> Cc: Dave Young <dyoung@...hat.com>
>> ---
>>
>> v2 -> v3:
>> - Change to only do WBINVD for bare metal
>>
>> ---
>> arch/x86/include/asm/kexec.h | 2 +-
>> arch/x86/kernel/machine_kexec_64.c | 2 +-
>> arch/x86/kernel/relocate_kernel_64.S | 14 +++++++++-----
>> 3 files changed, 11 insertions(+), 7 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
>> index 91ca9a9ee3a2..455f8a6c66a9 100644
>> --- a/arch/x86/include/asm/kexec.h
>> +++ b/arch/x86/include/asm/kexec.h
>> @@ -128,7 +128,7 @@ relocate_kernel(unsigned long indirection_page,
>> unsigned long page_list,
>> unsigned long start_address,
>> unsigned int preserve_context,
>> - unsigned int host_mem_enc_active);
>> + unsigned int bare_metal);
>> #endif
>> #define ARCH_HAS_KIMAGE_ARCH
>> diff --git a/arch/x86/kernel/machine_kexec_64.c
>> b/arch/x86/kernel/machine_kexec_64.c
>> index b180d8e497c3..a454477b7b4c 100644
>> --- a/arch/x86/kernel/machine_kexec_64.c
>> +++ b/arch/x86/kernel/machine_kexec_64.c
>> @@ -358,7 +358,7 @@ void machine_kexec(struct kimage *image)
>> (unsigned long)page_list,
>> image->start,
>> image->preserve_context,
>> - cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT));
>> + !boot_cpu_has(X86_FEATURE_HYPERVISOR));
>> #ifdef CONFIG_KEXEC_JUMP
>> if (image->preserve_context)
>> diff --git a/arch/x86/kernel/relocate_kernel_64.S
>> b/arch/x86/kernel/relocate_kernel_64.S
>> index 56cab1bb25f5..3e04c5e5687f 100644
>> --- a/arch/x86/kernel/relocate_kernel_64.S
>> +++ b/arch/x86/kernel/relocate_kernel_64.S
>> @@ -50,7 +50,7 @@ SYM_CODE_START_NOALIGN(relocate_kernel)
>> * %rsi page_list
>> * %rdx start address
>> * %rcx preserve_context
>> - * %r8 host_mem_enc_active
>> + * %r8 bare_metal
>> */
>> /* Save the CPU context, used for jumping back */
>> @@ -78,7 +78,7 @@ SYM_CODE_START_NOALIGN(relocate_kernel)
>> pushq $0
>> popfq
>> - /* Save SME active flag */
>> + /* Save the bare_metal */
>
> Either "Save bare_metal" or "Save the bare_metal flag"
Will use the latter. Thanks.
>
>> movq %r8, %r12
>> /*
>> @@ -160,9 +160,13 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
>> movq %r9, %cr3
>> /*
>> - * If SME is active, there could be old encrypted cache line
>> - * entries that will conflict with the now unencrypted memory
>> - * used by kexec. Flush the caches before copying the kernel.
>> + * The kernel could leave caches in incoherent state on SME/TDX
>> + * capable platforms. Just do unconditional WBINVD to avoid
>> + * silent memory corruption to the new kernel for these platforms.
>> + *
>> + * But only do WBINVD for bare-metal because TDX guests and
>> + * SEV-ES/SEV-SNP guests will get #VE which the kernel is unable
>> + * to handle at this stage.
>
> Similar comment here about doing an unconditional WBINVD, but then
> qualifying that as only on guests. This is where talking about how
> exception handling has been torn down would be good.
>
OK.
Thinking again, it might also be a good idea not to lose the existing
comment for SME, because it justifies why we do the WBINVD _HERE_, I
suppose.
How about below?
/*
 * The kernel could leave caches in an incoherent state on SME/TDX
 * capable platforms.  Flush the cache to avoid silent memory
 * corruption of the new kernel on these platforms.
 *
 * For SME, the cache must be flushed here, before copying the
 * kernel.  When SME is active, there could be old encrypted cache
 * line entries that will conflict with the now unencrypted memory
 * used by kexec.
 *
 * Do WBINVD only for bare-metal to cover both SME and TDX, as it
 * is not safe to do WBINVD in TDX and SEV-ES/SEV-SNP guests.
 * WBINVD results in an exception (#VE or #VC) in these guests, and
 * at this stage the kernel can no longer handle such an exception
 * because it has torn down the IDT.
 */
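
For reference, a rough sketch (not the actual patch) of how that comment
would sit around the flag check in identity_mapped(), assuming %r12 still
holds the bare_metal value saved earlier from %r8 (which machine_kexec()
now passes as !boot_cpu_has(X86_FEATURE_HYPERVISOR)); the label name below
is only illustrative:

	/* (comment above goes here) */
	testq	%r12, %r12		/* bare_metal flag saved from %r8 */
	jz	.Lskip_wbinvd		/* guest: WBINVD would raise #VE/#VC */
	wbinvd
.Lskip_wbinvd:

i.e. as I read the diff, the conditional jump itself stays as-is; only the
flag's meaning and the comment change.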