lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Wed, 10 Apr 2024 11:21:22 -0500
From: Tom Lendacky <thomas.lendacky@....com>
To: Kai Huang <kai.huang@...el.com>, linux-kernel@...r.kernel.org
Cc: x86@...nel.org, dave.hansen@...el.com, bp@...en8.de,
 kirill.shutemov@...ux.intel.com, tglx@...utronix.de, mingo@...hat.com,
 hpa@...or.com, luto@...nel.org, peterz@...radead.org,
 rick.p.edgecombe@...el.com, ashish.kalra@....com, chao.gao@...el.com,
 bhe@...hat.com, nik.borisov@...e.com, pbonzini@...hat.com, seanjc@...gle.com
Subject: Re: [PATCH v3 2/5] x86/kexec: do unconditional WBINVD for bare-metal
 in relocate_kernel()

On 4/7/24 07:44, Kai Huang wrote:
> Both SME and TDX can leave caches in incoherent state due to memory
> encryption.  During kexec, the caches must be flushed before jumping to
> the second kernel to avoid silent memory corruption to the second kernel.
> 
> During kexec, the WBINVD in stop_this_cpu() flushes caches for all
> remote cpus when they are being stopped.  For SME, the WBINVD in
> relocate_kernel() flushes the cache for the last running cpu (which is
> executing the kexec).
> 
> Similarly, to support kexec for TDX host, after stopping all remote cpus
> with cache flushed, the kernel needs to flush cache for the last running
> cpu.
> 
> Use the existing WBINVD in relocate_kernel() to cover TDX host as well.
> 
> However, instead of sprinkling around vendor-specific checks, just do
> unconditional WBINVD to cover both SME and TDX.  Kexec is not a fast path
> so having one additional WBINVD for platforms w/o SME/TDX is acceptable.
> 
> But only do WBINVD for bare-metal because TDX guests and SEV-ES/SEV-SNP
> guests will get unexpected (and yet unnecessary) #VE which the kernel is

s/#VE/#VE or #VC/

> unable to handle at this stage.
> 
> Signed-off-by: Kai Huang <kai.huang@...el.com>
> Cc: Tom Lendacky <thomas.lendacky@....com>
> Cc: Dave Young <dyoung@...hat.com>
> ---
> 
> v2 -> v3:
>   - Change to only do WBINVD for bare metal
> 
> ---
>   arch/x86/include/asm/kexec.h         |  2 +-
>   arch/x86/kernel/machine_kexec_64.c   |  2 +-
>   arch/x86/kernel/relocate_kernel_64.S | 14 +++++++++-----
>   3 files changed, 11 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
> index 91ca9a9ee3a2..455f8a6c66a9 100644
> --- a/arch/x86/include/asm/kexec.h
> +++ b/arch/x86/include/asm/kexec.h
> @@ -128,7 +128,7 @@ relocate_kernel(unsigned long indirection_page,
>   		unsigned long page_list,
>   		unsigned long start_address,
>   		unsigned int preserve_context,
> -		unsigned int host_mem_enc_active);
> +		unsigned int bare_metal);
>   #endif
>   
>   #define ARCH_HAS_KIMAGE_ARCH
> diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
> index b180d8e497c3..a454477b7b4c 100644
> --- a/arch/x86/kernel/machine_kexec_64.c
> +++ b/arch/x86/kernel/machine_kexec_64.c
> @@ -358,7 +358,7 @@ void machine_kexec(struct kimage *image)
>   				       (unsigned long)page_list,
>   				       image->start,
>   				       image->preserve_context,
> -				       cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT));
> +				       !boot_cpu_has(X86_FEATURE_HYPERVISOR));
>   
>   #ifdef CONFIG_KEXEC_JUMP
>   	if (image->preserve_context)
> diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
> index 56cab1bb25f5..3e04c5e5687f 100644
> --- a/arch/x86/kernel/relocate_kernel_64.S
> +++ b/arch/x86/kernel/relocate_kernel_64.S
> @@ -50,7 +50,7 @@ SYM_CODE_START_NOALIGN(relocate_kernel)
>   	 * %rsi page_list
>   	 * %rdx start address
>   	 * %rcx preserve_context
> -	 * %r8  host_mem_enc_active
> +	 * %r8  bare_metal
>   	 */
>   
>   	/* Save the CPU context, used for jumping back */
> @@ -78,7 +78,7 @@ SYM_CODE_START_NOALIGN(relocate_kernel)
>   	pushq $0
>   	popfq
>   
> -	/* Save SME active flag */
> +	/* Save the bare_metal */

Either "Save bare_metal" or "Save the bare_metal flag"

>   	movq	%r8, %r12
>   
>   	/*
> @@ -160,9 +160,13 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
>   	movq	%r9, %cr3
>   
>   	/*
> -	 * If SME is active, there could be old encrypted cache line
> -	 * entries that will conflict with the now unencrypted memory
> -	 * used by kexec. Flush the caches before copying the kernel.
> +	 * The kernel could leave caches in incoherent state on SME/TDX
> +	 * capable platforms.  Just do unconditional WBINVD to avoid
> +	 * silent memory corruption to the new kernel for these platforms.
> +	 *
> +	 * But only do WBINVD for bare-metal because TDX guests and
> +	 * SEV-ES/SEV-SNP guests will get #VE which the kernel is unable
> +	 * to handle at this stage.

Similar comment here about doing an unconditional WBINVD, but then 
qualifying that as only on guests. This is where talking about how 
exception handling has been torn down would be good.

So for the confusion around the WBINVD in patches 1 and 2.

Thanks,
Tom

>   	 */
>   	testq	%r12, %r12
>   	jz 1f

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ