lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <a5948fe2-afbb-4b8c-a83c-0be55a823c53@intel.com>
Date: Thu, 11 Apr 2024 10:55:41 +1200
From: "Huang, Kai" <kai.huang@...el.com>
To: Tom Lendacky <thomas.lendacky@....com>, <linux-kernel@...r.kernel.org>
CC: <x86@...nel.org>, <dave.hansen@...el.com>, <bp@...en8.de>,
	<kirill.shutemov@...ux.intel.com>, <tglx@...utronix.de>, <mingo@...hat.com>,
	<hpa@...or.com>, <luto@...nel.org>, <peterz@...radead.org>,
	<rick.p.edgecombe@...el.com>, <ashish.kalra@....com>, <chao.gao@...el.com>,
	<bhe@...hat.com>, <nik.borisov@...e.com>, <pbonzini@...hat.com>,
	<seanjc@...gle.com>
Subject: Re: [PATCH v3 2/5] x86/kexec: do unconditional WBINVD for bare-metal
 in relocate_kernel()



On 11/04/2024 4:21 am, Tom Lendacky wrote:
> On 4/7/24 07:44, Kai Huang wrote:
>> Both SME and TDX can leave caches in incoherent state due to memory
>> encryption.  During kexec, the caches must be flushed before jumping to
>> the second kernel to avoid silent memory corruption to the second kernel.
>>
>> During kexec, the WBINVD in stop_this_cpu() flushes caches for all
>> remote cpus when they are being stopped.  For SME, the WBINVD in
>> relocate_kernel() flushes the cache for the last running cpu (which is
>> executing the kexec).
>>
>> Similarly, to support kexec for TDX host, after stopping all remote cpus
>> with cache flushed, the kernel needs to flush cache for the last running
>> cpu.
>>
>> Use the existing WBINVD in relocate_kernel() to cover TDX host as well.
>>
>> However, instead of sprinkling around vendor-specific checks, just do
>> unconditional WBINVD to cover both SME and TDX.  Kexec is not a fast path
>> so having one additional WBINVD for platforms w/o SME/TDX is acceptable.
>>
>> But only do WBINVD for bare-metal because TDX guests and SEV-ES/SEV-SNP
>> guests will get unexpected (and yet unnecessary) #VE which the kernel is
> 
> s/#VE/#VE or #VC/

Will do "exception (#VE or #VC)".

> 
>> unable to handle at this stage.
>>
>> Signed-off-by: Kai Huang <kai.huang@...el.com>
>> Cc: Tom Lendacky <thomas.lendacky@....com>
>> Cc: Dave Young <dyoung@...hat.com>
>> ---
>>
>> v2 -> v3:
>>   - Change to only do WBINVD for bare metal
>>
>> ---
>>   arch/x86/include/asm/kexec.h         |  2 +-
>>   arch/x86/kernel/machine_kexec_64.c   |  2 +-
>>   arch/x86/kernel/relocate_kernel_64.S | 14 +++++++++-----
>>   3 files changed, 11 insertions(+), 7 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
>> index 91ca9a9ee3a2..455f8a6c66a9 100644
>> --- a/arch/x86/include/asm/kexec.h
>> +++ b/arch/x86/include/asm/kexec.h
>> @@ -128,7 +128,7 @@ relocate_kernel(unsigned long indirection_page,
>>           unsigned long page_list,
>>           unsigned long start_address,
>>           unsigned int preserve_context,
>> -        unsigned int host_mem_enc_active);
>> +        unsigned int bare_metal);
>>   #endif
>>   #define ARCH_HAS_KIMAGE_ARCH
>> diff --git a/arch/x86/kernel/machine_kexec_64.c 
>> b/arch/x86/kernel/machine_kexec_64.c
>> index b180d8e497c3..a454477b7b4c 100644
>> --- a/arch/x86/kernel/machine_kexec_64.c
>> +++ b/arch/x86/kernel/machine_kexec_64.c
>> @@ -358,7 +358,7 @@ void machine_kexec(struct kimage *image)
>>                          (unsigned long)page_list,
>>                          image->start,
>>                          image->preserve_context,
>> -                       cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT));
>> +                       !boot_cpu_has(X86_FEATURE_HYPERVISOR));
>>   #ifdef CONFIG_KEXEC_JUMP
>>       if (image->preserve_context)
>> diff --git a/arch/x86/kernel/relocate_kernel_64.S 
>> b/arch/x86/kernel/relocate_kernel_64.S
>> index 56cab1bb25f5..3e04c5e5687f 100644
>> --- a/arch/x86/kernel/relocate_kernel_64.S
>> +++ b/arch/x86/kernel/relocate_kernel_64.S
>> @@ -50,7 +50,7 @@ SYM_CODE_START_NOALIGN(relocate_kernel)
>>        * %rsi page_list
>>        * %rdx start address
>>        * %rcx preserve_context
>> -     * %r8  host_mem_enc_active
>> +     * %r8  bare_metal
>>        */
>>       /* Save the CPU context, used for jumping back */
>> @@ -78,7 +78,7 @@ SYM_CODE_START_NOALIGN(relocate_kernel)
>>       pushq $0
>>       popfq
>> -    /* Save SME active flag */
>> +    /* Save the bare_metal */
> 
> Either "Save bare_metal" or "Save the bare_metal flag"

Will use the latter.  Thanks.

> 
>>       movq    %r8, %r12
>>       /*
>> @@ -160,9 +160,13 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
>>       movq    %r9, %cr3
>>       /*
>> -     * If SME is active, there could be old encrypted cache line
>> -     * entries that will conflict with the now unencrypted memory
>> -     * used by kexec. Flush the caches before copying the kernel.
>> +     * The kernel could leave caches in incoherent state on SME/TDX
>> +     * capable platforms.  Just do unconditional WBINVD to avoid
>> +     * silent memory corruption to the new kernel for these platforms.
>> +     *
>> +     * But only do WBINVD for bare-metal because TDX guests and
>> +     * SEV-ES/SEV-SNP guests will get #VE which the kernel is unable
>> +     * to handle at this stage.
> 
> Similar comment here about doing an unconditional WBINVD, but then 
> qualifying that as only on guests. This is where talking about how 
> exception handling has been torn down would be good.
> 

OK.

Thinking again, also it might be a good idea to not lose the existing 
comment for SME, because it somehow justifies why we do WBINVD _HERE_ I 
suppose?

How about below?

     /*
      * The kernel could leave caches in incoherent state on SME/TDX
      * capable platforms.  Flush cache to avoid silent memory
      * corruption for these platforms.
      *
      * For SME, need to flush cache here before copying the kernel.
      * When it is active, there could be old encrypted cache line
      * entries that will conflict with the now unencrypted memory
      * used by kexec.
      *
      * Do WBINVD for bare-metal to cover both SME and TDX, as it's
      * not safe to do WBINVD for TDX and SEV-ES/SEV-SNP guests.
      * WBINVD results in exception (#VE or #VC) for these guests, and
      * at this stage kernel is not able to handle such exception any
      * more because the kernel has torn down IDT.
      */

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ