[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <40b14084af8a35af4e07fbd394821f92d0973d32.camel@intel.com>
Date: Fri, 14 Mar 2025 09:44:53 +0000
From: "Huang, Kai" <kai.huang@...el.com>
To: "tglx@...utronix.de" <tglx@...utronix.de>, "peterz@...radead.org"
<peterz@...radead.org>, "mingo@...hat.com" <mingo@...hat.com>, "Hansen, Dave"
<dave.hansen@...el.com>, "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>,
"bp@...en8.de" <bp@...en8.de>, "kirill.shutemov@...ux.intel.com"
<kirill.shutemov@...ux.intel.com>
CC: "ashish.kalra@....com" <ashish.kalra@....com>, "dyoung@...hat.com"
<dyoung@...hat.com>, "thomas.lendacky@....com" <thomas.lendacky@....com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"seanjc@...gle.com" <seanjc@...gle.com>, "dwmw@...zon.co.uk"
<dwmw@...zon.co.uk>, "pbonzini@...hat.com" <pbonzini@...hat.com>,
"bhe@...hat.com" <bhe@...hat.com>, "Yamahata, Isaku"
<isaku.yamahata@...el.com>, "nik.borisov@...e.com" <nik.borisov@...e.com>,
"Chatre, Reinette" <reinette.chatre@...el.com>, "hpa@...or.com"
<hpa@...or.com>, "sagis@...gle.com" <sagis@...gle.com>,
"david.kaplan@....com" <david.kaplan@....com>, "x86@...nel.org"
<x86@...nel.org>, "Williams, Dan J" <dan.j.williams@...el.com>
Subject: Re: [RFC PATCH 2/5] x86/kexec: Do unconditional WBINVD for bare-metal
in relocate_kernel()
On Thu, 2025-03-13 at 23:17 +0000, Edgecombe, Rick P wrote:
> On Thu, 2025-03-13 at 00:34 +1300, Kai Huang wrote:
> > For both SME and TDX, dirty cachelines with and without the encryption
> > bit(s) of the same physical memory address can coexist and the CPU can
> > flush them back to memory in random order.
> >
>
> A lot going on in this sentence, how about simplifying it:
>
> For SME and TDX, multiple dirty cachelines for the same memory can co-exist, and
> the CPU can flush them back to memory in a random order.
"multiple" isn't accurate at least for SME. How about:
For SME and TDX, dirty cachelines with and without encryption bit(s) of
the same memory can coexist, and the CPU can flush them back to memory
in random order.
>
>
> > During kexec, the caches
> > must be flushed before jumping to the new kernel to avoid silent memory
> > corruption to the new kernel.
>
> During kexec, the caches must be flushed before jumping to the new kernel to
> avoid silent memory corruption when a cacheline with a different encryption
> property is written back over whatever encryption properties the new kernel is
> using.
>
> ...it distributes some of the details from the first sentence into the second.
> Easier to read or no? I'm not sure.
I don't have opinion. I see no difference.
I tends to keep the original words since people have reviewed.
>
> >
> > The WBINVD in stop_this_cpu() flushes caches for all remote CPUs when
> > they are being stopped. For SME, the WBINVD in relocate_kernel()
> > flushes the cache for the last running CPU (which is doing kexec).
> >
> > Similarly, to support kexec for TDX host, after stopping all remote CPUs
> > with cache flushed, the kernel needs to flush cache for the last running
> > CPU.
>
>
> I mentioned this in a previous version. I think you need to give some hint to
> where you are going before you start listing facts. Like:
>
> During kexec, WBINVD needs to be executed on each CPU for TDX and SME.
>
> On the
> remote CPUs this is covered in stop_this_cpu() for both TDX and SME. For the
> kexecing CPU, SME handles this in relocate_kernel(). This leaves the TDX case
> for the kexec-ing CPU still to implement.
>
> ...it first says the overall problem to solve, then explains what is missing in
> the current code to solve it. The reader is already thinking of what the
> solutions should be and...
Will do.
>
> >
> > Use the existing WBINVD in relocate_kernel() to cover TDX host as well.
>
> ...they read the solution just as they are wondering about it. The reader can
> feel like the change is aligned with their thinking.
>
> >
> > Just do unconditional
> >
>
> "Unconditional". Now I'm starting to think about how unconditional wbinvd will
> be.
>
> > WBINVD to cover both SME and TDX instead of
> > sprinkling additional vendor-specific checks. Kexec is a slow path, and
> > the additional WBINVD is acceptable for the sake of simplicity and
> > maintainability.
> >
> > But only do WBINVD for bare-metal
> >
>
> But wait, now I'm learning it's not unconditional. I need to re-think what I
> just evaluated. And I'm doubting the earlier statements because I just got
> surprised.
Do you mean you got surprised by saying "unconditional" first and then saying
"for bare-metal"? This is literally what the patch title says. I don't see any
problem, but I can mentioned the "for bare-metal" part when I say
"unconditional" above.
>
> > because TDX guests and SEV-ES/SEV-SNP
> > guests will get unexpected (and yet unnecessary) exception (#VE or #VC)
> > which the kernel is unable to handle at the time of relocate_kernel()
> > since the kernel has torn down the IDT.
> >
> > Remove the host_mem_enc_active local variable and directly use
> > !cpu_feature_enabled(X86_FEATURE_HYPERVISOR) as an argument of calling
> > relocate_kernel().
> >
>
> Start with the problem here. It just describes another change and I'm not sure
> why when I start reading it.
>
> By problem I mean that host_mem_enc_active doesn't fit the conditional anymore,
> so it needs to be changed.
>
> > cpu_feature_enabled() is always inline but not a
>
> I was just noticing this on the other patch. Actually it could call into some
> kasan stuff.
Can you be more specific since I am not seeing it.
>
> > function call, thus it is safe to use after load_segments() when call
> > depth tracking is enabled.
>
> This function call tracking stuff is a wild card at the end. What about
> describing the rules this function needs to follow due to call depth tracking,
> and explain why the change does that.
Below is what I had before. Do you think it's better? I replaced them with the
current one since Reinette commented the original one (which contains history
etc) was not necessary.
"
Commit 93c1800b3799 ("x86/kexec: Fix bug with call depth tracking")
moved calling 'cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT)' as an argument
of relocate_kernel() to an earlier place before load_segments() by
adding a variable 'host_mem_enc_active'. The reason was the call to
cc_platform_has() after load_segments() caused a fault and system crash
when call depth tracking is active because load_segments() resets GS to
0 but call depth tracking uses per-CPU variable to operate.
Use !cpu_feature_enabled(X86_FEATURE_HYPERVISOR) to check whether the
kernel runs on bare-metal. cpu_feature_enabled() is always inline but
not a function call, thus it is safe to use it after load_segments()
when call depth tracking is enabled. Remove the 'host_mem_enc_active'
variable and use cpu_feature_enabled() directly as the argument when
calling relocate_kernel().
"
[...]
> > --- a/arch/x86/kernel/machine_kexec_64.c
> > +++ b/arch/x86/kernel/machine_kexec_64.c
> > @@ -346,16 +346,9 @@ void __nocfi machine_kexec(struct kimage *image)
> > {
> > unsigned long reloc_start = (unsigned long)__relocate_kernel_start;
> > relocate_kernel_fn *relocate_kernel_ptr;
> > - unsigned int host_mem_enc_active;
> > int save_ftrace_enabled;
> > void *control_page;
> >
> > - /*
> > - * This must be done before load_segments() since if call depth tracking
> > - * is used then GS must be valid to make any function calls.
> > - */
> > - host_mem_enc_active = cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT);
> > -
> > #ifdef CONFIG_KEXEC_JUMP
> > if (image->preserve_context)
> > save_processor_state();
> > @@ -398,6 +391,11 @@ void __nocfi machine_kexec(struct kimage *image)
> > *
> > * I take advantage of this here by force loading the
> > * segments, before I zap the gdt with an invalid value.
> > + *
> > + * load_segments() resets GS to 0. Don't make any function call
> > + * after here since call depth tracking uses per-CPU variables to
> > + * operate (relocate_kernel() is explicitly ignored by call depth
> > + * tracking).
>
> I think I suggested you should call out the opportunistic change here in the
> log. Did you disagree?
I replied this was suggested by David Kaplan, but I guess I forgot to reply the
"opportunistic" part.
I don't think this is opportunistic change. It's a valid comment after the
'host_mem_enc_active' variable and the comment around it were removed.
Powered by blists - more mailing lists