Message-ID: <89d3000e7a4bd4947080607038a89a4e5c38234b.camel@intel.com>
Date: Wed, 2 Jul 2025 03:06:43 +0000
From: "Huang, Kai" <kai.huang@...el.com>
To: "bp@...en8.de" <bp@...en8.de>
CC: "kvm@...r.kernel.org" <kvm@...r.kernel.org>, "ashish.kalra@....com"
<ashish.kalra@....com>, "Hansen, Dave" <dave.hansen@...el.com>,
"thomas.lendacky@....com" <thomas.lendacky@....com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"seanjc@...gle.com" <seanjc@...gle.com>, "Chatre, Reinette"
<reinette.chatre@...el.com>, "pbonzini@...hat.com" <pbonzini@...hat.com>,
"tglx@...utronix.de" <tglx@...utronix.de>, "Yamahata, Isaku"
<isaku.yamahata@...el.com>, "nik.borisov@...e.com" <nik.borisov@...e.com>,
"mingo@...hat.com" <mingo@...hat.com>, "hpa@...or.com" <hpa@...or.com>,
"peterz@...radead.org" <peterz@...radead.org>, "sagis@...gle.com"
<sagis@...gle.com>, "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>,
"kirill.shutemov@...ux.intel.com" <kirill.shutemov@...ux.intel.com>,
"x86@...nel.org" <x86@...nel.org>, "Williams, Dan J"
<dan.j.williams@...el.com>
Subject: Re: [PATCH v3 1/6] x86/sme: Use percpu boolean to control wbinvd
during kexec
On Tue, 2025-07-01 at 14:12 +0200, Borislav Petkov wrote:
> On Mon, Jun 30, 2025 at 11:34:34AM +0000, Huang, Kai wrote:
> > Yeah I agree the text can be improved. I tried to run AI to simplify but so
> > far I am not quite happy about the result yet. I'll try more.
>
> Ask it to simplify it. I get it that you want to be exhaustive in your commit
> message but there really is such thing as too much text.
>
> Think of it this way: is the text I'm writing optimal when anyone needs to
> read it in the future to know why a change has been done. If not, try to make
> it so.
Yeah this is a good point. Thanks for that.
After working with AI I came up with the below [*]. Basically I:

- Added a "TL;DR" for quick reading, and broke the text into different
  sections (e.g., "Background" ..) so it can be read more easily;
- Removed some trivial things (e.g., the paragraph explaining a comment
  change around 'call depth tracking');
- Simplified the "race" description;
- Simplified some wording to make sentences shorter.

I'd appreciate it if you could take a look and see whether it is OK.

Btw, if we removed the "race" description, it could be trimmed down by
half, but I thought it could still be valuable there in case someone
wants to read it years later.
>
> > Yeah I agree a single u32 + flags is better. However this is the problem in
> > the existing code (this patch only does renaming).
> >
> > I think I can come up with a patch to clean this up and put it as the first
> > patch of this series, or I can do a patch to clean this up after this series
> > (either together with this series, or separately at a later time). Which
> > way do you prefer?
>
> Clean ups go first, so yeah, please do a cleanup pre-patch.
Sure will do. Thanks.
>
> > /*
> > * The cache may be in an incoherent state (e.g., due to memory
> > * encryption) and needs flushing during kexec.
> > */
>
> Better than nothing. I'd try to explain with 1-2 sentences what can happen due
> to memory encryption and why cache invalidation is required. So that the
> comment is standalone and is not sending you on a wild goose chasing ride.
Yeah, agreed that the comment should be standalone. How about the below?

/*
 * The cache may be in an incoherent state and needs flushing during
 * kexec. E.g., on SME/TDX platforms, dirty cacheline aliases with and
 * without the encryption bit(s) can coexist, and the cache needs to be
 * flushed before booting into the new kernel to avoid silent memory
 * corruption due to dirty cachelines with different encryption
 * properties being written back to memory.
 */
>
> > IIUC the X86_FEATURE_SME could be cleared via 'clearcpuid' kernel cmdline.
> >
> > Please also see my reply to Tom.
>
> I know but we have never said that clearcpuid= should be used in production.
Thanks for the info.
> If you break the kernel using it, you get to keep the pieces. clearcpuid=
> taints the kernel and screams bloody murder. So I'm not too worried about
> that.
>
> What is more relevant is this:
>
> "I did verify that booting with mem_encrypt=off will start with
> X86_FEATURE_SME set, the BSP will clear it and then all APs will not see
> it set after that."
>
> which should be put there in the comment.
Yeah agreed. I read the code again and Tom is right :-)

I didn't want to end up rewriting the original comment, so I came up
with the below:
/*
* Mark that WBINVD is needed during kexec on processors that
* support SME. This provides support for performing a successful
* kexec when going from SME inactive to SME active (or vice-versa).
*
* The cache must be cleared so that if there are entries with the
* same physical address, both with and without the encryption bit,
* they don't race each other when flushed and potentially end up
* with the wrong entry being committed to memory.
*
* Test the CPUID bit directly because with mem_encrypt=off the
* BSP will clear the X86_FEATURE_SME bit and the APs will not
* see it set after that.
*/
Does this look good to you?
[*] The updated changelog:
TL;DR:
Use a percpu boolean to control whether to perform WBINVD during kexec,
unifying the cache flushing needed due to memory encryption for both
SME and TDX, and switch SME to use the boolean.
--- Background ---
On SME platforms, dirty cacheline aliases with and without the
encryption bit can coexist, and the CPU can flush them back to memory
in random order. During kexec, the caches must be flushed before
jumping to the new kernel, otherwise the dirty cachelines could
silently corrupt the memory used by the new kernel due to their
different encryption properties.
TDX also needs a cache flush during kexec for the same reason. It would
be good to have a generic way to flush the cache instead of scattering
checks for each feature all around.
When SME is enabled, the kernel encrypts basically all memory,
including the kernel itself, and a simple memory write from the kernel
could dirty cachelines. Currently, the kernel uses WBINVD to flush the
cache for SME during kexec in two places:
1) the one in stop_this_cpu() for all remote CPUs when the kexec-ing CPU
stops them;
2) the one in relocate_kernel() where the kexec-ing CPU jumps to the
new kernel.
--- Solution ---
Unlike SME, TDX can only dirty cachelines when it is used (i.e., when
SEAMCALLs are performed). Since there are no more SEAMCALLs after the
aforementioned WBINVDs, this can be leveraged for TDX.
To unify the approach for SME and TDX, use a percpu boolean to indicate
the cache may be in an incoherent state and needs flushing during kexec,
and set the boolean for SME. TDX can then leverage it.
While SME could use a global flag (since it's enabled at early boot
and on all CPUs), the percpu flag fits TDX better:

The percpu flag can be set when a CPU makes a SEAMCALL, and cleared
when a later WBINVD on that CPU obviates the need for a kexec-time
WBINVD. Skipping the kexec-time WBINVD is valuable, because there is
an existing race[*] where kexec could proceed while another CPU is
still active. WBINVD could make that race worse, so it's worth
skipping when possible.
--- Side effect to SME ---
Today the first WBINVD in stop_this_cpu() is performed when SME is
*supported* by the platform, while the second WBINVD in
relocate_kernel() is done when SME is *activated* by the kernel.
Simplify this by also doing the second WBINVD whenever the platform
supports SME. This allows the kernel to simply turn on the percpu
boolean when bringing up a CPU by checking whether the platform
supports SME.
No other functional change intended.
--- More Read ---
[*] The "race" in native_stop_other_cpus()
Commit

  1f5e7eb7868e ("x86/smp: Make stop_other_cpus() more robust")

introduced a new 'cpus_stop_mask' to resolve an "intermittent lockups
on poweroff" issue which was caused by the WBINVD in stop_this_cpu().
Specifically, the new cpumask resolved the below problem mentioned in
that commit:
    CPU 0                               CPU 1

     stop_other_cpus()
       send_IPIs(REBOOT);               stop_this_cpu()
       while (num_online_cpus() > 1);     set_online(false);
       proceed... -> hang
                                          wbinvd()
While it fixed the reported issue, that commit also noted that the new
cpumask "cannot plug all holes either", and a "race" could still
happen:
    CPU 0                                  CPU 1

     native_stop_other_cpus()              stop_this_cpu()

       // sends REBOOT IPI to stop remote CPUs
                                             ...
                                             wbinvd();

       // wait timesout, try NMI
       if (!cpumask_empty(&cpus_stop_mask)) {
         for_each_cpu(cpu, &cpus_stop_mask) {

                                             ...
                                             cpumask_clear_cpu(cpu,
                                                 &cpus_stop_mask);
                                             hlt;

           // send NMI --->
                                             wakeup from hlt and run
                                             stop_this_cpu():

           // WAIT CPUs TO STOP
           while (!cpumask_empty(
               &cpus_stop_mask) && ...)
         }
       }
       ...
       proceed ...                           wbinvd();
                                             ...
                                             hlt;
The "WAIT CPUs TO STOP" is supposed to wait until all remote CPUs have
stopped, but it actually returns immediately, because the remote CPUs
had already been cleared from cpus_stop_mask when stop_this_cpu() was
called from the REBOOT IPI.

Doing WBINVD in stop_this_cpu() could potentially increase the chance
of triggering the above "race", despite it still being rare.