linux-kernel - Re: [PATCH v5 7/7] KVM: TDX: Explicitly do WBINVD when no more TDX SEAMCALLs

Message-ID: <aIx7Qlpi1Y/VsRVY@intel.com>
Date: Fri, 1 Aug 2025 16:30:58 +0800
From: Chao Gao <chao.gao@...el.com>
To: Kai Huang <kai.huang@...el.com>
CC: <dave.hansen@...el.com>, <bp@...en8.de>, <tglx@...utronix.de>,
	<peterz@...radead.org>, <mingo@...hat.com>, <hpa@...or.com>,
	<thomas.lendacky@....com>, <x86@...nel.org>, <kas@...nel.org>,
	<rick.p.edgecombe@...el.com>, <dwmw@...zon.co.uk>,
	<linux-kernel@...r.kernel.org>, <pbonzini@...hat.com>, <seanjc@...gle.com>,
	<kvm@...r.kernel.org>, <reinette.chatre@...el.com>,
	<isaku.yamahata@...el.com>, <dan.j.williams@...el.com>,
	<ashish.kalra@....com>, <nik.borisov@...e.com>, <sagis@...gle.com>, "Farrah
 Chen" <farrah.chen@...el.com>, Binbin Wu <binbin.wu@...ux.intel.com>
Subject: Re: [PATCH v5 7/7] KVM: TDX: Explicitly do WBINVD when no more TDX
 SEAMCALLs

On Tue, Jul 29, 2025 at 12:28:41AM +1200, Kai Huang wrote:
>On TDX platforms, during kexec, the kernel needs to make sure there are
>no dirty cachelines of TDX private memory before booting to the new
>kernel to avoid silent memory corruption to the new kernel.
>
>During kexec, the kexec-ing CPU firstly invokes native_stop_other_cpus()
>to stop all remote CPUs before booting to the new kernel.  The remote
>CPUs will then execute stop_this_cpu() to stop themselves.
>
>The kernel has a percpu boolean to indicate whether the cache of a CPU
>may be in incoherent state.  In stop_this_cpu(), the kernel does WBINVD
>if that percpu boolean is true.
>
>TDX turns on that percpu boolean on a CPU when the kernel does SEAMCALL.
>This makes sure the caches will be flushed during kexec.
>
>However, the native_stop_other_cpus() and stop_this_cpu() have a "race"
>which is extremely rare to happen but could cause the system to hang.
>
>Specifically, the native_stop_other_cpus() firstly sends normal reboot
>IPI to remote CPUs and waits one second for them to stop.  If that times
>out, native_stop_other_cpus() then sends NMIs to remote CPUs to stop
>them.
>
>The aforementioned race happens when NMIs are sent.  Doing WBINVD in
>stop_this_cpu() makes each CPU take longer to stop and increases
>the chance of the race happening.
>
>Explicitly flush cache in tdx_disable_virtualization_cpu() after which
>no more TDX activity can happen on this cpu.  This moves the WBINVD to
>an earlier stage than stop_this_cpu(), avoiding a possibly lengthy
>operation at a time where it could cause this race.
>
>Signed-off-by: Kai Huang <kai.huang@...el.com>
>Acked-by: Paolo Bonzini <pbonzini@...hat.com>
>Tested-by: Farrah Chen <farrah.chen@...el.com>
>Reviewed-by: Binbin Wu <binbin.wu@...ux.intel.com>
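
For readers following the thread, the flow described in the changelog can be
modelled in plain userspace C. This is only a sketch under stated
assumptions, not the patch itself: every name below (cache_incoherent[],
model_wbinvd(), model_seamcall(), and the two model_*() paths) is a
hypothetical stand-in, and the real per-CPU variable and WBINVD instruction
are replaced by an array and a counter.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4

/* Hypothetical stand-in for the per-CPU "cache may be incoherent" boolean. */
static bool cache_incoherent[NR_CPUS];
/* Counts modelled cache flushes per CPU; real code would execute WBINVD. */
static int wbinvd_count[NR_CPUS];

static void check_cpu(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
}

/* Model of a full cache flush: count it and clear the incoherent flag. */
static void model_wbinvd(int cpu)
{
	check_cpu(cpu);
	wbinvd_count[cpu]++;
	cache_incoherent[cpu] = false;
}

/* Model of a SEAMCALL: afterwards this CPU's cache may be incoherent. */
static void model_seamcall(int cpu)
{
	check_cpu(cpu);
	cache_incoherent[cpu] = true;
}

/*
 * Model of the early flush: once virtualization is disabled on this CPU,
 * no more TDX activity can happen on it, so the flush can be done here.
 */
static void model_tdx_disable_virtualization_cpu(int cpu)
{
	check_cpu(cpu);
	if (cache_incoherent[cpu])
		model_wbinvd(cpu);
}

/*
 * Model of the stop path: flush only if the flag is still set.
 * Returns true when the (possibly lengthy) flush had to run here.
 */
static bool model_stop_this_cpu(int cpu)
{
	check_cpu(cpu);
	if (cache_incoherent[cpu]) {
		model_wbinvd(cpu);
		return true;
	}
	return false;
}
```

In this model, a CPU that went through model_tdx_disable_virtualization_cpu()
reaches model_stop_this_cpu() with the flag already clear, so the stop path
does no flush. A CPU that skipped the early step still flushes on the stop
path, which is the slower case the changelog wants to avoid.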

Flushing cache after disabling virtualization looks clean. So,

Reviewed-by: Chao Gao <chao.gao@...el.com>