linux-kernel - Re: [PATCH v3 6/6] KVM: TDX: Explicitly do WBINVD upon reboot notifier

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <1fee2f710a69ac678c17835519bed352e4224ff2.camel@intel.com>
Date: Wed, 2 Jul 2025 09:22:43 +0000
From: "Huang, Kai" <kai.huang@...el.com>
To: "Gao, Chao" <chao.gao@...el.com>
CC: "kvm@...r.kernel.org" <kvm@...r.kernel.org>, "ashish.kalra@....com"
	<ashish.kalra@....com>, "Hansen, Dave" <dave.hansen@...el.com>,
	"thomas.lendacky@....com" <thomas.lendacky@....com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"mingo@...hat.com" <mingo@...hat.com>, "seanjc@...gle.com"
	<seanjc@...gle.com>, "pbonzini@...hat.com" <pbonzini@...hat.com>,
	"tglx@...utronix.de" <tglx@...utronix.de>, "Yamahata, Isaku"
	<isaku.yamahata@...el.com>, "kirill.shutemov@...ux.intel.com"
	<kirill.shutemov@...ux.intel.com>, "Chatre, Reinette"
	<reinette.chatre@...el.com>, "nik.borisov@...e.com" <nik.borisov@...e.com>,
	"hpa@...or.com" <hpa@...or.com>, "peterz@...radead.org"
	<peterz@...radead.org>, "sagis@...gle.com" <sagis@...gle.com>, "Chen, Farrah"
	<farrah.chen@...el.com>, "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>,
	"bp@...en8.de" <bp@...en8.de>, "x86@...nel.org" <x86@...nel.org>, "Williams,
 Dan J" <dan.j.williams@...el.com>
Subject: Re: [PATCH v3 6/6] KVM: TDX: Explicitly do WBINVD upon reboot
 notifier

On Wed, 2025-07-02 at 15:54 +0800, Chao Gao wrote:
> > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > @@ -1870,3 +1870,12 @@ u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct page *page)
> > 	return seamcall(TDH_PHYMEM_PAGE_WBINVD, &args);
> > }
> > EXPORT_SYMBOL_GPL(tdh_phymem_page_wbinvd_hkid);
> > +
> > +void tdx_cpu_flush_cache(void)
> > +{
> > +	lockdep_assert_preemption_disabled();
> > +
> > +	wbinvd();
> 
> Shouldn't you check the per-CPU variable first? so that WBINVD can be
> skipped if there is nothing incoherent.

It is assumed the caller of this function knows that cache needs to be
flushed.  But I can do the additional check and skip the wbinvd().

> 
> And reboot notifier looks the wrong place for WBINVD. Because SEAMCALLs
> (see below) called after the reboot notifier will set the per-CPU variable
> again. So in some cases, this patch will result in an *extra* WBINVD
> instead of moving WBINVD to an earlier stage.

I agree.  Me and Rick had some discussion around here and this patch can
still bring optimization "in most cases", i.e., the real user of kexec
normally will just do the kexec when no TD is running.

To make it complete we should manually kill all TDs upon rebooting notifier.

I should call that out in the changelog though.

> 
> kernel_kexec()
>   ->kernel_restart_prepare()
>     ->blocking_notifier_call_chain() // reboot notifier
>   ->syscore_shutdown()
>     -> ...
>       ->tdx_disable_virtualization_cpu()
>         ->tdx_flush_vp()
> 
> > +	this_cpu_write(cache_state_incoherent, false);
> > +}
> > +EXPORT_SYMBOL_GPL(tdx_cpu_flush_cache);
> 
> I wonder why we don't simply perform WBINVD in
> vt_disable_virtualization_cpu() after VMXOFF, i.e.,
> 
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index d1e02e567b57..1ad3c28b8eff 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -19,6 +19,8 @@ static void vt_disable_virtualization_cpu(void)
> 	if (enable_tdx)
> 		tdx_disable_virtualization_cpu();
> 	vmx_disable_virtualization_cpu();
> +	/* Explain why WBINVD is needed */
> +	if (enable_tdx)
> +		wbinvd();
>  }
> 
>  static __init int vt_hardware_setup(void)
> 
> It can solve the cache line aliasing problem and is much simpler than
> patches 1-2 and 6.

This sounds promising, but the emergency virtualization part needs similar
handling too.

Previously the VMXOFF in the emergency virtualization was explicitly done in
the core kernel, so for this approach we have to scatter checks for
different vendors at different places.  This wasn't nice.

Now the emergency virtualization disable itself is implemented in KVM too
(the core kernel only calls a function pointer), so I _think_ the situation
is better now, if we do wbinvd() in VMX disabling path.

It will get more complicated when other kernel components (VT-d) need to
invoke SEAMCALL, but at that time VMX code should have been moved to core
kernel, so doing WBINVD after VMXOFF sounds fine too.

One concern is this approach doesn't play quite nice with below pattern:

	cpu_enable_virtualization();
	SEAMCALL();
	cpu_disable_virtualization();

but hopefully VT-d code doesn't need to do like this.

The percpu boolean approach does have another advantage (although it's more
like theoretical issue), that it could be also used to cover other cases
that could also lead to cache being in incoherent:

https://lore.kernel.org/lkml/eb2e3b02-cf5e-4848-8f1d-9f3af8f9c96b@intel.com/

I'll think more on this.  I will be out for the rest of the week so I will
come back next week.