lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <1fee2f710a69ac678c17835519bed352e4224ff2.camel@intel.com>
Date: Wed, 2 Jul 2025 09:22:43 +0000
From: "Huang, Kai" <kai.huang@...el.com>
To: "Gao, Chao" <chao.gao@...el.com>
CC: "kvm@...r.kernel.org" <kvm@...r.kernel.org>, "ashish.kalra@....com"
	<ashish.kalra@....com>, "Hansen, Dave" <dave.hansen@...el.com>,
	"thomas.lendacky@....com" <thomas.lendacky@....com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"mingo@...hat.com" <mingo@...hat.com>, "seanjc@...gle.com"
	<seanjc@...gle.com>, "pbonzini@...hat.com" <pbonzini@...hat.com>,
	"tglx@...utronix.de" <tglx@...utronix.de>, "Yamahata, Isaku"
	<isaku.yamahata@...el.com>, "kirill.shutemov@...ux.intel.com"
	<kirill.shutemov@...ux.intel.com>, "Chatre, Reinette"
	<reinette.chatre@...el.com>, "nik.borisov@...e.com" <nik.borisov@...e.com>,
	"hpa@...or.com" <hpa@...or.com>, "peterz@...radead.org"
	<peterz@...radead.org>, "sagis@...gle.com" <sagis@...gle.com>, "Chen, Farrah"
	<farrah.chen@...el.com>, "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>,
	"bp@...en8.de" <bp@...en8.de>, "x86@...nel.org" <x86@...nel.org>, "Williams,
 Dan J" <dan.j.williams@...el.com>
Subject: Re: [PATCH v3 6/6] KVM: TDX: Explicitly do WBINVD upon reboot
 notifier

On Wed, 2025-07-02 at 15:54 +0800, Chao Gao wrote:
> > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > @@ -1870,3 +1870,12 @@ u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct page *page)
> > 	return seamcall(TDH_PHYMEM_PAGE_WBINVD, &args);
> > }
> > EXPORT_SYMBOL_GPL(tdh_phymem_page_wbinvd_hkid);
> > +
> > +void tdx_cpu_flush_cache(void)
> > +{
> > +	lockdep_assert_preemption_disabled();
> > +
> > +	wbinvd();
> 
> Shouldn't you check the per-CPU variable first? so that WBINVD can be
> skipped if there is nothing incoherent.

It is assumed the caller of this function knows that cache needs to be
flushed.  But I can do the additional check and skip the wbinvd().

> 
> And reboot notifier looks the wrong place for WBINVD. Because SEAMCALLs
> (see below) called after the reboot notifier will set the per-CPU variable
> again. So in some cases, this patch will result in an *extra* WBINVD
> instead of moving WBINVD to an earlier stage.

I agree.  Me and Rick had some discussion around here and this patch can
still bring optimization "in most cases", i.e., the real user of kexec
normally will just do the kexec when no TD is running.

To make it complete we should manually kill all TDs upon rebooting notifier.

I should call that out in the changelog though.

> 
> kernel_kexec()
>   ->kernel_restart_prepare()
>     ->blocking_notifier_call_chain() // reboot notifier
>   ->syscore_shutdown()
>     -> ...
>       ->tdx_disable_virtualization_cpu()
>         ->tdx_flush_vp()
> 
> > +	this_cpu_write(cache_state_incoherent, false);
> > +}
> > +EXPORT_SYMBOL_GPL(tdx_cpu_flush_cache);
> 
> I wonder why we don't simply perform WBINVD in
> vt_disable_virtualization_cpu() after VMXOFF, i.e.,
> 
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index d1e02e567b57..1ad3c28b8eff 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -19,6 +19,8 @@ static void vt_disable_virtualization_cpu(void)
> 	if (enable_tdx)
> 		tdx_disable_virtualization_cpu();
> 	vmx_disable_virtualization_cpu();
> +	/* Explain why WBINVD is needed */
> +	if (enable_tdx)
> +		wbinvd();
>  }
> 
>  static __init int vt_hardware_setup(void)
> 
> It can solve the cache line aliasing problem and is much simpler than
> patches 1-2 and 6.

This sounds promising, but the emergency virtualization part needs similar
handling too.

Previously the VMXOFF in the emergency virtualization was explicitly done in
the core kernel, so for this approach we have to scatter checks for
different vendors at different places.  This wasn't nice.

Now the emergency virtualization disable itself is implemented in KVM too
(the core kernel only calls a function pointer), so I _think_ the situation
is better now, if we do wbinvd() in VMX disabling path.

It will get more complicated when other kernel components (VT-d) need to
invoke SEAMCALL, but at that time VMX code should have been moved to core
kernel, so doing WBINVD after VMXOFF sounds fine too.

One concern is this approach doesn't play quite nice with below pattern:

	cpu_enable_virtualization();
	SEAMCALL();
	cpu_disable_virtualization();

but hopefully VT-d code doesn't need to do like this.

The percpu boolean approach does have another advantage (although it's more
like theoretical issue), that it could be also used to cover other cases
that could also lead to cache being in incoherent:

https://lore.kernel.org/lkml/eb2e3b02-cf5e-4848-8f1d-9f3af8f9c96b@intel.com/

I'll think more on this.  I will be out for the rest of the week so I will
come back next week.





Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ