lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <6f959f494f0fb3dedfa963c3d6a0ce7f395b745d.camel@intel.com>
Date:   Tue, 10 Jan 2023 11:29:49 +0000
From:   "Huang, Kai" <kai.huang@...el.com>
To:     "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
        "Hansen, Dave" <dave.hansen@...el.com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
CC:     "Luck, Tony" <tony.luck@...el.com>,
        "bagasdotme@...il.com" <bagasdotme@...il.com>,
        "ak@...ux.intel.com" <ak@...ux.intel.com>,
        "Wysocki, Rafael J" <rafael.j.wysocki@...el.com>,
        "kirill.shutemov@...ux.intel.com" <kirill.shutemov@...ux.intel.com>,
        "Christopherson,, Sean" <seanjc@...gle.com>,
        "Chatre, Reinette" <reinette.chatre@...el.com>,
        "pbonzini@...hat.com" <pbonzini@...hat.com>,
        "tglx@...utronix.de" <tglx@...utronix.de>,
        "linux-mm@...ck.org" <linux-mm@...ck.org>,
        "Yamahata, Isaku" <isaku.yamahata@...el.com>,
        "peterz@...radead.org" <peterz@...radead.org>,
        "Shahar, Sagi" <sagis@...gle.com>,
        "imammedo@...hat.com" <imammedo@...hat.com>,
        "Gao, Chao" <chao.gao@...el.com>,
        "Brown, Len" <len.brown@...el.com>,
        "sathyanarayanan.kuppuswamy@...ux.intel.com" 
        <sathyanarayanan.kuppuswamy@...ux.intel.com>,
        "Huang, Ying" <ying.huang@...el.com>,
        "Williams, Dan J" <dan.j.williams@...el.com>
Subject: Re: [PATCH v8 15/16] x86/virt/tdx: Flush cache in kexec() when TDX is
 enabled

On Fri, 2023-01-06 at 16:35 -0800, Dave Hansen wrote:
> On 12/8/22 22:52, Kai Huang wrote:
> > There are two problems in terms of using kexec() to boot to a new kernel
> > when the old kernel has enabled TDX: 1) Part of the memory pages are
> > still TDX private pages (i.e. metadata used by the TDX module, and any
> > TDX guest memory if kexec() happens when there's any TDX guest alive).
> > 2) There might be dirty cachelines associated with TDX private pages.
> > 
> > Because the hardware doesn't guarantee cache coherency among different
> > KeyIDs, the old kernel needs to flush cache (of those TDX private pages)
> > before booting to the new kernel.  Also, reading TDX private page using
> > any shared non-TDX KeyID with integrity-check enabled can trigger #MC.
> > Therefore ideally, the kernel should convert all TDX private pages back
> > to normal before booting to the new kernel.
> 
> This is just talking about way too many things that just don't apply.
> 
> Let's focus on the *ACTUAL* problem that's being addressed instead of
> the 15 problems that aren't actual practical problems.

Will get rid of those.

> 
> > However, this implementation doesn't convert TDX private pages back to
> > normal in kexec() because of below considerations:
> > 
> > 1) Neither the kernel nor the TDX module has existing infrastructure to
> >    track which pages are TDX private pages.
> > 2) The number of TDX private pages can be large, and converting all of
> >    them (cache flush + using MOVDIR64B to clear the page) in kexec() can
> >    be time consuming.
> > 3) The new kernel will almost only use KeyID 0 to access memory.  KeyID
> >    0 doesn't support integrity-check, so it's OK.
> > 4) The kernel doesn't (and may never) support MKTME.  If any 3rd party
> >    kernel ever supports MKTME, it can/should do MOVDIR64B to clear the
> >    page with the new MKTME KeyID (just like TDX does) before using it.
> 
> Yeah, why are we getting all worked up about MKTME when there is not
> support?

I am not sure whether we need to consider 3rd party kernel case?

> 
> The only thing that matters here is dirty cacheline writeback.  There
> are two things the kernel needs to do to mitigate that:
> 
>  1. Stop accessing TDX private memory mappings
>   1a. Stop making TDX module calls (uses global private KeyID)
>   1b. Stop TDX guests from running (uses per-guest KeyID)
>  2. Flush any cachelines from previous private KeyID writes
> 
> There are a couple of ways we can do #2.  We do *NOT* need to convert
> *ANYTHING* back to KeyID 0.  Page conversion doesn't even come into play
> in any way as far as I can tell.

May I ask why?  When I was writing this patch I was not sure whether kexec()
should give the new kernel a clean slate.  SGX driver doesn't EREMOVE all EPC
during kexec() but depends on the new kernel to do that too, but I don't know
what's the general guide of supporting kexec().

> 
> I think you're also saying that since all CPUs go through this path and
> there is no TDX activity between the WBINVD and the native_halt() that
> 1a and 1b basically happen for "free" without needing to do theme
> explicitly.

Yes.  Should we mention this part in changelog?

AMD SME kexec() support patch bba4ed011a52d ("x86/mm, kexec: Allow kexec to be
used with SME") seems doesn't mention anything similar (SME and TDX may be
different, though).

> 
> > Therefore, this implementation just flushes cache to make sure there are
> > no stale dirty cachelines associated with any TDX private KeyIDs before
> > booting to the new kernel, otherwise they may silently corrupt the new
> > kernel.
> 
> That's fine.  So, this patch kinda happens to land in the right spot
> even after thrashing about about a while.
> 
> > Following SME support, use wbinvd() to flush cache in stop_this_cpu().
> > Theoretically, cache flush is only needed when the TDX module has been
> > initialized.  However initializing the TDX module is done on demand at
> > runtime, and it takes a mutex to read the module status.  Just check
> > whether TDX is enabled by BIOS instead to flush cache.
> 
> Yeah, close enough.
> 
> > diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> > index c21b7347a26d..0cc84977dc62 100644
> > --- a/arch/x86/kernel/process.c
> > +++ b/arch/x86/kernel/process.c
> > @@ -765,8 +765,14 @@ void __noreturn stop_this_cpu(void *dummy)
> >  	 *
> >  	 * Test the CPUID bit directly because the machine might've cleared
> >  	 * X86_FEATURE_SME due to cmdline options.
> > +	 *
> > +	 * Similar to SME, if the TDX module is ever initialized, the
> > +	 * cachelines associated with any TDX private KeyID must be flushed
> > +	 * before transiting to the new kernel.  The TDX module is initialized
> > +	 * on demand, and it takes the mutex to read its status.  Just check
> > +	 * whether TDX is enabled by BIOS instead to flush cache.
> >  	 */
> 
> There's too much detail here.  Let's up-level it a bit.  We don't need
> to be talking about TDX locking here.

Sure will do.  Thanks!
> 
> 	/*
> 	 * The TDX module or guests might have left dirty cachelines
> 	 * behind.  Flush them to avoid corruption from later writeback.
> 	 * Note that this flushes on all systems where TDX is possible,
> 	 * but does not actually check that TDX was in use.
> 	 */
> 
> > -	if (cpuid_eax(0x8000001f) & BIT(0))
> > +	if (cpuid_eax(0x8000001f) & BIT(0) || platform_tdx_enabled())
> >  		native_wbinvd();
> >  	for (;;) {
> >  		/*
> 
> 

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ