linux-kernel - Re: [PATCH v15 22/23] x86/mce: Improve error log of kernel space TDX #MC due to erratum

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <afc875ace6f9f955557f5c7e811b3046278e4c51.camel@intel.com>
Date:   Sun, 3 Dec 2023 11:44:01 +0000
From:   "Huang, Kai" <kai.huang@...el.com>
To:     "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
        "Hansen, Dave" <dave.hansen@...el.com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
CC:     "sathyanarayanan.kuppuswamy@...ux.intel.com" 
        <sathyanarayanan.kuppuswamy@...ux.intel.com>,
        "Luck, Tony" <tony.luck@...el.com>,
        "david@...hat.com" <david@...hat.com>,
        "bagasdotme@...il.com" <bagasdotme@...il.com>,
        "ak@...ux.intel.com" <ak@...ux.intel.com>,
        "kirill.shutemov@...ux.intel.com" <kirill.shutemov@...ux.intel.com>,
        "seanjc@...gle.com" <seanjc@...gle.com>,
        "mingo@...hat.com" <mingo@...hat.com>,
        "pbonzini@...hat.com" <pbonzini@...hat.com>,
        "tglx@...utronix.de" <tglx@...utronix.de>,
        "Yamahata, Isaku" <isaku.yamahata@...el.com>,
        "nik.borisov@...e.com" <nik.borisov@...e.com>,
        "hpa@...or.com" <hpa@...or.com>,
        "peterz@...radead.org" <peterz@...radead.org>,
        "sagis@...gle.com" <sagis@...gle.com>,
        "imammedo@...hat.com" <imammedo@...hat.com>,
        "bp@...en8.de" <bp@...en8.de>, "Gao, Chao" <chao.gao@...el.com>,
        "Brown, Len" <len.brown@...el.com>,
        "rafael@...nel.org" <rafael@...nel.org>,
        "Huang, Ying" <ying.huang@...el.com>,
        "Williams, Dan J" <dan.j.williams@...el.com>,
        "x86@...nel.org" <x86@...nel.org>
Subject: Re: [PATCH v15 22/23] x86/mce: Improve error log of kernel space TDX
 #MC due to erratum

On Fri, 2023-12-01 at 12:35 -0800, Hansen, Dave wrote:
> On 11/9/23 03:55, Kai Huang wrote:
> > +static bool is_pamt_page(unsigned long phys)
> > +{
> > +	struct tdmr_info_list *tdmr_list = &tdx_tdmr_list;
> > +	int i;
> > +
> > +	/*
> > +	 * This function is called from #MC handler, and theoretically
> > +	 * it could run in parallel with the TDX module initialization
> > +	 * on other logical cpus.  But it's not OK to hold mutex here
> > +	 * so just blindly check module status to make sure PAMTs/TDMRs
> > +	 * are stable to access.
> > +	 *
> > +	 * This may return inaccurate result in rare cases, e.g., when
> > +	 * #MC happens on a PAMT page during module initialization, but
> > +	 * this is fine as #MC handler doesn't need a 100% accurate
> > +	 * result.
> > +	 */
> 
> It doesn't need perfect accuracy.  But how do we know it's not going to
> go, for instance, chase a bad pointer?
> 
> > +	if (tdx_module_status != TDX_MODULE_INITIALIZED)
> > +		return false;
> 
> As an example, what prevents this CPU from observing
> tdx_module_status==TDX_MODULE_INITIALIZED while the PAMT structure is
> being assembled?

There are two types of memory order serializing operations between assembling
the TDMR/PAMT structure and setting the tdx_module_status to
TDX_MODULE_INITIALIZED: 1) wbvind_on_all_cpus(); 2) bunch of SEAMCALLs;

WBINVD is a serializing instruction.  SEAMCALL is a VMEXIT to the TDX module,
which involves GDT/LDT/control registers/MSRs switch so it is also a serializing
operation.

But perhaps we can explicitly add a smp_wmb() between assembling TDMR/PAMT
structure and setting tdx_module_status if that's better.

> 
> > +	for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) {
> > +		unsigned long base, size;
> > +
> > +		tdmr_get_pamt(tdmr_entry(tdmr_list, i), &base, &size);
> > +
> > +		if (phys >= base && phys < (base + size))
> > +			return true;
> > +	}
> > +
> > +	return false;
> > +}
> > +
> > +/*
> > + * Return whether the memory page at the given physical address is TDX
> > + * private memory or not.  Called from #MC handler do_machine_check().
> > + *
> > + * Note this function may not return an accurate result in rare cases.
> > + * This is fine as the #MC handler doesn't need a 100% accurate result,
> > + * because it cannot distinguish #MC between software bug and real
> > + * hardware error anyway.
> > + */
> > +bool tdx_is_private_mem(unsigned long phys)
> > +{
> > +	struct tdx_module_args args = {
> > +		.rcx = phys & PAGE_MASK,
> > +	};
> > +	u64 sret;
> > +
> > +	if (!platform_tdx_enabled())
> > +		return false;
> > +
> > +	/* Get page type from the TDX module */
> > +	sret = __seamcall_ret(TDH_PHYMEM_PAGE_RDMD, &args);
> > +	/*
> > +	 * Handle the case that CPU isn't in VMX operation.
> > +	 *
> > +	 * KVM guarantees no VM is running (thus no TDX guest)
> > +	 * when there's any online CPU isn't in VMX operation.
> > +	 * This means there will be no TDX guest private memory
> > +	 * and Secure-EPT pages.  However the TDX module may have
> > +	 * been initialized and the memory page could be PAMT.
> > +	 */
> > +	if (sret == TDX_SEAMCALL_UD)
> > +		return is_pamt_page(phys);
> 
> Either this is comment is wonky or the module initialization is buggy.
> 
> config_global_keyid() goes and does SEAMCALLs on all CPUs.  There are
> zero checks or special handling in there for whether the CPU has done
> VMXON.  So, by the time we've started initializing the TDX module
> (including the PAMT), all online CPUs must be able to do SEAMCALLs.  Right?
> 
> So how can we have a working PAMT here when this CPU can't do SEAMCALLs?

The corner case is KVM can enable VMX on all cpus, initialize the TDX module,
and then disable VMX on all cpus.  One example is KVM can be unloaded after it
initializes the TDX module.

In this case CPU cannot do SEAMCALL but PAMTs are already working :-)

However if SEAMCALL cannot be made (due to out of VMX), then the module can only
be initialized or the initialization hasn't been tried, so both
tdx_module_status and the tdx_tdmr_list are stable to access.

> 
> I don't think we should even bother with this complexity.  I think we
> can just fix the whole thing by saying that unless you can make a
> non-init SEAMCALL, we just assume the memory can't be private.
> 
> The transition to being able to make non-init SEAMCALLs is also #MC safe
> *and* it's at a point when the tdmr_list is stable.
> 
> Can anyone shoot any holes in that? :)