Message-ID: <20230330010153.GB1112017@ls.amr.corp.intel.com>
Date:   Wed, 29 Mar 2023 18:01:53 -0700
From:   Isaku Yamahata <isaku.yamahata@...il.com>
To:     Zhi Wang <zhi.wang.linux@...il.com>
Cc:     isaku.yamahata@...el.com, kvm@...r.kernel.org,
        linux-kernel@...r.kernel.org, isaku.yamahata@...il.com,
        Paolo Bonzini <pbonzini@...hat.com>, erdemaktas@...gle.com,
        Sean Christopherson <seanjc@...gle.com>,
        Sagi Shahar <sagis@...gle.com>,
        David Matlack <dmatlack@...gle.com>,
        Kai Huang <kai.huang@...el.com>,
        Sean Christopherson <sean.j.christopherson@...el.com>
Subject: Re: [PATCH v13 019/113] KVM: TDX: create/destroy VM structure

Hi, thanks for the review.

On Sun, Mar 26, 2023 at 02:09:36PM +0300,
Zhi Wang <zhi.wang.linux@...il.com> wrote:

> On Sun, 12 Mar 2023 10:55:43 -0700
> isaku.yamahata@...el.com wrote:

> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index 8b02e605cfb5..3ede8a726b47 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -5,8 +5,9 @@
> >  
> >  #include "capabilities.h"
> >  #include "x86_ops.h"
> > -#include "x86.h"
> >  #include "tdx.h"
> > +#include "tdx_ops.h"
> > +#include "x86.h"
> >  
> >  #undef pr_fmt
> >  #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> > @@ -46,11 +47,276 @@ int tdx_vm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)
> >  	return r;
> >  }
> >  
> > +struct tdx_info {
> > +	u8 nr_tdcs_pages;
> > +};
> > +
> > +/* Info about the TDX module. */
> > +static struct tdx_info tdx_info;
> > +
> > +/*
> > + * Some TDX SEAMCALLs (TDH.MNG.CREATE, TDH.PHYMEM.CACHE.WB,
> > + * TDH.MNG.KEY.RECLAIMID, TDH.MNG.KEY.FREEID etc) try to acquire a global
> > + * lock internally in the TDX module.  If it fails, TDX_OPERAND_BUSY is
> > + * returned without spinning or waiting due to a constraint on execution
> > + * time.  It's the caller's responsibility to avoid races (or retry on
> > + * TDX_OPERAND_BUSY).  Use this mutex to avoid races in the TDX module
> > + * because the kernel knows better about scheduling.
> > + */
> > +static DEFINE_MUTEX(tdx_lock);
> > +static struct mutex *tdx_mng_key_config_lock;
> > +
> > +static __always_inline hpa_t set_hkid_to_hpa(hpa_t pa, u16 hkid)
> > +{
> > +	return pa | ((hpa_t)hkid << boot_cpu_data.x86_phys_bits);
> > +}
> > +
> > +static inline bool is_td_created(struct kvm_tdx *kvm_tdx)
> > +{
> > +	return kvm_tdx->tdr_pa;
> > +}
> > +
> > +static inline void tdx_hkid_free(struct kvm_tdx *kvm_tdx)
> > +{
> > +	tdx_guest_keyid_free(kvm_tdx->hkid);
> > +	kvm_tdx->hkid = 0;
> > +}
> > +
> > +static inline bool is_hkid_assigned(struct kvm_tdx *kvm_tdx)
> > +{
> > +	return kvm_tdx->hkid > 0;
> > +}
> > +
> >  int tdx_hardware_enable(void)
> >  {
> >  	return tdx_cpu_enable();
> >  }
> >  
> > +static void tdx_clear_page(unsigned long page_pa)
> > +{
> > +	const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
> > +	void *page = __va(page_pa);
> > +	unsigned long i;
> > +
> > +	if (!cpu_feature_enabled(X86_FEATURE_MOVDIR64B)) {
> > +		clear_page(page);
> > +		return;
> > +	}
> 
> Is it possible to have a TDX machine without MOVDIR64B support? I am not sure
> if there is any other way for the kernel to clear a poisoned cache line.
> 
> If not, there should at least be a warning/bug, and a check for MOVDIR64B
> support when initializing TDX.

Because the latest TDX specification uses movdir64b, it's safe for the TDX
code to assume movdir64b is available.
I'll add the check to the TDX initialization path and drop it from here.


> > +void tdx_mmu_release_hkid(struct kvm *kvm)
> > +{
> > +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > +	cpumask_var_t packages;
> > +	bool cpumask_allocated;
> > +	u64 err;
> > +	int ret;
> > +	int i;
> > +
> > +	if (!is_hkid_assigned(kvm_tdx))
> > +		return;
> > +
> > +	if (!is_td_created(kvm_tdx))
> > +		goto free_hkid;
> > +
> > +	cpumask_allocated = zalloc_cpumask_var(&packages, GFP_KERNEL);
> > +	cpus_read_lock();
> > +	for_each_online_cpu(i) {
> > +		if (cpumask_allocated &&
> > +			cpumask_test_and_set_cpu(topology_physical_package_id(i),
> > +						packages))
> > +			continue;
> 
> Is it necessary to check cpumask_allocated inside the loop? If the cpumask
> was not successfully allocated, wouldn't it be better to bail out right
> after the allocation?

No, because we can't return an error here.  It's better to free resources
inefficiently than to leak them.

We could move the check out of the loop, but it would be ugly
(if () {cpu loop} else {cpu loop}) and this function isn't performance
critical.  Also, I think it's okay to rely on the compiler hoisting the loop
invariant; my compiler didn't optimize it in this case, though.


> > +
> > +		/*
> > +		 * We can destroy multiple guest TDs simultaneously.
> > +		 * Prevent tdh_phymem_cache_wb from returning TDX_BUSY by
> > +		 * serialization.
> > +		 */
> > +		mutex_lock(&tdx_lock);
> > +		ret = smp_call_on_cpu(i, tdx_do_tdh_phymem_cache_wb, NULL, 1);
> > +		mutex_unlock(&tdx_lock);
> > +		if (ret)
> > +			break;
> > +	}
> > +	cpus_read_unlock();
> > +	free_cpumask_var(packages);
> > +
> > +	mutex_lock(&tdx_lock);
> > +	err = tdh_mng_key_freeid(kvm_tdx->tdr_pa);
> > +	mutex_unlock(&tdx_lock);
> > +	if (WARN_ON_ONCE(err)) {
> > +		pr_tdx_error(TDH_MNG_KEY_FREEID, err, NULL);
> > +		pr_err("tdh_mng_key_freeid failed. HKID %d is leaked.\n",
> > +			kvm_tdx->hkid);
> > +		return;
> > +	}
> > +
> > +free_hkid:
> > +	tdx_hkid_free(kvm_tdx);
> > +}
> > +
> > +void tdx_vm_free(struct kvm *kvm)
> > +{
> > +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > +	int i;
> > +
> > +	/* Can't reclaim or free TD pages if teardown failed. */
> > +	if (is_hkid_assigned(kvm_tdx))
> > +		return;
> > +
> 
> Better to explain why, as it is natural to think that even if teardown
> failed, we should still try to reclaim as many pages as we can.

Ok, here is the updated comment.
 /*
  * tdx_mmu_release_hkid() failed to reclaim the HKID.  Something has gone
  * badly wrong with the TDX module.  Give up freeing the TD pages.  As that
  * function already warned, don't warn again here.
  */
-- 
Isaku Yamahata <isaku.yamahata@...il.com>
