linux-kernel - Re: [PATCH v13 019/113] KVM: TDX: create/destroy VM structure

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20230402114158.00000543@gmail.com>
Date:   Sun, 2 Apr 2023 11:41:58 +0300
From:   Zhi Wang <zhi.wang.linux@...il.com>
To:     Isaku Yamahata <isaku.yamahata@...il.com>
Cc:     isaku.yamahata@...el.com, kvm@...r.kernel.org,
        linux-kernel@...r.kernel.org, Paolo Bonzini <pbonzini@...hat.com>,
        erdemaktas@...gle.com, Sean Christopherson <seanjc@...gle.com>,
        Sagi Shahar <sagis@...gle.com>,
        David Matlack <dmatlack@...gle.com>,
        Kai Huang <kai.huang@...el.com>,
        Sean Christopherson <sean.j.christopherson@...el.com>
Subject: Re: [PATCH v13 019/113] KVM: TDX: create/destroy VM structure

On Wed, 29 Mar 2023 18:01:53 -0700
Isaku Yamahata <isaku.yamahata@...il.com> wrote:

> Hi, thanks for review.
> 
> On Sun, Mar 26, 2023 at 02:09:36PM +0300,
> Zhi Wang <zhi.wang.linux@...il.com> wrote:
> 
> > On Sun, 12 Mar 2023 10:55:43 -0700
> > isaku.yamahata@...el.com wrote:
> 
> > > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > > index 8b02e605cfb5..3ede8a726b47 100644
> > > --- a/arch/x86/kvm/vmx/tdx.c
> > > +++ b/arch/x86/kvm/vmx/tdx.c
> > > @@ -5,8 +5,9 @@
> > >  
> > >  #include "capabilities.h"
> > >  #include "x86_ops.h"
> > > -#include "x86.h"
> > >  #include "tdx.h"
> > > +#include "tdx_ops.h"
> > > +#include "x86.h"
> > >  
> > >  #undef pr_fmt
> > >  #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> > > @@ -46,11 +47,276 @@ int tdx_vm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)
> > >  	return r;
> > >  }
> > >  
> > > +struct tdx_info {
> > > +	u8 nr_tdcs_pages;
> > > +};
> > > +
> > > +/* Info about the TDX module. */
> > > +static struct tdx_info tdx_info;
> > > +
> > > +/*
> > > + * Some TDX SEAMCALLs (TDH.MNG.CREATE, TDH.PHYMEM.CACHE.WB,
> > > + * TDH.MNG.KEY.RECLAIMID, TDH.MNG.KEY.FREEID etc) tries to acquire a global lock
> > > + * internally in TDX module.  If failed, TDX_OPERAND_BUSY is returned without
> > > + * spinning or waiting due to a constraint on execution time.  It's caller's
> > > + * responsibility to avoid race (or retry on TDX_OPERAND_BUSY).  Use this mutex
> > > + * to avoid race in TDX module because the kernel knows better about scheduling.
> > > + */
> > > +static DEFINE_MUTEX(tdx_lock);
> > > +static struct mutex *tdx_mng_key_config_lock;
> > > +
> > > +static __always_inline hpa_t set_hkid_to_hpa(hpa_t pa, u16 hkid)
> > > +{
> > > +	return pa | ((hpa_t)hkid << boot_cpu_data.x86_phys_bits);
> > > +}
> > > +
> > > +static inline bool is_td_created(struct kvm_tdx *kvm_tdx)
> > > +{
> > > +	return kvm_tdx->tdr_pa;
> > > +}
> > > +
> > > +static inline void tdx_hkid_free(struct kvm_tdx *kvm_tdx)
> > > +{
> > > +	tdx_guest_keyid_free(kvm_tdx->hkid);
> > > +	kvm_tdx->hkid = 0;
> > > +}
> > > +
> > > +static inline bool is_hkid_assigned(struct kvm_tdx *kvm_tdx)
> > > +{
> > > +	return kvm_tdx->hkid > 0;
> > > +}
> > > +
> > >  int tdx_hardware_enable(void)
> > >  {
> > >  	return tdx_cpu_enable();
> > >  }
> > >  
> > > +static void tdx_clear_page(unsigned long page_pa)
> > > +{
> > > +	const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
> > > +	void *page = __va(page_pa);
> > > +	unsigned long i;
> > > +
> > > +	if (!cpu_feature_enabled(X86_FEATURE_MOVDIR64B)) {
> > > +		clear_page(page);
> > > +		return;
> > > +	}
> > 
> > Is it possbile to have a TDX machine without MOVDIR64B support? I am not sure
> > if there is any other way for the kernel to clear the posioned cache line.
> > 
> > If not, there should be a warn/bug at least and check if MOVDIR64B support
> > when initializing the TDX.
> 
> Because the latest TDX specification uses movdir64b, it's safe for TDX
> to assume movdir64b.
> I'll add the check to TDX initialization part and drop it from here.
> 
> 
> > > +void tdx_mmu_release_hkid(struct kvm *kvm)
> > > +{
> > > +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > > +	cpumask_var_t packages;
> > > +	bool cpumask_allocated;
> > > +	u64 err;
> > > +	int ret;
> > > +	int i;
> > > +
> > > +	if (!is_hkid_assigned(kvm_tdx))
> > > +		return;
> > > +
> > > +	if (!is_td_created(kvm_tdx))
> > > +		goto free_hkid;
> > > +
> > > +	cpumask_allocated = zalloc_cpumask_var(&packages, GFP_KERNEL);
> > > +	cpus_read_lock();
> > > +	for_each_online_cpu(i) {
> > > +		if (cpumask_allocated &&
> > > +			cpumask_test_and_set_cpu(topology_physical_package_id(i),
> > > +						packages))
> > > +			continue;
> > 
> > Is this necessary to check cpumask_allocated in the while loop? if cpumask
> > is not succefully allocated, wouldn't it be better to bail out just after
> > it?
> 
> No because we can't return error here.  It's better to do in-efficiently freeing
> resources instead of leak.
> 
> We can move the check out of loop. But it would be ugly
> (if () {cpu loop} else {cpu loop} ) and this function isn't performance
> critical.  Also I think it's okay to depend on compiler optimization for loop
> invariant. My compiler didn't optimize it in this case, though.
> 

Do you mean the tdh_mng_key_freeid() is still required if failing to allocate
the cpumask var and do TDH.PHYMEM_CACHE_WB(WBINVD) on each CPU?

Out of curiosity, I took a look on the TDX module source code [1], it seems TDX
module has an additional check in TDH.MNG.KEY.FREEID. TDH.MNG.VPFLUSHDONE [2]
will mark the pending wbinvd in a bitmap:

...
/**
     * Create the WBINVD_BITMAP per-package.
     * Set to 1 num_of_pkgs bits from the LSB
     */
    global_data_ptr->kot.entries[curr_hkid].wbinvd_bitmap = global_data_ptr->pkg_config_bitmap; /* <----HERE */

    // Set new TD life cycle state
    tdr_ptr->management_fields.lifecycle_state = TD_BLOCKED;

    // Set the proper new KOT entry state
    global_data_ptr->kot.entries[curr_hkid].state = (uint8_t)KOT_STATE_HKID_FLUSHED; 
...

And TDH.MNG.KEY.FREEID [3] will check if the pending WBINVD has been performed:

...
    /**
     * If TDH_PHYMEM_CACHE_WB was executed on all packages/cores,
     * set the KOT entry, set the KOT entry state to HKID_FREE.
     */
    curr_hkid = tdr_ptr->key_management_fields.hkid;
    tdx_debug_assert(global_data_ptr->kot.entr/ies[curr_hkid].state == KOT_STATE_HKID_FLUSHED);
    if (global_data_ptr->kot.entries[curr_hkid].wbinvd_bitmap != 0) /* HERE */
    {
        TDX_ERROR("CACHEWB is not complete for this HKID (=%x)\n", curr_hkid);
        return_val = TDX_WBCACHE_NOT_COMPLETE;
        goto EXIT;
    }
...

Guess the conclusion is: if TDH.PHYMEM.CACHE.WB is not performed on each
required CPU correctly, TDH.MNG.KEY.FREEID will fail as well. A leak seems
the only option (none of us likes a leak, but...).

It would be better that:

1) Leave a comment about the finding above in the code to explain why leak
happens (It is always nice to explain the reason of a leak). One sentence
will be good enough.

2) If failing to allocate the cpumask, bail out (with the findings above)

...
cpumask_allocated = zalloc_cpumask_var(&packages, GFP_KERNEL);
if (!cpumask_allocated)
	return;
...

3) As the reason of the leak will be explained in tdx_mmu_release_hkid(),
leaving a pointer in the comment in tdx_vm_free() to refer to
tdx_mmu_release_hkid() would be enough, like: 

...
/* 
 * Can't reclaim or free TD pages if teardown failed in
 * tdx_mmu_release_hkid().
 */
if (is_hkid_assigned(kvm_tdx))
	return;
...

[1] https://downloadmirror.intel.com/738876/tdx-module-v1.0.01.01.zip
[2] tdx-module/src/vmm_dispatcher/api_calls/tdh_mng_vpflushdone.c
[3] tdx-module/src/vmm_dispatcher/api_calls/tdh_mng_key_freeid.c

> 
> > > +
> > > +		/*
> > > +		 * We can destroy multiple the guest TDs simultaneously.
> > > +		 * Prevent tdh_phymem_cache_wb from returning TDX_BUSY by
> > > +		 * serialization.
> > > +		 */
> > > +		mutex_lock(&tdx_lock);
> > > +		ret = smp_call_on_cpu(i, tdx_do_tdh_phymem_cache_wb, NULL, 1);
> > > +		mutex_unlock(&tdx_lock);
> > > +		if (ret)
> > > +			break;
> > > +	}
> > > +	cpus_read_unlock();
> > > +	free_cpumask_var(packages);
> > > +
> > > +	mutex_lock(&tdx_lock);
> > > +	err = tdh_mng_key_freeid(kvm_tdx->tdr_pa);
> > > +	mutex_unlock(&tdx_lock);
> > > +	if (WARN_ON_ONCE(err)) {
> > > +		pr_tdx_error(TDH_MNG_KEY_FREEID, err, NULL);
> > > +		pr_err("tdh_mng_key_freeid failed. HKID %d is leaked.\n",
> > > +			kvm_tdx->hkid);
> > > +		return;
> > > +	}
> > > +
> > > +free_hkid:
> > > +	tdx_hkid_free(kvm_tdx);
> > > +}
> > > +
> > > +void tdx_vm_free(struct kvm *kvm)
> > > +{
> > > +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > > +	int i;
> > > +
> > > +	/* Can't reclaim or free TD pages if teardown failed. */
> > > +	if (is_hkid_assigned(kvm_tdx))
> > > +		return;
> > > +
> > 
> > Better to explain why, as it is common to think even the teardown failed, we
> > should still try to reclaim the pages as many as we can.
> 
> Ok, here is the updated comment.
>  /*
>   * tdx_mmu_release_hkid() failed to reclaim HKID.  Something went wrong
>   * heavily with TDX module.  Give up freeing TD pages.  As the function
>   * already warned, don't warn it again.
>   */