Message-ID: <e7c16241-100a-4830-9628-65edb44ca78d@suse.com>
Date: Mon, 19 Aug 2024 18:09:06 +0300
From: Nikolay Borisov <nik.borisov@...e.com>
To: Rick Edgecombe <rick.p.edgecombe@...el.com>, seanjc@...gle.com,
 pbonzini@...hat.com, kvm@...r.kernel.org
Cc: kai.huang@...el.com, isaku.yamahata@...il.com,
 tony.lindgren@...ux.intel.com, xiaoyao.li@...el.com,
 linux-kernel@...r.kernel.org, Isaku Yamahata <isaku.yamahata@...el.com>,
 Sean Christopherson <sean.j.christopherson@...el.com>,
 Yan Zhao <yan.y.zhao@...el.com>
Subject: Re: [PATCH 13/25] KVM: TDX: create/destroy VM structure



On 13.08.24 at 1:48, Rick Edgecombe wrote:
> From: Isaku Yamahata <isaku.yamahata@...el.com>
> 
> Implement managing the TDX private KeyID as part of creating, destroying,
> and freeing a TDX guest.
> 
> When creating a TDX guest, assign a TDX private KeyID for the TDX guest
> for memory encryption, and allocate pages for the guest. These are used
> for the Trust Domain Root (TDR) and Trust Domain Control Structure (TDCS).
> 
> On destruction, free the allocated pages and the KeyID.
> 
> Before tearing down the private page tables, TDX requires the guest TD to
> be destroyed by reclaiming the KeyID. Do this in the vm_destroy()
> kvm_x86_ops hook.
> 
> Add a call to vm_free() at the end of kvm_arch_destroy_vm() because the
> per-VM TDR needs to be freed after the KeyID.
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@...el.com>
> Co-developed-by: Sean Christopherson <sean.j.christopherson@...el.com>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@...el.com>
> Co-developed-by: Kai Huang <kai.huang@...el.com>
> Signed-off-by: Kai Huang <kai.huang@...el.com>
> Co-developed-by: Yan Zhao <yan.y.zhao@...el.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@...el.com>
> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@...el.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@...el.com>
> ---

<snip>


> @@ -19,14 +20,14 @@ static const struct tdx_sysinfo *tdx_sysinfo;
>   /* TDX KeyID pool */
>   static DEFINE_IDA(tdx_guest_keyid_pool);
>   
> -static int __used tdx_guest_keyid_alloc(void)
> +static int tdx_guest_keyid_alloc(void)
>   {
>   	return ida_alloc_range(&tdx_guest_keyid_pool, tdx_guest_keyid_start,
>   			       tdx_guest_keyid_start + tdx_nr_guest_keyids - 1,
>   			       GFP_KERNEL);
>   }
>   
> -static void __used tdx_guest_keyid_free(int keyid)
> +static void tdx_guest_keyid_free(int keyid)
>   {
>   	ida_free(&tdx_guest_keyid_pool, keyid);
>   }
> @@ -73,6 +74,305 @@ int tdx_vm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)
>   	return r;
>   }
>   
> +/*
> + * Some SEAMCALLs acquire the TDX module globally, and can fail with
> + * TDX_OPERAND_BUSY.  Use a global mutex to serialize these SEAMCALLs.
> + */
> +static DEFINE_MUTEX(tdx_lock);

The way this lock is used is very ugly. It essentially mimics a lock 
which already lives in the TDX module. So why not simply handle the 
TDX_OPERAND_BUSY return value gracefully, or change the interface of the 
module (yeah, it's probably too late for that now) to expose the lock? 
This lock breaks one of the main rules of locking - "Lock data, not code".
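
E.g. something like this at each call site, instead of serializing 
everything on tdx_lock (untested sketch; tdx_operand_busy() is a 
hypothetical helper, and the status/operand-ID bit layout is assumed 
from the switch in ____tdx_reclaim_page() below):

	/*
	 * Hypothetical helper: assume the operand ID (TDX_OPERAND_ID_RCX
	 * etc.) sits in the low bits of the status, so mask it off before
	 * comparing against the TDX_OPERAND_BUSY class.
	 */
	static bool tdx_operand_busy(u64 err)
	{
		return (err & 0xFFFFFFFF00000000ULL) == TDX_OPERAND_BUSY;
	}

	/* E.g. in tdx_mmu_release_hkid(), retry instead of mutex_lock(): */
	for (i = 0; i < TDX_SEAMCALL_RETRIES; i++) {
		err = tdh_mng_key_freeid(kvm_tdx);
		if (!tdx_operand_busy(err))
			break;
		cond_resched();
	}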

> +
> +/* Maximum number of retries to attempt for SEAMCALLs. */
> +#define TDX_SEAMCALL_RETRIES	10000
> +
> +static __always_inline hpa_t set_hkid_to_hpa(hpa_t pa, u16 hkid)
> +{
> +	return pa | ((hpa_t)hkid << boot_cpu_data.x86_phys_bits);
> +}
> +
> +static inline bool is_td_created(struct kvm_tdx *kvm_tdx)
> +{
> +	return kvm_tdx->tdr_pa;
> +}
> +
> +static inline void tdx_hkid_free(struct kvm_tdx *kvm_tdx)
> +{
> +	tdx_guest_keyid_free(kvm_tdx->hkid);
> +	kvm_tdx->hkid = -1;
> +}
> +
> +static inline bool is_hkid_assigned(struct kvm_tdx *kvm_tdx)
> +{
> +	return kvm_tdx->hkid > 0;
> +}
> +
> +static void tdx_clear_page(unsigned long page_pa)
> +{
> +	const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
> +	void *page = __va(page_pa);
> +	unsigned long i;
> +
> +	/*
> +	 * The page could have been poisoned.  MOVDIR64B also clears
> +	 * the poison bit so the kernel can safely use the page again.
> +	 */
> +	for (i = 0; i < PAGE_SIZE; i += 64)
> +		movdir64b(page + i, zero_page);
> +	/*
> +	 * MOVDIR64B store uses WC buffer.  Prevent following memory reads
> +	 * from seeing potentially poisoned cache.
> +	 */
> +	__mb();
> +}
> +
> +static u64 ____tdx_reclaim_page(hpa_t pa, u64 *rcx, u64 *rdx, u64 *r8)

Just inline this into its sole caller; see the sketch after the quoted 
body. Yes, each individual function is rather small, but if you have to 
go through several levels of indirection then there's no point in 
splitting it up...


> +{
> +	u64 err;
> +	int i;
> +
> +	for (i = TDX_SEAMCALL_RETRIES; i > 0; i--) {
> +		err = tdh_phymem_page_reclaim(pa, rcx, rdx, r8);
> +		switch (err) {
> +		case TDX_OPERAND_BUSY | TDX_OPERAND_ID_RCX:
> +		case TDX_OPERAND_BUSY | TDX_OPERAND_ID_TDR:
> +			cond_resched();
> +			continue;
> +		default:
> +			goto out;
> +		}
> +	}
> +
> +out:
> +	return err;
> +}
> +
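
I.e. something like this (sketch; I'm assuming the sole caller is the 
__tdx_reclaim_page() used further down in tdx_vm_free(), and that the 
rcx/rdx/r8 outputs are only consumed there):

	static u64 __tdx_reclaim_page(hpa_t pa)
	{
		u64 rcx, rdx, r8, err;
		int i;

		for (i = TDX_SEAMCALL_RETRIES; i > 0; i--) {
			err = tdh_phymem_page_reclaim(pa, &rcx, &rdx, &r8);
			switch (err) {
			case TDX_OPERAND_BUSY | TDX_OPERAND_ID_RCX:
			case TDX_OPERAND_BUSY | TDX_OPERAND_ID_TDR:
				/* Contention on the reclaim; back off and retry. */
				cond_resched();
				continue;
			default:
				goto out;
			}
		}
	out:
		/* ... plus whatever error handling the current caller does ... */
		return err;
	}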

<snip>

> +
> +void tdx_mmu_release_hkid(struct kvm *kvm)
> +{
> +	bool packages_allocated, targets_allocated;
> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> +	cpumask_var_t packages, targets;
> +	u64 err;
> +	int i;
> +
> +	if (!is_hkid_assigned(kvm_tdx))
> +		return;
> +
> +	/* KeyID has been allocated but guest is not yet configured */
> +	if (!is_td_created(kvm_tdx)) {
> +		tdx_hkid_free(kvm_tdx);
> +		return;
> +	}
> +
> +	packages_allocated = zalloc_cpumask_var(&packages, GFP_KERNEL);
> +	targets_allocated = zalloc_cpumask_var(&targets, GFP_KERNEL);
> +	cpus_read_lock();
> +
> +	/*
> +	 * TDH.PHYMEM.CACHE.WB tries to acquire the TDX module global lock
> +	 * and can fail with TDX_OPERAND_BUSY when it fails to get the lock.
> +	 * Multiple TDX guests can be destroyed simultaneously. Take the
> +	 * mutex to prevent that error.
> +	 */
> +	mutex_lock(&tdx_lock);
> +
> +	/*
> +	 * We need three SEAMCALLs, TDH.MNG.VPFLUSHDONE(), TDH.PHYMEM.CACHE.WB(),
> +	 * and TDH.MNG.KEY.FREEID() to free the HKID. When the HKID is assigned,
> +	 * we need to use TDH.MEM.SEPT.REMOVE() or TDH.MEM.PAGE.REMOVE(). When
> +	 * the HKID is free, we need to use TDH.PHYMEM.PAGE.RECLAIM().  Take
> +	 * the lock so that no transient HKID state is observable.
> +	 */
> +	write_lock(&kvm->mmu_lock);
> +
> +	for_each_online_cpu(i) {
> +		if (packages_allocated &&
> +		    cpumask_test_and_set_cpu(topology_physical_package_id(i),
> +					     packages))
> +			continue;
> +		if (targets_allocated)
> +			cpumask_set_cpu(i, targets);
> +	}
> +	if (targets_allocated)
> +		on_each_cpu_mask(targets, smp_func_do_phymem_cache_wb, NULL, true);
> +	else
> +		on_each_cpu(smp_func_do_phymem_cache_wb, NULL, true);
> +	/*
> +	 * In the case of error in smp_func_do_phymem_cache_wb(), the following
> +	 * tdh_mng_key_freeid() will fail.
> +	 */
> +	err = tdh_mng_key_freeid(kvm_tdx);
> +	if (KVM_BUG_ON(err, kvm)) {
> +		pr_tdx_error(TDH_MNG_KEY_FREEID, err);
> +		pr_err("tdh_mng_key_freeid() failed. HKID %d is leaked.\n",
> +		       kvm_tdx->hkid);
> +	} else {
> +		tdx_hkid_free(kvm_tdx);
> +	}
> +
> +	write_unlock(&kvm->mmu_lock);
> +	mutex_unlock(&tdx_lock);
> +	cpus_read_unlock();
> +	free_cpumask_var(targets);
> +	free_cpumask_var(packages);
> +}
> +
> +static inline u8 tdx_sysinfo_nr_tdcs_pages(void)
> +{
> +	return tdx_sysinfo->td_ctrl.tdcs_base_size / PAGE_SIZE;
> +}

Just add an nr_tdcs_pages member to struct tdx_sysinfo_td_ctrl and 
calculate the value in get_tdx_td_ctrl() rather than having this 
long-named nonsense. The value can't be calculated at compile time anyway.
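
Roughly (sketch; neither the struct layout nor the get_tdx_td_ctrl() 
body is in this patch, so the surrounding code is assumed):

	struct tdx_sysinfo_td_ctrl {
		/* ... */
		u64 tdcs_base_size;
		u8 nr_tdcs_pages;	/* new: cached tdcs_base_size / PAGE_SIZE */
	};

	static int get_tdx_td_ctrl(struct tdx_sysinfo_td_ctrl *td_ctrl)
	{
		/* ... existing metadata reads ... */
		td_ctrl->nr_tdcs_pages = td_ctrl->tdcs_base_size / PAGE_SIZE;
		return 0;
	}

Then tdx_vm_free() below can simply read tdx_sysinfo->td_ctrl.nr_tdcs_pages.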

> +
> +void tdx_vm_free(struct kvm *kvm)
> +{
> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> +	u64 err;
> +	int i;
> +
> +	/*
> +	 * tdx_mmu_release_hkid() failed to reclaim the HKID.  Something has
> +	 * gone badly wrong with the TDX module.  Give up on freeing the TD
> +	 * pages.  The failing function already warned, so don't warn again.
> +	 */
> +	if (is_hkid_assigned(kvm_tdx))
> +		return;
> +
> +	if (kvm_tdx->tdcs_pa) {
> +		for (i = 0; i < tdx_sysinfo_nr_tdcs_pages(); i++) {
> +			if (!kvm_tdx->tdcs_pa[i])
> +				continue;
> +
> +			tdx_reclaim_control_page(kvm_tdx->tdcs_pa[i]);
> +		}
> +		kfree(kvm_tdx->tdcs_pa);
> +		kvm_tdx->tdcs_pa = NULL;
> +	}
> +
> +	if (!kvm_tdx->tdr_pa)
> +		return;

Use the is_td_created() helper (see below). Also, isn't this check 
redundant? You've already executed is_hkid_assigned(), and if the VM 
wasn't properly created, i.e. __tdx_td_init() failed for whatever 
reason, then the is_hkid_assigned() check would have bailed out already.
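
I.e.:

	-	if (!kvm_tdx->tdr_pa)
	+	if (!is_td_created(kvm_tdx))
	 		return;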

> +
> +	if (__tdx_reclaim_page(kvm_tdx->tdr_pa))
> +		return;
> +
> +	/*
> +	 * Use a SEAMCALL to ask the TDX module to flush the cache based on the
> +	 * KeyID. TDX module may access TDR while operating on TD (Especially
> +	 * when it is reclaiming TDCS).
> +	 */
> +	err = tdh_phymem_page_wbinvd(set_hkid_to_hpa(kvm_tdx->tdr_pa,
> +						     tdx_global_keyid));
> +	if (KVM_BUG_ON(err, kvm)) {
> +		pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err);
> +		return;
> +	}
> +	tdx_clear_page(kvm_tdx->tdr_pa);
> +
> +	free_page((unsigned long)__va(kvm_tdx->tdr_pa));
> +	kvm_tdx->tdr_pa = 0;
> +}
> +

<snip>
