Message-ID: <86ab9923-624d-4950-abea-46780e94c6ce@linux.intel.com>
Date: Wed, 24 Sep 2025 14:15:11 +0800
From: Binbin Wu <binbin.wu@...ux.intel.com>
To: Rick Edgecombe <rick.p.edgecombe@...el.com>
Cc: kas@...nel.org, bp@...en8.de, chao.gao@...el.com,
dave.hansen@...ux.intel.com, isaku.yamahata@...el.com, kai.huang@...el.com,
kvm@...r.kernel.org, linux-coco@...ts.linux.dev,
linux-kernel@...r.kernel.org, mingo@...hat.com, pbonzini@...hat.com,
seanjc@...gle.com, tglx@...utronix.de, x86@...nel.org, yan.y.zhao@...el.com,
vannapurve@...gle.com, "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>
Subject: Re: [PATCH v3 08/16] x86/virt/tdx: Optimize tdx_alloc/free_page()
helpers
On 9/19/2025 7:22 AM, Rick Edgecombe wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>
>
> Optimize the PAMT alloc/free helpers to avoid taking the global lock when
> possible.
>
> The recently introduced PAMT alloc/free helpers maintain a refcount to
> keep track of when it is ok to reclaim and free a 4KB PAMT page. This
> refcount is protected by a global lock in order to guarantee that races
> don’t result in the PAMT getting freed while another caller requests it
> be mapped. But a global lock is a bit heavyweight, especially since the
> refcounts can be (already are) updated atomically.
>
> A simple approach would be to increment/decrement the refcount outside of
> the lock before actually adjusting the PAMT, and only adjust the PAMT if
> the refcount transitions from/to 0. This would correctly allocate and free
> the PAMT page without getting out of sync. But it leaves a race
> where a simultaneous caller could see the refcount already incremented and
> return before it is actually mapped.
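
A sketch of that rejected approach, for illustration only (pamt_add() is a
hypothetical stand-in for the PAMT.ADD path, not code from this series):

	/* Increment outside the lock; transition 0->1 means we must add PAMT. */
	if (atomic_inc_return(pamt_refcount) > 1)
		return 0;	/* Racy: the 0->1 caller may still be inside PAMT.ADD */

	scoped_guard(spinlock, &pamt_lock)
		pamt_add(hpa);	/* hypothetical PAMT.ADD wrapper */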
>
> So treat the refcount 0->1 case as a special case. On add, if the refcount
> is zero *don’t* increment the refcount outside the lock (to 1). Always
> take the lock in that case and only set the refcount to 1 after the PAMT
> is actually added. This way, when the PAMT page is not installed yet,
> simultaneous adders will all take the slow lock path.
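
So the allocation side ends up shaped roughly like this (simplified from the
first hunk below; error handling and freeing of the pre-allocated pages
omitted):

	if (atomic_inc_not_zero(pamt_refcount))
		return 0;			/* Fast path: PAMT already mapped */

	scoped_guard(spinlock, &pamt_lock) {
		if (atomic_read(pamt_refcount)) {
			/* Lost the race; another task already did PAMT.ADD */
			atomic_inc(pamt_refcount);
			return 0;
		}
		/* ... PAMT.ADD, then atomic_set(pamt_refcount, 1) on success ... */
	}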
>
> On the 1->0 case, it is ok to return from tdx_pamt_put() when the DPAMT is
> not actually freed yet, so the basic approach works. Just decrement the
> refcount before taking the lock. Only do the lock and removal of the PAMT
> when the refcount goes to zero.
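
And the free side (again simplified from the second hunk below):

	if (!atomic_dec_and_test(pamt_refcount))
		return;				/* Fast path: other users remain */

	scoped_guard(spinlock, &pamt_lock) {
		if (atomic_read(pamt_refcount))
			return;			/* Lost race with tdx_pamt_get() */
		/* ... PAMT.REMOVE; put the refcount back on failure ... */
	}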
>
> There is an asymmetry between tdx_pamt_get() and tdx_pamt_put() in that
> tdx_pamt_put() goes 1->0 outside the lock, but tdx_pamt_put() does 0-1
^
tdx_pamt_get() ?
> inside the lock. Because of this, there is a special race where
> tdx_pamt_put() could decrement the refcount to zero before the PAMT is
> actually removed, and tdx_pamt_get() could try to do a PAMT.ADD when the
> page is already mapped. Luckily the TDX module will return a special
> error that tells us we hit this case. So handle it specially by looking
> for the error code.
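
Concretely, the window is (CPU A in tdx_pamt_put(), CPU B in tdx_pamt_get()):

	CPU A: atomic_dec_and_test()		/* refcount 1 -> 0 */
	CPU B: atomic_inc_not_zero() fails	/* refcount is 0 */
	CPU B: lock; PAMT.ADD			/* -> TDX_HPA_RANGE_NOT_FREE, still mapped */
	CPU B: atomic_inc(); unlock		/* refcount 0 -> 1, reuse existing PAMT */
	CPU A: lock; sees refcount != 0		/* skips its pending PAMT.REMOVE */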
>
> The optimization is a little subtle, so make the code extra commented
> and verbose.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@...ux.intel.com>
> [Clean up code, update log]
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@...el.com>
> ---
> v3:
> - Split out optimization from “x86/virt/tdx: Add tdx_alloc/free_page() helpers”
> - Remove edge case handling that I could not find a reason for
> - Write log
> ---
> arch/x86/include/asm/shared/tdx_errno.h | 2 ++
> arch/x86/virt/vmx/tdx/tdx.c | 46 +++++++++++++++++++++----
> 2 files changed, 42 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/include/asm/shared/tdx_errno.h b/arch/x86/include/asm/shared/tdx_errno.h
> index 49ab7ecc7d54..4bc0b9c9e82b 100644
> --- a/arch/x86/include/asm/shared/tdx_errno.h
> +++ b/arch/x86/include/asm/shared/tdx_errno.h
> @@ -21,6 +21,7 @@
> #define TDX_PREVIOUS_TLB_EPOCH_BUSY 0x8000020100000000ULL
> #define TDX_RND_NO_ENTROPY 0x8000020300000000ULL
> #define TDX_PAGE_METADATA_INCORRECT 0xC000030000000000ULL
> +#define TDX_HPA_RANGE_NOT_FREE 0xC000030400000000ULL
> #define TDX_VCPU_NOT_ASSOCIATED 0x8000070200000000ULL
> #define TDX_KEY_GENERATION_FAILED 0x8000080000000000ULL
> #define TDX_KEY_STATE_INCORRECT 0xC000081100000000ULL
> @@ -100,6 +101,7 @@ DEFINE_TDX_ERRNO_HELPER(TDX_SUCCESS);
> DEFINE_TDX_ERRNO_HELPER(TDX_RND_NO_ENTROPY);
> DEFINE_TDX_ERRNO_HELPER(TDX_OPERAND_INVALID);
> DEFINE_TDX_ERRNO_HELPER(TDX_OPERAND_BUSY);
> +DEFINE_TDX_ERRNO_HELPER(TDX_HPA_RANGE_NOT_FREE);
> DEFINE_TDX_ERRNO_HELPER(TDX_VCPU_NOT_ASSOCIATED);
> DEFINE_TDX_ERRNO_HELPER(TDX_FLUSHVP_NOT_DONE);
>
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index af73b6c2e917..c25e238931a7 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -2117,7 +2117,7 @@ int tdx_pamt_get(struct page *page)
> u64 pamt_pa_array[MAX_DPAMT_ARG_SIZE];
> atomic_t *pamt_refcount;
> u64 tdx_status;
> - int ret;
> + int ret = 0;
>
> if (!tdx_supports_dynamic_pamt(&tdx_sysinfo))
> return 0;
> @@ -2128,14 +2128,40 @@ int tdx_pamt_get(struct page *page)
>
> pamt_refcount = tdx_find_pamt_refcount(hpa);
>
> + if (atomic_inc_not_zero(pamt_refcount))
> + goto out_free;
> +
> scoped_guard(spinlock, &pamt_lock) {
> - if (atomic_read(pamt_refcount))
> + /*
> + * Lost race to other tdx_pamt_add(). Other task has already allocated
> + * PAMT memory for the HPA.
> + */
> + if (atomic_read(pamt_refcount)) {
> + atomic_inc(pamt_refcount);
> goto out_free;
> + }
>
> tdx_status = tdh_phymem_pamt_add(hpa | TDX_PS_2M, pamt_pa_array);
>
> if (IS_TDX_SUCCESS(tdx_status)) {
> + /*
> + * The refcount is zero, and this locked path is the only way to
> +		 * increase it from 0->1. If the PAMT.ADD was successful, set it
> + * to 1 (obviously).
> + */
> + atomic_set(pamt_refcount, 1);
> + } else if (IS_TDX_HPA_RANGE_NOT_FREE(tdx_status)) {
> + /*
> + * Less obviously, another CPU's call to tdx_pamt_put() could have
> + * decremented the refcount before entering its lock section.
> +		 * In this case, the PAMT is not actually removed yet. Luckily the
> +		 * TDX module reports this case, so increment the refcount
> +		 * 0->1 so that tdx_pamt_put() skips its pending PAMT.REMOVE.
> + *
> + * The call didn't need the pages though, so free them.
> + */
> atomic_inc(pamt_refcount);
> + goto out_free;
> } else {
> pr_err("TDH_PHYMEM_PAMT_ADD failed: %#llx\n", tdx_status);
> goto out_free;
> @@ -2167,15 +2193,23 @@ void tdx_pamt_put(struct page *page)
>
> pamt_refcount = tdx_find_pamt_refcount(hpa);
>
> + /*
> + * Unlike the paired call in tdx_pamt_get(), decrement the refcount
> + * outside the lock even if it's not the special 0<->1 transition.
it's not -> it's ?
> + * See special logic around HPA_RANGE_NOT_FREE in tdx_pamt_get().
> + */
> + if (!atomic_dec_and_test(pamt_refcount))
> + return;
> +
> scoped_guard(spinlock, &pamt_lock) {
> - if (!atomic_read(pamt_refcount))
> + /* Lost race with tdx_pamt_get() */
> + if (atomic_read(pamt_refcount))
> return;
>
> tdx_status = tdh_phymem_pamt_remove(hpa | TDX_PS_2M, pamt_pa_array);
>
> - if (IS_TDX_SUCCESS(tdx_status)) {
> - atomic_dec(pamt_refcount);
> - } else {
> + if (!IS_TDX_SUCCESS(tdx_status)) {
> + atomic_inc(pamt_refcount);
> pr_err("TDH_PHYMEM_PAMT_REMOVE failed: %#llx\n", tdx_status);
> return;
> }