Message-ID: <86ab9923-624d-4950-abea-46780e94c6ce@linux.intel.com>
Date: Wed, 24 Sep 2025 14:15:11 +0800
From: Binbin Wu <binbin.wu@...ux.intel.com>
To: Rick Edgecombe <rick.p.edgecombe@...el.com>
Cc: kas@...nel.org, bp@...en8.de, chao.gao@...el.com,
dave.hansen@...ux.intel.com, isaku.yamahata@...el.com, kai.huang@...el.com,
kvm@...r.kernel.org, linux-coco@...ts.linux.dev,
linux-kernel@...r.kernel.org, mingo@...hat.com, pbonzini@...hat.com,
seanjc@...gle.com, tglx@...utronix.de, x86@...nel.org, yan.y.zhao@...el.com,
vannapurve@...gle.com, "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>
Subject: Re: [PATCH v3 08/16] x86/virt/tdx: Optimize tdx_alloc/free_page()
helpers
On 9/19/2025 7:22 AM, Rick Edgecombe wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>
>
> Optimize the PAMT alloc/free helpers to avoid taking the global lock when
> possible.
>
> The recently introduced PAMT alloc/free helpers maintain a refcount to
> keep track of when it is ok to reclaim and free a 4KB PAMT page. This
> refcount is protected by a global lock in order to guarantee that races
> don’t result in the PAMT getting freed while another caller requests it
> be mapped. But a global lock is a bit heavyweight, especially since the
> refcounts can be (already are) updated atomically.
>
> A simple approach would be to increment/decrement the refcount outside of
> the lock before actually adjusting the PAMT, and only adjust the PAMT if
> the refcount transitions from/to 0. This would correctly allocate and free
> the PAMT page without getting out of sync. But it leaves a race
> where a simultaneous caller could see the refcount already incremented and
> return before it is actually mapped.
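
A sketch of that rejected approach, for illustration only (pamt_add() is a
hypothetical stand-in for the PAMT.ADD path, not code from this series):

	/* Increment outside the lock; transition 0->1 means we must add PAMT. */
	if (atomic_inc_return(pamt_refcount) > 1)
		return 0;	/* Racy: the 0->1 caller may still be inside PAMT.ADD */

	scoped_guard(spinlock, &pamt_lock)
		pamt_add(hpa);	/* hypothetical PAMT.ADD wrapper */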
>
> So treat the refcount 0->1 case as a special case. On add, if the refcount
> is zero *don’t* increment the refcount outside the lock (to 1). Always
> take the lock in that case and only set the refcount to 1 after the PAMT
> is actually added. This way, when the PAMT page is not installed yet,
> simultaneous adders will all take the slow lock path.
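
So the allocation side ends up shaped roughly like this (simplified from the
first hunk below; error handling and freeing of the pre-allocated pages
omitted):

	if (atomic_inc_not_zero(pamt_refcount))
		return 0;			/* Fast path: PAMT already mapped */

	scoped_guard(spinlock, &pamt_lock) {
		if (atomic_read(pamt_refcount)) {
			/* Lost the race; another task already did PAMT.ADD */
			atomic_inc(pamt_refcount);
			return 0;
		}
		/* ... PAMT.ADD, then atomic_set(pamt_refcount, 1) on success ... */
	}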
>
> On the 1->0 case, it is ok to return from tdx_pamt_put() when the DPAMT is
> not actually freed yet, so the basic approach works. Just decrement the
> refcount before taking the lock. Only do the lock and removal of the PAMT
> when the refcount goes to zero.
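
And the free side (again simplified from the second hunk below):

	if (!atomic_dec_and_test(pamt_refcount))
		return;				/* Fast path: other users remain */

	scoped_guard(spinlock, &pamt_lock) {
		if (atomic_read(pamt_refcount))
			return;			/* Lost race with tdx_pamt_get() */
		/* ... PAMT.REMOVE; put the refcount back on failure ... */
	}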
>
> There is an asymmetry between tdx_pamt_get() and tdx_pamt_put() in that
> tdx_pamt_put() goes 1->0 outside the lock, but tdx_pamt_put() does 0-1
^
tdx_pamt_get() ?
> inside the lock. Because of this, there is a special race where
> tdx_pamt_put() could decrement the refcount to zero before the PAMT is
> actually removed, and tdx_pamt_get() could try to do a PAMT.ADD when the
> page is already mapped. Luckily the TDX module will return a special
> error that tells us we hit this case. So handle it specially by looking
> for the error code.
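
Concretely, the window is (CPU A in tdx_pamt_put(), CPU B in tdx_pamt_get()):

	CPU A: atomic_dec_and_test()		/* refcount 1 -> 0 */
	CPU B: atomic_inc_not_zero() fails	/* refcount is 0 */
	CPU B: lock; PAMT.ADD			/* -> TDX_HPA_RANGE_NOT_FREE, still mapped */
	CPU B: atomic_inc(); unlock		/* refcount 0 -> 1, reuse existing PAMT */
	CPU A: lock; sees refcount != 0		/* skips its pending PAMT.REMOVE */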
>
> The optimization is a little subtle, so make the code extra commented
> and verbose.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@...ux.intel.com>
> [Clean up code, update log]
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@...el.com>
> ---
> v3:
> - Split out optimization from “x86/virt/tdx: Add tdx_alloc/free_page() helpers”
> - Remove edge case handling that I could not find a reason for
> - Write log
> ---
> arch/x86/include/asm/shared/tdx_errno.h | 2 ++
> arch/x86/virt/vmx/tdx/tdx.c | 46 +++++++++++++++++++++----
> 2 files changed, 42 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/include/asm/shared/tdx_errno.h b/arch/x86/include/asm/shared/tdx_errno.h
> index 49ab7ecc7d54..4bc0b9c9e82b 100644
> --- a/arch/x86/include/asm/shared/tdx_errno.h
> +++ b/arch/x86/include/asm/shared/tdx_errno.h
> @@ -21,6 +21,7 @@
> #define TDX_PREVIOUS_TLB_EPOCH_BUSY 0x8000020100000000ULL
> #define TDX_RND_NO_ENTROPY 0x8000020300000000ULL
> #define TDX_PAGE_METADATA_INCORRECT 0xC000030000000000ULL
> +#define TDX_HPA_RANGE_NOT_FREE 0xC000030400000000ULL
> #define TDX_VCPU_NOT_ASSOCIATED 0x8000070200000000ULL
> #define TDX_KEY_GENERATION_FAILED 0x8000080000000000ULL
> #define TDX_KEY_STATE_INCORRECT 0xC000081100000000ULL
> @@ -100,6 +101,7 @@ DEFINE_TDX_ERRNO_HELPER(TDX_SUCCESS);
> DEFINE_TDX_ERRNO_HELPER(TDX_RND_NO_ENTROPY);
> DEFINE_TDX_ERRNO_HELPER(TDX_OPERAND_INVALID);
> DEFINE_TDX_ERRNO_HELPER(TDX_OPERAND_BUSY);
> +DEFINE_TDX_ERRNO_HELPER(TDX_HPA_RANGE_NOT_FREE);
> DEFINE_TDX_ERRNO_HELPER(TDX_VCPU_NOT_ASSOCIATED);
> DEFINE_TDX_ERRNO_HELPER(TDX_FLUSHVP_NOT_DONE);
>
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index af73b6c2e917..c25e238931a7 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -2117,7 +2117,7 @@ int tdx_pamt_get(struct page *page)
> u64 pamt_pa_array[MAX_DPAMT_ARG_SIZE];
> atomic_t *pamt_refcount;
> u64 tdx_status;
> - int ret;
> + int ret = 0;
>
> if (!tdx_supports_dynamic_pamt(&tdx_sysinfo))
> return 0;
> @@ -2128,14 +2128,40 @@ int tdx_pamt_get(struct page *page)
>
> pamt_refcount = tdx_find_pamt_refcount(hpa);
>
> + if (atomic_inc_not_zero(pamt_refcount))
> + goto out_free;
> +
> scoped_guard(spinlock, &pamt_lock) {
> - if (atomic_read(pamt_refcount))
> + /*
> + * Lost race to other tdx_pamt_add(). Other task has already allocated
> + * PAMT memory for the HPA.
> + */
> + if (atomic_read(pamt_refcount)) {
> + atomic_inc(pamt_refcount);
> goto out_free;
> + }
>
> tdx_status = tdh_phymem_pamt_add(hpa | TDX_PS_2M, pamt_pa_array);
>
> if (IS_TDX_SUCCESS(tdx_status)) {
> + /*
> + * The refcount is zero, and this locked path is the only way to
> +		 * increase it from 0->1. If the PAMT.ADD was successful, set it
> + * to 1 (obviously).
> + */
> + atomic_set(pamt_refcount, 1);
> + } else if (IS_TDX_HPA_RANGE_NOT_FREE(tdx_status)) {
> + /*
> + * Less obviously, another CPU's call to tdx_pamt_put() could have
> + * decremented the refcount before entering its lock section.
> +		 * In this case, the PAMT is not actually removed yet. Luckily the
> +		 * TDX module reports this case, so increment the refcount
> +		 * 0->1 so that tdx_pamt_put() skips its pending PAMT.REMOVE.
> + *
> + * The call didn't need the pages though, so free them.
> + */
> atomic_inc(pamt_refcount);
> + goto out_free;
> } else {
> pr_err("TDH_PHYMEM_PAMT_ADD failed: %#llx\n", tdx_status);
> goto out_free;
> @@ -2167,15 +2193,23 @@ void tdx_pamt_put(struct page *page)
>
> pamt_refcount = tdx_find_pamt_refcount(hpa);
>
> + /*
> + * Unlike the paired call in tdx_pamt_get(), decrement the refcount
> + * outside the lock even if it's not the special 0<->1 transition.
it's not -> it's ?
> + * See special logic around HPA_RANGE_NOT_FREE in tdx_pamt_get().
> + */
> + if (!atomic_dec_and_test(pamt_refcount))
> + return;
> +
> scoped_guard(spinlock, &pamt_lock) {
> - if (!atomic_read(pamt_refcount))
> + /* Lost race with tdx_pamt_get() */
> + if (atomic_read(pamt_refcount))
> return;
>
> tdx_status = tdh_phymem_pamt_remove(hpa | TDX_PS_2M, pamt_pa_array);
>
> - if (IS_TDX_SUCCESS(tdx_status)) {
> - atomic_dec(pamt_refcount);
> - } else {
> + if (!IS_TDX_SUCCESS(tdx_status)) {
> + atomic_inc(pamt_refcount);
> pr_err("TDH_PHYMEM_PAMT_REMOVE failed: %#llx\n", tdx_status);
> return;
> }