Message-ID: <fb5addcb-1cfc-45be-978c-e7cee4126b38@intel.com>
Date: Wed, 25 Jun 2025 12:26:09 -0700
From: Dave Hansen <dave.hansen@...el.com>
To: "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
pbonzini@...hat.com, seanjc@...gle.com, dave.hansen@...ux.intel.com
Cc: rick.p.edgecombe@...el.com, isaku.yamahata@...el.com,
kai.huang@...el.com, yan.y.zhao@...el.com, chao.gao@...el.com,
tglx@...utronix.de, mingo@...hat.com, bp@...en8.de, kvm@...r.kernel.org,
x86@...nel.org, linux-coco@...ts.linux.dev, linux-kernel@...r.kernel.org
Subject: Re: [PATCHv2 03/12] x86/virt/tdx: Allocate reference counters for
PAMT memory
On 6/9/25 12:13, Kirill A. Shutemov wrote:
> The PAMT memory holds metadata for TDX-protected memory. With Dynamic
> PAMT, PAMT_4K is allocated on demand. The kernel supplies the TDX module
> with a page pair that covers 2M of host physical memory.
>
> The kernel must provide this page pair before using pages from the range
> for TDX. If this is not done, any SEAMCALL that attempts to use the
> memory will fail.
>
> Allocate reference counters for every 2M range to track PAMT memory
> usage. This is necessary to accurately determine when PAMT memory needs
> to be allocated and when it can be freed.
>
> This allocation will consume 2MiB for every 1TiB of physical memory.
... and yes, this is another boot-time allocation that seems to be
counter to the goal of reducing the boot-time TDX memory footprint.
Please mention the 0.4%=>0.0004% overhead here in addition to the cover
letter. It's important.
> Tracking PAMT memory usage on the kernel side duplicates what TDX module
> does. It is possible to avoid this by lazily allocating PAMT memory on
> SEAMCALL failure and freeing it based on hints provided by the TDX
> module when the last user of PAMT memory is no longer present.
>
> However, this approach complicates serialization.
>
> The TDX module takes locks when dealing with PAMT: a shared lock on any
> SEAMCALL that uses explicit HPA and an exclusive lock on PAMT.ADD and
> PAMT.REMOVE. Any SEAMCALL that uses explicit HPA as an operand may fail
> if it races with PAMT.ADD/REMOVE.
>
> Since PAMT is a global resource, to prevent failure the kernel would
> need global locking (per-TD is not sufficient). Or, it has to retry on
> TDX_OPERATOR_BUSY.
>
> Both options are not ideal, and tracking PAMT usage on the kernel side
> seems like a reasonable alternative.
Just a nit on changelog formatting: it would be ideal if you could make
it totally clear where you transition from "what this patch does"
to "alternative designs that were considered".
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -29,6 +29,7 @@
> #include <linux/acpi.h>
> #include <linux/suspend.h>
> #include <linux/idr.h>
> +#include <linux/vmalloc.h>
> #include <asm/page.h>
> #include <asm/special_insns.h>
> #include <asm/msr-index.h>
> @@ -50,6 +51,8 @@ static DEFINE_PER_CPU(bool, tdx_lp_initialized);
>
> static struct tdmr_info_list tdx_tdmr_list;
>
> +static atomic_t *pamt_refcounts;
Comments, please. How big is this? When is it allocated?
In this case, it's even sparse, right? That's *SUPER* unusual for a
kernel data structure.
> static enum tdx_module_status_t tdx_module_status;
> static DEFINE_MUTEX(tdx_module_lock);
>
> @@ -182,6 +185,102 @@ int tdx_cpu_enable(void)
> }
> EXPORT_SYMBOL_GPL(tdx_cpu_enable);
>
> +static atomic_t *tdx_get_pamt_refcount(unsigned long hpa)
> +{
> + return &pamt_refcounts[hpa / PMD_SIZE];
> +}
"get refcount" usually means "get a reference". This is looking up the
location of the refcount.
I think this needs a better name.
> +static int pamt_refcount_populate(pte_t *pte, unsigned long addr, void *data)
> +{
This is getting to be severely under-commented.
I also got this far into the patch and I'd forgotten about the sparse
allocation and was scratching my head about what pte's have to do with
dynamically allocating part of the PAMT.
That points to a pretty severe deficit in the cover letter, changelogs
and comments leading up to this point.
> + unsigned long vaddr;
> + pte_t entry;
> +
> + if (!pte_none(ptep_get(pte)))
> + return 0;
This ^ is an optimization, right? Could it be commented appropriately, please?
> + vaddr = __get_free_page(GFP_KERNEL | __GFP_ZERO);
> + if (!vaddr)
> + return -ENOMEM;
> +
> + entry = pfn_pte(PFN_DOWN(__pa(vaddr)), PAGE_KERNEL);
> +
> + spin_lock(&init_mm.page_table_lock);
> + if (pte_none(ptep_get(pte)))
> + set_pte_at(&init_mm, addr, pte, entry);
> + else
> + free_page(vaddr);
> + spin_unlock(&init_mm.page_table_lock);
> +
> + return 0;
> +}
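For anyone following along, this is the usual allocate-outside-the-lock,
recheck-under-the-lock pattern. A minimal userspace analogue (a pthread
mutex standing in for init_mm.page_table_lock, a plain pointer slot
standing in for the pte; names are made up for illustration):

```c
#include <pthread.h>
#include <stdlib.h>

static void *slot;  /* stands in for the pte */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Populate the slot if empty; tolerate losing a race to another thread. */
static int populate(void)
{
	void *page;

	if (slot)		/* optimization: already populated, skip the allocation */
		return 0;

	page = calloc(1, 4096);	/* allocated outside the lock */
	if (!page)
		return -1;

	pthread_mutex_lock(&lock);
	if (!slot)
		slot = page;	/* we won the race */
	else
		free(page);	/* someone else populated it first */
	pthread_mutex_unlock(&lock);

	return 0;
}
```

The unlocked pte_none() check in the quoted code plays the same role as
the `if (slot)` fast path here: it just avoids a pointless page
allocation when the refcount page already exists.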
> +
> +static int pamt_refcount_depopulate(pte_t *pte, unsigned long addr,
> + void *data)
> +{
> + unsigned long vaddr;
> +
> + vaddr = (unsigned long)__va(PFN_PHYS(pte_pfn(ptep_get(pte))));
Gah, we really need a kpte_to_vaddr() helper here. This is really ugly.
How many of these are in the tree?
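The chain is just "pfn out of the pte, back to a direct-map virtual
address". A hypothetical kpte_to_vaddr() could look like this (the name
is made up; the kernel macros are mocked in simplified userspace form,
with an illustrative direct-map base):

```c
#include <assert.h>

#define PAGE_SHIFT  12
#define PAGE_OFFSET 0xffff888000000000UL /* illustrative x86-64 direct-map base */

typedef struct { unsigned long pfn; } pte_t;	/* mock: pte holds only a pfn */

static unsigned long pte_pfn(pte_t pte)          { return pte.pfn; }
static unsigned long PFN_PHYS(unsigned long pfn) { return pfn << PAGE_SHIFT; }
static unsigned long __va(unsigned long pa)      { return PAGE_OFFSET + pa; }

/* The helper being asked for: kernel pte -> direct-map virtual address. */
static unsigned long kpte_to_vaddr(pte_t pte)
{
	return __va(PFN_PHYS(pte_pfn(pte)));
}
```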
> + spin_lock(&init_mm.page_table_lock);
> + if (!pte_none(ptep_get(pte))) {
Is there really a case where this gets called on unpopulated ptes? How?
> + pte_clear(&init_mm, addr, pte);
> + free_page(vaddr);
> + }
> + spin_unlock(&init_mm.page_table_lock);
> +
> + return 0;
> +}
> +
> +static int alloc_pamt_refcount(unsigned long start_pfn, unsigned long end_pfn)
> +{
> + unsigned long start, end;
> +
> + start = (unsigned long)tdx_get_pamt_refcount(PFN_PHYS(start_pfn));
> + end = (unsigned long)tdx_get_pamt_refcount(PFN_PHYS(end_pfn + 1));
> + start = round_down(start, PAGE_SIZE);
> + end = round_up(end, PAGE_SIZE);
> +
Please try to vertically align these:
start = (...)tdx_get_pamt_refcount(PFN_PHYS(start_pfn));
end = (...)tdx_get_pamt_refcount(PFN_PHYS(end_pfn + 1));
start = round_down(start, PAGE_SIZE);
end = round_up( end, PAGE_SIZE);
> + return apply_to_page_range(&init_mm, start, end - start,
> + pamt_refcount_populate, NULL);
> +}
But, I've been staring at these for maybe 5 minutes. I think I've made sense
of it.
alloc_pamt_refcount() takes a relatively arbitrary range of pfns.
Those PFNs come from the memory map and NUMA layout, so they don't have
any real alignment guarantees.
This code translates the memory range into a range of virtual addresses
in the *virtual* refcount table. That table is sparse and might not be
allocated. It is populated 4k at a time, and since start/end_pfn
have no alignment guarantees, there's no telling which pages of the
refcount table they map onto. So this has to be conservative and round
'start' down and 'end' up, which might overlap with previous refcount
table populations.
Is that all correct?
That seems ... medium to high complexity to me. Is there some reason
none of it is documented or commented? Like, I think it's not been
mentioned a single time anywhere.
> +static int init_pamt_metadata(void)
> +{
> + size_t size = max_pfn / PTRS_PER_PTE * sizeof(*pamt_refcounts);
> + struct vm_struct *area;
> +
> + if (!tdx_supports_dynamic_pamt(&tdx_sysinfo))
> + return 0;
> +
> + /*
> + * Reserve vmalloc range for PAMT reference counters. It covers all
> + * physical address space up to max_pfn. It is going to be populated
> + * from init_tdmr() only for present memory that available for TDX use.
> + */
> + area = get_vm_area(size, VM_IOREMAP);
> + if (!area)
> + return -ENOMEM;
> +
> + pamt_refcounts = area->addr;
> + return 0;
> +}
Finally, we get to a description of what's actually going on. But still,
nothing has directly told me why any of this is necessary.
If it were me, I'd probably split this up into two patches. The first
would just do:
area = vmalloc(size);
The second would do all the fancy sparse population.
But either way, I've hit a wall on this. This is too impenetrable as it
stands to review further. I'll eagerly await a more approachable v3.