Message-ID: <pihazgqmsx4ltuvi2imgwsgvjsg2jsnxjnrdpxblwe2vc24opf@glsj2t3xosvb>
Date: Fri, 27 Jun 2025 14:27:48 +0300
From: "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>
To: Dave Hansen <dave.hansen@...el.com>
Cc: pbonzini@...hat.com, seanjc@...gle.com, dave.hansen@...ux.intel.com,
rick.p.edgecombe@...el.com, isaku.yamahata@...el.com, kai.huang@...el.com,
yan.y.zhao@...el.com, chao.gao@...el.com, tglx@...utronix.de, mingo@...hat.com,
bp@...en8.de, kvm@...r.kernel.org, x86@...nel.org, linux-coco@...ts.linux.dev,
linux-kernel@...r.kernel.org
Subject: Re: [PATCHv2 03/12] x86/virt/tdx: Allocate reference counters for
PAMT memory
On Wed, Jun 25, 2025 at 12:26:09PM -0700, Dave Hansen wrote:
> On 6/9/25 12:13, Kirill A. Shutemov wrote:
> > The PAMT memory holds metadata for TDX-protected memory. With Dynamic
> > PAMT, PAMT_4K is allocated on demand. The kernel supplies the TDX module
> > with a page pair that covers 2M of host physical memory.
> >
> > The kernel must provide this page pair before using pages from the range
> > for TDX. If this is not done, any SEAMCALL that attempts to use the
> > memory will fail.
> >
> > Allocate reference counters for every 2M range to track PAMT memory
> > usage. This is necessary to accurately determine when PAMT memory needs
> > to be allocated and when it can be freed.
> >
> > This allocation will consume 2MiB for every 1TiB of physical memory.
>
> ... and yes, this is another boot-time allocation that seems to be
> counter to the goal of reducing the boot-time TDX memory footprint.
>
> Please mention the 0.4%=>0.0004% overhead here in addition to the cover
> letter. It's important.
Okay.
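For completeness, the arithmetic behind the 2MiB figure: one 4-byte
atomic_t per 2M range, so 1TiB / 2MiB * sizeof(atomic_t) = 512K * 4
bytes = 2MiB of counters per 1TiB of physical memory.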
> > Tracking PAMT memory usage on the kernel side duplicates what TDX module
> > does. It is possible to avoid this by lazily allocating PAMT memory on
> > SEAMCALL failure and freeing it based on hints provided by the TDX
> > module when the last user of PAMT memory is no longer present.
> >
> > However, this approach complicates serialization.
> >
> > The TDX module takes locks when dealing with PAMT: a shared lock on any
> > SEAMCALL that uses explicit HPA and an exclusive lock on PAMT.ADD and
> > PAMT.REMOVE. Any SEAMCALL that uses explicit HPA as an operand may fail
> > if it races with PAMT.ADD/REMOVE.
> >
> > Since PAMT is a global resource, to prevent failure the kernel would
> > need global locking (per-TD is not sufficient). Or, it has to retry on
> > TDX_OPERAND_BUSY.
> >
> > Both options are not ideal, and tracking PAMT usage on the kernel side
> > seems like a reasonable alternative.
>
> Just a nit on changelog formatting: It would be ideal if you could make
> it totally clear that you are transitioning from "what this patch does"
> to "alternate considered designs".
Will do.
> > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > @@ -29,6 +29,7 @@
> > #include <linux/acpi.h>
> > #include <linux/suspend.h>
> > #include <linux/idr.h>
> > +#include <linux/vmalloc.h>
> > #include <asm/page.h>
> > #include <asm/special_insns.h>
> > #include <asm/msr-index.h>
> > @@ -50,6 +51,8 @@ static DEFINE_PER_CPU(bool, tdx_lp_initialized);
> >
> > static struct tdmr_info_list tdx_tdmr_list;
> >
> > +static atomic_t *pamt_refcounts;
>
> Comments, please. How big is this? When is it allocated?
>
> In this case, it's even sparse, right? That's *SUPER* unusual for a
> kernel data structure.
Will do.
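Something along these lines, perhaps:

	/*
	 * One reference counter for each 2M range of the physical address
	 * space, covering everything up to max_pfn (~2MiB of counters per
	 * 1TiB of memory). The vmalloc range is reserved in
	 * init_pamt_metadata() and populated lazily, one 4K page at a
	 * time, only for memory that TDX can use, so the array is sparse.
	 */
	static atomic_t *pamt_refcounts;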
> > static enum tdx_module_status_t tdx_module_status;
> > static DEFINE_MUTEX(tdx_module_lock);
> >
> > @@ -182,6 +185,102 @@ int tdx_cpu_enable(void)
> > }
> > EXPORT_SYMBOL_GPL(tdx_cpu_enable);
> >
> > +static atomic_t *tdx_get_pamt_refcount(unsigned long hpa)
> > +{
> > + return &pamt_refcounts[hpa / PMD_SIZE];
> > +}
>
> "get refcount" usually means "get a reference". This is looking up the
> location of the refcount.
>
> I think this needs a better name.
How about tdx_get_pamt_ref_ptr()?
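That is, the same lookup with a name that says "return a pointer":

	static atomic_t *tdx_get_pamt_ref_ptr(unsigned long hpa)
	{
		return &pamt_refcounts[hpa / PMD_SIZE];
	}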
> > +static int pamt_refcount_populate(pte_t *pte, unsigned long addr, void *data)
> > +{
>
> This is getting to be severely under-commented.
>
> I also got this far into the patch and I'd forgotten about the sparse
> allocation and was scratching my head about what pte's have to do with
> dynamically allocating part of the PAMT.
>
> That point to a pretty severe deficit in the cover letter, changelogs
> and comments leading up to this point.
Ack.
> > + unsigned long vaddr;
> > + pte_t entry;
> > +
> > + if (!pte_none(ptep_get(pte)))
> > + return 0;
>
> This ^ is an optimization, right? Could it be comment appropriately, please?
It is not an optimization.
Calls to apply_to_page_range() can overlap by one page due to the
round_up()/round_down() in alloc_pamt_refcount(). We don't need to
populate these pages again if they are already populated.
Will add a comment.
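Roughly:

	/*
	 * Consecutive calls to alloc_pamt_refcount() can cover overlapping
	 * pages of the refcount table because start/end get rounded out to
	 * page boundaries. Leave slots that a previous call already backed
	 * with a page alone.
	 */
	if (!pte_none(ptep_get(pte)))
		return 0;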
> > + vaddr = __get_free_page(GFP_KERNEL | __GFP_ZERO);
> > + if (!vaddr)
> > + return -ENOMEM;
> > +
> > + entry = pfn_pte(PFN_DOWN(__pa(vaddr)), PAGE_KERNEL);
> > +
> > + spin_lock(&init_mm.page_table_lock);
> > + if (pte_none(ptep_get(pte)))
> > + set_pte_at(&init_mm, addr, pte, entry);
> > + else
> > + free_page(vaddr);
> > + spin_unlock(&init_mm.page_table_lock);
> > +
> > + return 0;
> > +}
> > +
> > +static int pamt_refcount_depopulate(pte_t *pte, unsigned long addr,
> > + void *data)
> > +{
> > + unsigned long vaddr;
> > +
> > + vaddr = (unsigned long)__va(PFN_PHYS(pte_pfn(ptep_get(pte))));
>
> Gah, we really need a kpte_to_vaddr() helper here. This is really ugly.
> How many of these are in the tree?
I only found such a chain in the KASAN code.
What about this:

	pte_t entry = ptep_get(pte);
	struct page *page = pte_page(entry);

and then using __free_page(page) instead of free_page(vaddr)?
A similar thing can be done on the allocation side.
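Put together, something like this (untested):

	static int pamt_refcount_depopulate(pte_t *pte, unsigned long addr,
					    void *data)
	{
		pte_t entry = ptep_get(pte);
		/* Only valid if the PTE turns out to be populated */
		struct page *page = pte_page(entry);

		spin_lock(&init_mm.page_table_lock);
		if (!pte_none(ptep_get(pte))) {
			pte_clear(&init_mm, addr, pte);
			/* Free the page backing this refcount slot */
			__free_page(page);
		}
		spin_unlock(&init_mm.page_table_lock);

		return 0;
	}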
>
> > + spin_lock(&init_mm.page_table_lock);
> > + if (!pte_none(ptep_get(pte))) {
>
> Is there really a case where this gets called on unpopulated ptes? How?
On error, we free the metadata for the whole range up to max_pfn.
There is no tracking of which portion is populated.
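For reference, the error path is roughly this (function name is
illustrative):

	static void free_pamt_metadata(void)
	{
		size_t size = max_pfn / PTRS_PER_PTE * sizeof(*pamt_refcounts);

		if (!tdx_supports_dynamic_pamt(&tdx_sysinfo))
			return;

		size = round_up(size, PAGE_SIZE);
		/*
		 * Walk the whole reserved range; pamt_refcount_depopulate()
		 * skips the PTEs that were never populated.
		 */
		apply_to_existing_page_range(&init_mm,
					     (unsigned long)pamt_refcounts,
					     size, pamt_refcount_depopulate,
					     NULL);
		vfree(pamt_refcounts);
	}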
> > + pte_clear(&init_mm, addr, pte);
> > + free_page(vaddr);
> > + }
> > + spin_unlock(&init_mm.page_table_lock);
> > +
> > + return 0;
> > +}
> > +
> > +static int alloc_pamt_refcount(unsigned long start_pfn, unsigned long end_pfn)
> > +{
> > + unsigned long start, end;
> > +
> > + start = (unsigned long)tdx_get_pamt_refcount(PFN_PHYS(start_pfn));
> > + end = (unsigned long)tdx_get_pamt_refcount(PFN_PHYS(end_pfn + 1));
> > + start = round_down(start, PAGE_SIZE);
> > + end = round_up(end, PAGE_SIZE);
> > +
>
> Please try to vertically align these:
>
> start = (...)tdx_get_pamt_refcount(PFN_PHYS(start_pfn));
> end = (...)tdx_get_pamt_refcount(PFN_PHYS(end_pfn + 1));
> start = round_down(start, PAGE_SIZE);
> end = round_up( end, PAGE_SIZE);
>
Okay.
> > + return apply_to_page_range(&init_mm, start, end - start,
> > + pamt_refcount_populate, NULL);
> > +}
>
> But, I've staring at these for maybe 5 minutes. I think I've made sense
> of it.
>
> alloc_pamt_refcount() is taking a relatively arbitrary range of pfns.
> Those PFNs come from memory map and NUMA layout so they don't have any
> real alignment guarantees.
>
> This code translates the memory range into a range of virtual addresses
> in the *virtual* refcount table. That table is sparse and might not be
> allocated. It is populated 4k at a time and since the start/end_pfn
> don't have any alignment guarantees, there's no telling onto which page
> they map into the refcount table. This has to be conservative and round
> 'start' down and 'end' up. This might overlap with previous refcount
> table populations.
>
> Is that all correct?
Yes.
> That seems ... medium to high complexity to me. Is there some reason
> none of it is documented or commented? Like, I think it's not been
> mentioned a single time anywhere.
I found it understandable when I wrote it, but that was a misjudgement
on my part.
Will work on readability and comments.
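Something along these lines, with the rename from above (untested):

	static int alloc_pamt_refcount(unsigned long start_pfn, unsigned long end_pfn)
	{
		unsigned long start, end;

		/*
		 * Find the refcount slots that cover the PFN range. The range
		 * comes from the memory map and has no alignment guarantees,
		 * so the slots can start and end in the middle of a page of
		 * the sparse refcount table.
		 */
		start = (unsigned long)tdx_get_pamt_ref_ptr(PFN_PHYS(start_pfn));
		end   = (unsigned long)tdx_get_pamt_ref_ptr(PFN_PHYS(end_pfn + 1));

		/*
		 * Round out to whole pages. The result can overlap with ranges
		 * populated by previous calls; pamt_refcount_populate() skips
		 * slots that are already backed by a page.
		 */
		start = round_down(start, PAGE_SIZE);
		end   = round_up(end, PAGE_SIZE);

		return apply_to_page_range(&init_mm, start, end - start,
					   pamt_refcount_populate, NULL);
	}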
> > +static int init_pamt_metadata(void)
> > +{
> > + size_t size = max_pfn / PTRS_PER_PTE * sizeof(*pamt_refcounts);
> > + struct vm_struct *area;
> > +
> > + if (!tdx_supports_dynamic_pamt(&tdx_sysinfo))
> > + return 0;
> > +
> > + /*
> > + * Reserve vmalloc range for PAMT reference counters. It covers all
> > + * physical address space up to max_pfn. It is going to be populated
> > + * from init_tdmr() only for present memory that is available for TDX use.
> > + */
> > + area = get_vm_area(size, VM_IOREMAP);
> > + if (!area)
> > + return -ENOMEM;
> > +
> > + pamt_refcounts = area->addr;
> > + return 0;
> > +}
> Finally, we get to a description of what's actually going on. But, still
> nothing has told me why this is necessary directly.
>
> If it were me, I'd probably split this up into two patches. The first
> would just do:
>
> area = vmalloc(size);
>
> The second would do all the fancy sparse population.
Makes sense.
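So the first patch ends up with just the plain allocation, roughly
(vzalloc() rather than vmalloc() so the counters start out zeroed):

	static int init_pamt_metadata(void)
	{
		size_t size = max_pfn / PTRS_PER_PTE * sizeof(*pamt_refcounts);

		if (!tdx_supports_dynamic_pamt(&tdx_sysinfo))
			return 0;

		pamt_refcounts = vzalloc(size);
		if (!pamt_refcounts)
			return -ENOMEM;

		return 0;
	}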
> But either way, I've hit a wall on this. This is too impenetrable as it
> stands to review further. I'll eagerly await a more approachable v3.
Got it.
--
Kiryl Shutsemau / Kirill A. Shutemov