Message-ID: <pihazgqmsx4ltuvi2imgwsgvjsg2jsnxjnrdpxblwe2vc24opf@glsj2t3xosvb>
Date: Fri, 27 Jun 2025 14:27:48 +0300
From: "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>
To: Dave Hansen <dave.hansen@...el.com>
Cc: pbonzini@...hat.com, seanjc@...gle.com, dave.hansen@...ux.intel.com,
rick.p.edgecombe@...el.com, isaku.yamahata@...el.com, kai.huang@...el.com,
yan.y.zhao@...el.com, chao.gao@...el.com, tglx@...utronix.de, mingo@...hat.com,
bp@...en8.de, kvm@...r.kernel.org, x86@...nel.org, linux-coco@...ts.linux.dev,
linux-kernel@...r.kernel.org
Subject: Re: [PATCHv2 03/12] x86/virt/tdx: Allocate reference counters for
PAMT memory
On Wed, Jun 25, 2025 at 12:26:09PM -0700, Dave Hansen wrote:
> On 6/9/25 12:13, Kirill A. Shutemov wrote:
> > The PAMT memory holds metadata for TDX-protected memory. With Dynamic
> > PAMT, PAMT_4K is allocated on demand. The kernel supplies the TDX module
> > with a page pair that covers 2M of host physical memory.
> >
> > The kernel must provide this page pair before using pages from the range
> > for TDX. If this is not done, any SEAMCALL that attempts to use the
> > memory will fail.
> >
> > Allocate reference counters for every 2M range to track PAMT memory
> > usage. This is necessary to accurately determine when PAMT memory needs
> > to be allocated and when it can be freed.
> >
> > This allocation will consume 2MiB for every 1TiB of physical memory.
>
> ... and yes, this is another boot-time allocation that seems to be
> counter to the goal of reducing the boot-time TDX memory footprint.
>
> Please mention the 0.4%=>0.0004% overhead here in addition to the cover
> letter. It's important.
Okay.
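For completeness, the arithmetic behind the 2MiB figure: one 4-byte
atomic_t per 2M range, so 1TiB / 2MiB * sizeof(atomic_t) = 512K * 4
bytes = 2MiB of counters per 1TiB of physical memory.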
> > Tracking PAMT memory usage on the kernel side duplicates what TDX module
> > does. It is possible to avoid this by lazily allocating PAMT memory on
> > SEAMCALL failure and freeing it based on hints provided by the TDX
> > module when the last user of PAMT memory is no longer present.
> >
> > However, this approach complicates serialization.
> >
> > The TDX module takes locks when dealing with PAMT: a shared lock on any
> > SEAMCALL that uses explicit HPA and an exclusive lock on PAMT.ADD and
> > PAMT.REMOVE. Any SEAMCALL that uses explicit HPA as an operand may fail
> > if it races with PAMT.ADD/REMOVE.
> >
> > Since PAMT is a global resource, to prevent failure the kernel would
> > need global locking (per-TD is not sufficient). Or, it has to retry on
> > TDX_OPERAND_BUSY.
> >
> > Both options are not ideal, and tracking PAMT usage on the kernel side
> > seems like a reasonable alternative.
>
> Just a nit on changelog formatting: It would be ideal if you could make
> it totally clear that you are transitioning from "what this patch does"
> to "alternate considered designs".
Will do.
> > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > @@ -29,6 +29,7 @@
> > #include <linux/acpi.h>
> > #include <linux/suspend.h>
> > #include <linux/idr.h>
> > +#include <linux/vmalloc.h>
> > #include <asm/page.h>
> > #include <asm/special_insns.h>
> > #include <asm/msr-index.h>
> > @@ -50,6 +51,8 @@ static DEFINE_PER_CPU(bool, tdx_lp_initialized);
> >
> > static struct tdmr_info_list tdx_tdmr_list;
> >
> > +static atomic_t *pamt_refcounts;
>
> Comments, please. How big is this? When is it allocated?
>
> In this case, it's even sparse, right? That's *SUPER* unusual for a
> kernel data structure.
Will do.
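Something along these lines, perhaps:

	/*
	 * One reference counter for each 2M range of the physical address
	 * space, covering everything up to max_pfn (~2MiB of counters per
	 * 1TiB of memory). The vmalloc range is reserved in
	 * init_pamt_metadata() and populated lazily, one 4K page at a
	 * time, only for memory that TDX can use, so the array is sparse.
	 */
	static atomic_t *pamt_refcounts;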
> > static enum tdx_module_status_t tdx_module_status;
> > static DEFINE_MUTEX(tdx_module_lock);
> >
> > @@ -182,6 +185,102 @@ int tdx_cpu_enable(void)
> > }
> > EXPORT_SYMBOL_GPL(tdx_cpu_enable);
> >
> > +static atomic_t *tdx_get_pamt_refcount(unsigned long hpa)
> > +{
> > + return &pamt_refcounts[hpa / PMD_SIZE];
> > +}
>
> "get refcount" usually means "get a reference". This is looking up the
> location of the refcount.
>
> I think this needs a better name.
How about tdx_get_pamt_ref_ptr()?
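That is, the same lookup with a name that says "return a pointer":

	static atomic_t *tdx_get_pamt_ref_ptr(unsigned long hpa)
	{
		return &pamt_refcounts[hpa / PMD_SIZE];
	}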
> > +static int pamt_refcount_populate(pte_t *pte, unsigned long addr, void *data)
> > +{
>
> This is getting to be severely under-commented.
>
> I also got this far into the patch and I'd forgotten about the sparse
> allocation and was scratching my head about what pte's have to do with
> dynamically allocating part of the PAMT.
>
> That point to a pretty severe deficit in the cover letter, changelogs
> and comments leading up to this point.
Ack.
> > + unsigned long vaddr;
> > + pte_t entry;
> > +
> > + if (!pte_none(ptep_get(pte)))
> > + return 0;
>
> This ^ is an optimization, right? Could it be comment appropriately, please?
It is not an optimization.
Calls to apply_to_page_range() can overlap by one page due to the
round_up()/round_down() in alloc_pamt_refcount(). We don't need to
populate these pages again if they are already populated.
Will add a comment.
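Roughly:

	/*
	 * Consecutive calls to alloc_pamt_refcount() can cover overlapping
	 * pages of the refcount table because start/end get rounded out to
	 * page boundaries. Leave slots that a previous call already backed
	 * with a page alone.
	 */
	if (!pte_none(ptep_get(pte)))
		return 0;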
> > + vaddr = __get_free_page(GFP_KERNEL | __GFP_ZERO);
> > + if (!vaddr)
> > + return -ENOMEM;
> > +
> > + entry = pfn_pte(PFN_DOWN(__pa(vaddr)), PAGE_KERNEL);
> > +
> > + spin_lock(&init_mm.page_table_lock);
> > + if (pte_none(ptep_get(pte)))
> > + set_pte_at(&init_mm, addr, pte, entry);
> > + else
> > + free_page(vaddr);
> > + spin_unlock(&init_mm.page_table_lock);
> > +
> > + return 0;
> > +}
> > +
> > +static int pamt_refcount_depopulate(pte_t *pte, unsigned long addr,
> > + void *data)
> > +{
> > + unsigned long vaddr;
> > +
> > + vaddr = (unsigned long)__va(PFN_PHYS(pte_pfn(ptep_get(pte))));
>
> Gah, we really need a kpte_to_vaddr() helper here. This is really ugly.
> How many of these are in the tree?
I only found such a chain in the KASAN code.
What about this:

	pte_t entry = ptep_get(pte);
	struct page *page = pte_page(entry);

and then using __free_page(page) instead of free_page(vaddr)?
A similar thing can be done on the allocation side.
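Put together, something like this (untested):

	static int pamt_refcount_depopulate(pte_t *pte, unsigned long addr,
					    void *data)
	{
		pte_t entry = ptep_get(pte);
		/* Only valid if the PTE turns out to be populated */
		struct page *page = pte_page(entry);

		spin_lock(&init_mm.page_table_lock);
		if (!pte_none(ptep_get(pte))) {
			pte_clear(&init_mm, addr, pte);
			/* Free the page backing this refcount slot */
			__free_page(page);
		}
		spin_unlock(&init_mm.page_table_lock);

		return 0;
	}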
>
> > + spin_lock(&init_mm.page_table_lock);
> > + if (!pte_none(ptep_get(pte))) {
>
> Is there really a case where this gets called on unpopulated ptes? How?
On error, we free the metadata for the whole range up to max_pfn.
There is no tracking of which portion is populated.
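For reference, the error path is roughly this (function name is
illustrative):

	static void free_pamt_metadata(void)
	{
		size_t size = max_pfn / PTRS_PER_PTE * sizeof(*pamt_refcounts);

		if (!tdx_supports_dynamic_pamt(&tdx_sysinfo))
			return;

		size = round_up(size, PAGE_SIZE);
		/*
		 * Walk the whole reserved range; pamt_refcount_depopulate()
		 * skips the PTEs that were never populated.
		 */
		apply_to_existing_page_range(&init_mm,
					     (unsigned long)pamt_refcounts,
					     size, pamt_refcount_depopulate,
					     NULL);
		vfree(pamt_refcounts);
	}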
> > + pte_clear(&init_mm, addr, pte);
> > + free_page(vaddr);
> > + }
> > + spin_unlock(&init_mm.page_table_lock);
> > +
> > + return 0;
> > +}
> > +
> > +static int alloc_pamt_refcount(unsigned long start_pfn, unsigned long end_pfn)
> > +{
> > + unsigned long start, end;
> > +
> > + start = (unsigned long)tdx_get_pamt_refcount(PFN_PHYS(start_pfn));
> > + end = (unsigned long)tdx_get_pamt_refcount(PFN_PHYS(end_pfn + 1));
> > + start = round_down(start, PAGE_SIZE);
> > + end = round_up(end, PAGE_SIZE);
> > +
>
> Please try to vertically align these:
>
> start = (...)tdx_get_pamt_refcount(PFN_PHYS(start_pfn));
> end = (...)tdx_get_pamt_refcount(PFN_PHYS(end_pfn + 1));
> start = round_down(start, PAGE_SIZE);
> end = round_up( end, PAGE_SIZE);
>
Okay.
> > + return apply_to_page_range(&init_mm, start, end - start,
> > + pamt_refcount_populate, NULL);
> > +}
>
> But, I've staring at these for maybe 5 minutes. I think I've made sense
> of it.
>
> alloc_pamt_refcount() is taking a relatively arbitrary range of pfns.
> Those PFNs come from memory map and NUMA layout so they don't have any
> real alignment guarantees.
>
> This code translates the memory range into a range of virtual addresses
> in the *virtual* refcount table. That table is sparse and might not be
> allocated. It is populated 4k at a time and since the start/end_pfn
> don't have any alignment guarantees, there's no telling onto which page
> they map into the refcount table. This has to be conservative and round
> 'start' down and 'end' up. This might overlap with previous refcount
> table populations.
>
> Is that all correct?
Yes.
> That seems ... medium to high complexity to me. Is there some reason
> none of it is documented or commented? Like, I think it's not been
> mentioned a single time anywhere.
I found it understandable when I wrote it, but that was a misjudgement
on my part.
Will work on readability and comments.
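Something along these lines, with the rename from above (untested):

	static int alloc_pamt_refcount(unsigned long start_pfn, unsigned long end_pfn)
	{
		unsigned long start, end;

		/*
		 * Find the refcount slots that cover the PFN range. The range
		 * comes from the memory map and has no alignment guarantees,
		 * so the slots can start and end in the middle of a page of
		 * the sparse refcount table.
		 */
		start = (unsigned long)tdx_get_pamt_ref_ptr(PFN_PHYS(start_pfn));
		end   = (unsigned long)tdx_get_pamt_ref_ptr(PFN_PHYS(end_pfn + 1));

		/*
		 * Round out to whole pages. The result can overlap with ranges
		 * populated by previous calls; pamt_refcount_populate() skips
		 * slots that are already backed by a page.
		 */
		start = round_down(start, PAGE_SIZE);
		end   = round_up(end, PAGE_SIZE);

		return apply_to_page_range(&init_mm, start, end - start,
					   pamt_refcount_populate, NULL);
	}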
> > +static int init_pamt_metadata(void)
> > +{
> > + size_t size = max_pfn / PTRS_PER_PTE * sizeof(*pamt_refcounts);
> > + struct vm_struct *area;
> > +
> > + if (!tdx_supports_dynamic_pamt(&tdx_sysinfo))
> > + return 0;
> > +
> > + /*
> > + * Reserve vmalloc range for PAMT reference counters. It covers all
> > + * physical address space up to max_pfn. It is going to be populated
> > + * from init_tdmr() only for present memory that is available for TDX use.
> > + */
> > + area = get_vm_area(size, VM_IOREMAP);
> > + if (!area)
> > + return -ENOMEM;
> > +
> > + pamt_refcounts = area->addr;
> > + return 0;
> > +}
> Finally, we get to a description of what's actually going on. But, still
> nothing has told me why this is necessary directly.
>
> If it were me, I'd probably split this up into two patches. The first
> would just do:
>
> area = vmalloc(size);
>
> The second would do all the fancy sparse population.
Makes sense.
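So the first patch ends up with just the plain allocation, roughly
(vzalloc() rather than vmalloc() so the counters start out zeroed):

	static int init_pamt_metadata(void)
	{
		size_t size = max_pfn / PTRS_PER_PTE * sizeof(*pamt_refcounts);

		if (!tdx_supports_dynamic_pamt(&tdx_sysinfo))
			return 0;

		pamt_refcounts = vzalloc(size);
		if (!pamt_refcounts)
			return -ENOMEM;

		return 0;
	}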
> But either way, I've hit a wall on this. This is too impenetrable as it
> stands to review further. I'll eagerly await a more approachable v3.
Got it.
--
Kiryl Shutsemau / Kirill A. Shutemov