linux-kernel - Re: [PATCH v3 13/21] x86/virt/tdx: Allocate and set up PAMTs for TDMRs

Message-ID: <af603d66512ec5dca0c240cf81c83de7dfe730e7.camel@intel.com>
Date:   Mon, 02 May 2022 17:59:55 +1200
From:   Kai Huang <kai.huang@...el.com>
To:     Dave Hansen <dave.hansen@...el.com>, linux-kernel@...r.kernel.org,
        kvm@...r.kernel.org
Cc:     seanjc@...gle.com, pbonzini@...hat.com, len.brown@...el.com,
        tony.luck@...el.com, rafael.j.wysocki@...el.com,
        reinette.chatre@...el.com, dan.j.williams@...el.com,
        peterz@...radead.org, ak@...ux.intel.com,
        kirill.shutemov@...ux.intel.com,
        sathyanarayanan.kuppuswamy@...ux.intel.com,
        isaku.yamahata@...el.com
Subject: Re: [PATCH v3 13/21] x86/virt/tdx: Allocate and set up PAMTs for
 TDMRs

On Fri, 2022-04-29 at 07:20 -0700, Dave Hansen wrote:
> On 4/29/22 00:46, Kai Huang wrote:
> > On Thu, 2022-04-28 at 10:12 -0700, Dave Hansen wrote:
> > > This is also a good place to note the downsides of using
> > > alloc_contig_pages().
> > 
> > For instance:
> > 
> > 	The allocation may fail when memory usage is under pressure.
> 
> It's not really memory pressure, though.  The larger the allocation, the
> more likely it is to fail.  The more likely it is that the kernel can't
> free the memory or that if you need 1GB of contiguous memory that
> 999.996MB gets freed, but there is one stubborn page left.
> 
> alloc_contig_pages() can and will fail.  The only mitigation which is
> guaranteed to avoid this is doing the allocation at boot.  But, you're
> not doing that to avoid wasting memory on every TDX system that doesn't
> use TDX.
> 
> A *good* way (although not foolproof) is to launch a TDX VM early in
> boot before memory gets fragmented or consumed.  You might even want to
> recommend this in the documentation.

"launch a TDX VM early in boot" I suppose you mean having some boot-time service
that launches a TDX VM before the login interface comes up.  I'll put this in
the documentation.

How about adding the text below to the changelog:

"
However, using alloc_contig_pages() to allocate large physically contiguous
memory at runtime may fail.  The larger the allocation, the more likely it is to
fail.  Because of fragmentation, the kernel may need to move pages out of the
to-be-allocated contiguous memory range, and the allocation fails if even one
stubborn page cannot be moved.  A good mitigation (although not foolproof) is to
launch a TD VM early in boot so that the PAMTs are allocated before memory gets
fragmented or consumed.
"

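To make the "one stubborn page" failure mode concrete, here is a toy user-space
model in C.  It is only a sketch, not kernel code: the toy_page_state enum and
toy_find_contig() are hypothetical names, and the real alloc_contig_pages()
logic is far more involved.  The model treats a pinned page as one that cannot
be migrated, so any candidate range containing one is unusable:

```c
#include <assert.h>
#include <stddef.h>

/* Toy page states; purely illustrative, not the kernel's page flags. */
enum toy_page_state { TOY_FREE, TOY_MOVABLE, TOY_PINNED };

/*
 * Return the index of the first run of 'want' consecutive pages that
 * contains no pinned page, or -1 if no such run exists.  Free and
 * movable pages are both assumed to be reclaimable for the range.
 */
static long toy_find_contig(const enum toy_page_state *pages, size_t nr,
			    size_t want)
{
	size_t run = 0;

	assert(want > 0);
	for (size_t i = 0; i < nr; i++) {
		/* A pinned page ends the current candidate run. */
		run = (pages[i] == TOY_PINNED) ? 0 : run + 1;
		if (run == want)
			return (long)(i + 1 - want);
	}
	return -1;
}
```

With eight pages and a single pinned page in the middle, a 4-page request can
still succeed but a 5-page request cannot, even though seven of the eight pages
are free or movable.
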
> 
> > > > +/*
> > > > + * Locate the NUMA node containing the start of the given TDMR's first
> > > > + * RAM entry.  The given TDMR may also cover memory in other NUMA nodes.
> > > > + */
> > > 
> > > Please add a sentence or two on the implications here of what this means
> > > when it happens.  Also, the joining of e820 regions seems like it might
> > > span NUMA nodes.  What prevents that code from just creating one large
> > > e820 area that leads to one large TDMR and horrible NUMA affinity for
> > > these structures?
> > 
> > How about adding:
> > 
> > 	When a TDMR is created, it stops spanning at a NUMA boundary.
> 
> I actually don't know what that means at all.  I was thinking of
> something like this.
> 
> /*
>  * Pick a NUMA node on which to allocate this TDMR's metadata.
>  *
>  * This is imprecise since TDMRs are 1GB aligned and NUMA nodes might
>  * not be.  If the TDMR covers more than one node, just use the _first_
>  * one.  This can lead to small areas of off-node metadata for some
>  * memory.
>  */

Thanks.

> 
> > > > +static int tdmr_get_nid(struct tdmr_info *tdmr)
> > > > +{
> > > > +	u64 start, end;
> > > > +	int i;
> > > > +
> > > > +	/* Find the first RAM entry covered by the TDMR */
> 
> There's something else missing in here.  Why not just do:
> 
> 	return phys_to_target_node(TDMR_START(tdmr));
> 
> This would explain it:
> 
> 	/*
> 	 * The beginning of the TDMR might not point to RAM.
> 	 * Find its first RAM address, from which its node can
> 	 * be found.
> 	 */

Will use this.  Thanks.

> 
> > > > +	e820_for_each_mem(i, start, end)
> > > > +		if (end > TDMR_START(tdmr))
> > > > +			break;
> > > 
> > > Brackets around the big loop, please.
> > 
> > OK.
> > 
> > > 
> > > > +	/*
> > > > +	 * One TDMR must cover at least one (or partial) RAM entry,
> > > > +	 * otherwise it is kernel bug.  WARN_ON() in this case.
> > > > +	 */
> > > > +	if (WARN_ON_ONCE((start >= end) || start >= TDMR_END(tdmr)))
> > > > +		return 0;
> 
> This really means "no RAM found for this TDMR", right?  Can we say that,
> please.

OK will add it.  How about:

	/*
	 * No RAM found for this TDMR.  This cannot happen unless
	 * there is a kernel bug, so WARN() in this case.
	 */
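
For reference, the lookup being discussed can be modelled in user space roughly
as below.  This is a sketch under stated assumptions, not the patch itself:
struct mock_ram_entry, struct mock_tdmr and mock_tdmr_get_nid() are hypothetical
stand-ins for the e820 table, struct tdmr_info and tdmr_get_nid(), each RAM
entry is assumed to belong to a single node, and the sketch returns -1 for the
"no RAM found" case where the patch does WARN_ON_ONCE() and returns 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one RAM entry: [start, end) on node 'nid'. */
struct mock_ram_entry {
	uint64_t start;
	uint64_t end;
	int nid;
};

/* Hypothetical stand-in for a TDMR: covers [base, base + size). */
struct mock_tdmr {
	uint64_t base;
	uint64_t size;
};

/*
 * Pick a node for the TDMR's metadata: the node of the first RAM entry
 * that the TDMR covers (fully or partially).  'ram' must be sorted by
 * address.  Returns -1 if the TDMR covers no RAM at all.
 */
static int mock_tdmr_get_nid(const struct mock_tdmr *tdmr,
			     const struct mock_ram_entry *ram, size_t nr)
{
	uint64_t tdmr_start = tdmr->base;
	uint64_t tdmr_end = tdmr->base + tdmr->size;

	assert(tdmr->size > 0);
	for (size_t i = 0; i < nr; i++) {
		/* The beginning of the TDMR might not point to RAM. */
		if (ram[i].end <= tdmr_start)
			continue;
		/* First entry ending after the TDMR start: does it overlap? */
		if (ram[i].start >= tdmr_end)
			return -1;	/* no RAM found for this TDMR */
		return ram[i].nid;
	}
	return -1;	/* no RAM found for this TDMR */
}
```

A TDMR that starts in a hole still resolves to the node of the first RAM entry
it overlaps; a TDMR that lies entirely in a hole hits the "no RAM found" path.
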

> 
> 
> > > > +	/*
> > > > +	 * Allocate one chunk of physically contiguous memory for all
> > > > +	 * PAMTs.  This helps minimize the PAMT's use of reserved areas
> > > > +	 * in overlapped TDMRs.
> > > > +	 */
> > > 
> > > Ahh, this explains it.  Considering that tdmr_get_pamt_sz() is really
> > > > just two lines of code, I'd probably just remove the helper and open-code it
> > > here.  Then you only have one place to comment on it.
> > 
> > It has a loop and internally calls __tdmr_get_pamt_sz().  It doesn't look like
> > it fits if we open-code it here.
> > 
> > How about moving this comment to tdmr_get_pamt_sz()?
> 
> I thought about that.  But tdmr_get_pamt_sz() isn't itself doing any
> allocation so it doesn't make a whole lot of logical sense.  This is a
> place where a helper _can_ be removed.  Remove it, please.

OK.  Will remove the helper.  Thanks.
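
As a rough illustration of what the open-coded size calculation could look
like, here is a user-space sketch.  The names sketch_pamt_sz() and
sketch_total_pamt_sz() are hypothetical, and the formula is an assumption on my
side rather than something taken from this thread: one PAMT per page size (4K,
2M, 1G), each sized as the number of pages of that size in the TDMR times a
per-entry size, rounded up to 4K.  The entry size is passed in as a parameter
rather than hard-coded.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT	12
#define SKETCH_PAGE_SIZE	(1ULL << SKETCH_PAGE_SHIFT)

/*
 * Size of the PAMT for one page-size level (0 = 4K, 1 = 2M, 2 = 1G),
 * rounded up to a 4K boundary.  Assumed formula, see the lead-in.
 */
static uint64_t sketch_pamt_sz(uint64_t tdmr_size, int pgsz,
			       uint64_t entry_size)
{
	uint64_t nr_entries, sz;

	assert(pgsz >= 0 && pgsz <= 2);
	/* Each level covers 512 times more memory per entry. */
	nr_entries = tdmr_size >> (SKETCH_PAGE_SHIFT + 9 * pgsz);
	sz = nr_entries * entry_size;
	return (sz + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1);
}

/*
 * Total size of the single contiguous chunk that would hold all three
 * PAMTs of one TDMR, so only one allocation is needed.
 */
static uint64_t sketch_total_pamt_sz(uint64_t tdmr_size, uint64_t entry_size)
{
	uint64_t total = 0;

	for (int pgsz = 0; pgsz <= 2; pgsz++)
		total += sketch_pamt_sz(tdmr_size, pgsz, entry_size);
	return total;
}
```

Under these assumptions, and with a hypothetical 16-byte entry, a 1GB TDMR
would need 4MB + 8KB + 4KB in one chunk, which is the allocation the comment
above is about.
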

-- 
Thanks,
-Kai