[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4b30520d-f3fa-4806-9d58-176adb8791a6@intel.com>
Date: Fri, 30 Aug 2024 13:50:06 +0300
From: Adrian Hunter <adrian.hunter@...el.com>
To: Kai Huang <kai.huang@...el.com>, dave.hansen@...el.com,
kirill.shutemov@...ux.intel.com, tglx@...utronix.de, bp@...en8.de,
peterz@...radead.org, mingo@...hat.com, hpa@...or.com,
dan.j.williams@...el.com, seanjc@...gle.com, pbonzini@...hat.com
Cc: x86@...nel.org, linux-kernel@...r.kernel.org, kvm@...r.kernel.org,
rick.p.edgecombe@...el.com, isaku.yamahata@...el.com, chao.gao@...el.com,
binbin.wu@...ux.intel.com
Subject: Re: [PATCH v3 7/8] x86/virt/tdx: Reduce TDMR's reserved areas by
using CMRs to find memory holes
On 27/08/24 10:14, Kai Huang wrote:
> A TDX module initialization failure was reported on a Emerald Rapids
> platform:
>
> virt/tdx: initialization failed: TDMR [0x0, 0x80000000): reserved areas exhausted.
> virt/tdx: module initialization failed (-28)
>
> As part of initializing the TDX module, the kernel informs the TDX
> module of all "TDX-usable memory regions" using an array of TDX defined
> structure "TD Memory Region" (TDMR). Each TDMR must be in 1GB aligned
> and in 1GB granularity, and all "non-TDX-usable memory holes" within a
> given TDMR must be marked as "reserved areas". The TDX module reports a
> maximum number of reserved areas that can be supported per TDMR.
The statement:
... all "non-TDX-usable memory holes" within a
given TDMR must be marked as "reserved areas".
is not exactly true, which is essentially the basis of this fix.
The relevant requirements are (from the spec):
Any non-reserved 4KB page within a TDMR must be convertible
i.e., it must be within a CMR
Reserved areas within a TDMR need not be within a CMR.
PAMT areas must not overlap with TDMR non-reserved areas;
however, they may reside within TDMR reserved areas
(as long as these are convertible).
>
> Currently, the kernel finds those "non-TDX-usable memory holes" within a
> given TDMR by walking over a list of "TDX-usable memory regions", which
> essentially reflects the "usable" regions in the e820 table (w/o memory
> hotplug operations precisely, but this is not relevant here).
But including e820 table regions that are not "usable" in the TDMR
reserved areas is not necessary - it is not one of the rules.
What confused me initially was that I did not realize the we already
require that the TDX Module does not touch memory in the TDMR
non-reserved areas not specifically allocated to it. So it makes no
difference to the TDX Module what the pages that have not been allocated
to it, are used for.
>
> As shown above, the root cause of this failure is when the kernel tries
> to construct a TDMR to cover address range [0x0, 0x80000000), there
> are too many memory holes within that range and the number of memory
> holes exceeds the maximum number of reserved areas.
>
> The E820 table of that platform (see [1] below) reflects this: the
> number of memory holes among e820 "usable" entries exceeds 16, which is
> the maximum number of reserved areas TDX module supports in practice.
>
> === Fix ===
>
> There are two options to fix this: 1) reduce the number of memory holes
> when constructing a TDMR to save "reserved areas"; 2) reduce the TDMR's
> size to cover fewer memory regions, thus fewer memory holes.
Probably better to try and get rid of this "two options" stuff and focus
on how this is a simple and effective fix.
>
> Option 1) is possible, and in fact is easier and preferable:
>
> TDX actually has a concept of "Convertible Memory Regions" (CMRs). TDX
> reports a list of CMRs that meet TDX's security requirements on memory.
> TDX requires all the "TDX-usable memory regions" that the kernel passes
> to the module via TDMRs, a.k.a, all the "non-reserved regions in TDMRs",
> must be convertible memory.
>
> In other words, if a memory hole is indeed CMR, then it's not mandatory
> for the kernel to add it to the reserved areas. By doing so, the number
> of consumed reserved areas can be reduced w/o having any functional
> impact. The kernel still allocates TDX memory from the page allocator.
> There's no harm if the kernel tells the TDX module some memory regions
> are "TDX-usable" but they will never be allocated by the kernel as TDX
> memory.
>
> Note this doesn't have any security impact either because the kernel is
> out of TDX's TCB anyway.
>
> This is feasible because in practice the CMRs just reflect the nature of
> whether the RAM can indeed be used by TDX, thus each CMR tends to be a
> large, uninterrupted range of memory, i.e., unlike the e820 table which
> contains numerous "ACPI *" entries in the first 2G range. Refer to [2]
> for CMRs reported on the problematic platform using off-tree TDX code.
>
> So for this particular module initialization failure, the memory holes
> that are within [0x0, 0x80000000) are mostly indeed CMR. By not adding
> them to the reserved areas, the number of consumed reserved areas for
> the TDMR [0x0, 0x80000000) can be dramatically reduced.
>
> Option 2) is also theoretically feasible, but it is not desired:
>
> It requires more complicated logic to handle splitting TDMR into smaller
> ones, which isn't trivial. There are limitations to splitting TDMR too,
> thus it may not always work: 1) The smallest TDMR is 1GB, and it cannot
> be split any further; 2) This also increases the total number of TDMRs,
> which also has a maximum value limited by the TDX module.
>
> So, fix this issue by using option 1):
>
> 1) reading out the CMRs from the TDX module global metadata, and
> 2) changing to find memory holes for a given TDMR based on CMRs, but not
> based on the list of "TDX-usable memory regions".
>
> Also dump the CMRs in dmesg. They are helpful when something goes wrong
> around "constructing the TDMRs and configuring the TDX module with
> them". Note there are no existing userspace tools that the user can get
> CMRs since they can only be read via SEAMCALL (no CPUID, MSR etc).
>
> [1] BIOS-E820 table of the problematic platform:
>
> BIOS-e820: [mem 0x0000000000000000-0x000000000009efff] usable
> BIOS-e820: [mem 0x000000000009f000-0x00000000000fffff] reserved
> BIOS-e820: [mem 0x0000000000100000-0x000000005d168fff] usable
> BIOS-e820: [mem 0x000000005d169000-0x000000005d22afff] ACPI data
> BIOS-e820: [mem 0x000000005d22b000-0x000000005d3cefff] usable
> BIOS-e820: [mem 0x000000005d3cf000-0x000000005d469fff] reserved
> BIOS-e820: [mem 0x000000005d46a000-0x000000005e5b2fff] usable
> BIOS-e820: [mem 0x000000005e5b3000-0x000000005e5c2fff] reserved
> BIOS-e820: [mem 0x000000005e5c3000-0x000000005e5d2fff] usable
> BIOS-e820: [mem 0x000000005e5d3000-0x000000005e5e4fff] reserved
> BIOS-e820: [mem 0x000000005e5e5000-0x000000005eb57fff] usable
> BIOS-e820: [mem 0x000000005eb58000-0x0000000061357fff] ACPI NVS
> BIOS-e820: [mem 0x0000000061358000-0x000000006172afff] usable
> BIOS-e820: [mem 0x000000006172b000-0x0000000061794fff] ACPI data
> BIOS-e820: [mem 0x0000000061795000-0x00000000617fefff] usable
> BIOS-e820: [mem 0x00000000617ff000-0x0000000061912fff] ACPI data
> BIOS-e820: [mem 0x0000000061913000-0x0000000061998fff] usable
> BIOS-e820: [mem 0x0000000061999000-0x00000000619dffff] ACPI data
> BIOS-e820: [mem 0x00000000619e0000-0x00000000619e1fff] usable
> BIOS-e820: [mem 0x00000000619e2000-0x00000000619e9fff] reserved
> BIOS-e820: [mem 0x00000000619ea000-0x0000000061a26fff] usable
> BIOS-e820: [mem 0x0000000061a27000-0x0000000061baefff] ACPI data
> BIOS-e820: [mem 0x0000000061baf000-0x00000000623c2fff] usable
> BIOS-e820: [mem 0x00000000623c3000-0x0000000062471fff] reserved
> BIOS-e820: [mem 0x0000000062472000-0x0000000062823fff] usable
> BIOS-e820: [mem 0x0000000062824000-0x0000000063a24fff] reserved
> BIOS-e820: [mem 0x0000000063a25000-0x0000000063d57fff] usable
> BIOS-e820: [mem 0x0000000063d58000-0x0000000064157fff] reserved
> BIOS-e820: [mem 0x0000000064158000-0x0000000064158fff] usable
> BIOS-e820: [mem 0x0000000064159000-0x0000000064194fff] reserved
> BIOS-e820: [mem 0x0000000064195000-0x000000006e9cefff] usable
> BIOS-e820: [mem 0x000000006e9cf000-0x000000006eccefff] reserved
> BIOS-e820: [mem 0x000000006eccf000-0x000000006f6fefff] ACPI NVS
> BIOS-e820: [mem 0x000000006f6ff000-0x000000006f7fefff] ACPI data
> BIOS-e820: [mem 0x000000006f7ff000-0x000000006f7fffff] usable
> BIOS-e820: [mem 0x000000006f800000-0x000000008fffffff] reserved
> ......
>
> [2] Convertible Memory Regions of the problematic platform:
>
> virt/tdx: CMR: [0x100000, 0x6f800000)
> virt/tdx: CMR: [0x100000000, 0x107a000000)
> virt/tdx: CMR: [0x1080000000, 0x207c000000)
> virt/tdx: CMR: [0x2080000000, 0x307c000000)
> virt/tdx: CMR: [0x3080000000, 0x407c000000)
>
> Fixes: dde3b60d572c9 ("x86/virt/tdx: Designate reserved areas for all TDMRs")
> Signed-off-by: Kai Huang <kai.huang@...el.com>
> ---
>
> v2 -> v3:
>
> - Add the Fixes tag, although this patch depends on previous patches.
> - CMR_BASE0 -> CMR_BASE(_i), CMR_SIZE0 -> CMR_SIZE(_i) to silence the
> build-check error.
>
> v1 -> v2:
> - Change to walk over CMRs directly to find out memory holes, instead
> of walking over TDX memory blocks and explicitly check whether a hole
> is subregion of CMR. (Chao)
> - Mention any constant macro definitions in global metadata structures
> are TDX architectural. (Binbin)
> - Slightly improve the changelog.
>
> ---
> arch/x86/virt/vmx/tdx/tdx.c | 105 ++++++++++++++++++++++++++++++------
> arch/x86/virt/vmx/tdx/tdx.h | 13 +++++
> 2 files changed, 102 insertions(+), 16 deletions(-)
>
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 0fb673dd43ed..fa335ab1ae92 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -331,6 +331,72 @@ static int get_tdx_sys_info_version(struct tdx_sys_info_version *sysinfo_version
> return ret;
> }
>
> +/* Update the @sysinfo_cmr->num_cmrs to trim tail empty CMRs */
> +static void trim_empty_tail_cmrs(struct tdx_sys_info_cmr *sysinfo_cmr)
> +{
> + int i;
> +
> + for (i = 0; i < sysinfo_cmr->num_cmrs; i++) {
> + u64 cmr_base = sysinfo_cmr->cmr_base[i];
> + u64 cmr_size = sysinfo_cmr->cmr_size[i];
> +
> + if (!cmr_size) {
> + WARN_ON_ONCE(cmr_base);
> + break;
> + }
> +
> + /* TDX architecture: CMR must be 4KB aligned */
> + WARN_ON_ONCE(!PAGE_ALIGNED(cmr_base) ||
> + !PAGE_ALIGNED(cmr_size));
> + }
> +
> + sysinfo_cmr->num_cmrs = i;
> +}
> +
> +static int get_tdx_sys_info_cmr(struct tdx_sys_info_cmr *sysinfo_cmr)
> +{
> + int i, ret = 0;
> +
> +#define TD_SYSINFO_MAP_CMR_INFO(_field_id, _member) \
> + TD_SYSINFO_MAP(_field_id, sysinfo_cmr, _member)
> +
> + TD_SYSINFO_MAP_CMR_INFO(NUM_CMRS, num_cmrs);
> +
> + if (ret)
> + return ret;
> +
> + for (i = 0; i < sysinfo_cmr->num_cmrs; i++) {
> + TD_SYSINFO_MAP_CMR_INFO(CMR_BASE(i), cmr_base[i]);
> + TD_SYSINFO_MAP_CMR_INFO(CMR_SIZE(i), cmr_size[i]);
> + }
> +
> + if (ret)
> + return ret;
> +
> + /*
> + * The TDX module may just report the maximum number of CMRs that
> + * TDX architecturally supports as the actual number of CMRs,
> + * despite the latter is smaller. In this case all the tail
> + * CMRs will be empty. Trim them away.
> + */
> + trim_empty_tail_cmrs(sysinfo_cmr);
> +
> + return 0;
> +}
> +
> +static void print_sys_info_cmr(struct tdx_sys_info_cmr *sysinfo_cmr)
> +{
> + int i;
> +
> + for (i = 0; i < sysinfo_cmr->num_cmrs; i++) {
> + u64 cmr_base = sysinfo_cmr->cmr_base[i];
> + u64 cmr_size = sysinfo_cmr->cmr_size[i];
> +
> + pr_info("CMR[%d]: [0x%llx, 0x%llx)\n", i, cmr_base,
> + cmr_base + cmr_size);
> + }
> +}
> +
> static void print_basic_sys_info(struct tdx_sys_info *sysinfo)
> {
> struct tdx_sys_info_features *features = &sysinfo->features;
> @@ -349,6 +415,8 @@ static void print_basic_sys_info(struct tdx_sys_info *sysinfo)
> version->major, version->minor, version->update,
> version->internal, version->build_num,
> version->build_date, features->tdx_features0);
> +
> + print_sys_info_cmr(&sysinfo->cmr);
> }
>
> static int get_tdx_sys_info_tdmr(struct tdx_sys_info_tdmr *sysinfo_tdmr)
> @@ -379,6 +447,10 @@ static int get_tdx_sys_info(struct tdx_sys_info *sysinfo)
> if (ret)
> return ret;
>
> + ret = get_tdx_sys_info_cmr(&sysinfo->cmr);
> + if (ret)
> + return ret;
> +
> return get_tdx_sys_info_tdmr(&sysinfo->tdmr);
> }
>
> @@ -803,29 +875,28 @@ static int tdmr_add_rsvd_area(struct tdmr_info *tdmr, int *p_idx, u64 addr,
> }
>
> /*
> - * Go through @tmb_list to find holes between memory areas. If any of
> + * Go through all CMRs in @sysinfo_cmr to find memory holes. If any of
> * those holes fall within @tdmr, set up a TDMR reserved area to cover
> * the hole.
> */
> -static int tdmr_populate_rsvd_holes(struct list_head *tmb_list,
> +static int tdmr_populate_rsvd_holes(struct tdx_sys_info_cmr *sysinfo_cmr,
> struct tdmr_info *tdmr,
> int *rsvd_idx,
> u16 max_reserved_per_tdmr)
> {
> - struct tdx_memblock *tmb;
> u64 prev_end;
> - int ret;
> + int i, ret;
>
> /*
> * Start looking for reserved blocks at the
> * beginning of the TDMR.
> */
> prev_end = tdmr->base;
> - list_for_each_entry(tmb, tmb_list, list) {
> + for (i = 0; i < sysinfo_cmr->num_cmrs; i++) {
> u64 start, end;
>
> - start = PFN_PHYS(tmb->start_pfn);
> - end = PFN_PHYS(tmb->end_pfn);
> + start = sysinfo_cmr->cmr_base[i];
> + end = start + sysinfo_cmr->cmr_size[i];
>
> /* Break if this region is after the TDMR */
> if (start >= tdmr_end(tdmr))
> @@ -926,16 +997,16 @@ static int rsvd_area_cmp_func(const void *a, const void *b)
>
> /*
> * Populate reserved areas for the given @tdmr, including memory holes
> - * (via @tmb_list) and PAMTs (via @tdmr_list).
> + * (via @sysinfo_cmr) and PAMTs (via @tdmr_list).
> */
> static int tdmr_populate_rsvd_areas(struct tdmr_info *tdmr,
> - struct list_head *tmb_list,
> + struct tdx_sys_info_cmr *sysinfo_cmr,
> struct tdmr_info_list *tdmr_list,
> u16 max_reserved_per_tdmr)
> {
> int ret, rsvd_idx = 0;
>
> - ret = tdmr_populate_rsvd_holes(tmb_list, tdmr, &rsvd_idx,
> + ret = tdmr_populate_rsvd_holes(sysinfo_cmr, tdmr, &rsvd_idx,
> max_reserved_per_tdmr);
> if (ret)
> return ret;
> @@ -954,10 +1025,10 @@ static int tdmr_populate_rsvd_areas(struct tdmr_info *tdmr,
>
> /*
> * Populate reserved areas for all TDMRs in @tdmr_list, including memory
> - * holes (via @tmb_list) and PAMTs.
> + * holes (via @sysinfo_cmr) and PAMTs.
> */
> static int tdmrs_populate_rsvd_areas_all(struct tdmr_info_list *tdmr_list,
> - struct list_head *tmb_list,
> + struct tdx_sys_info_cmr *sysinfo_cmr,
> u16 max_reserved_per_tdmr)
> {
> int i;
> @@ -966,7 +1037,7 @@ static int tdmrs_populate_rsvd_areas_all(struct tdmr_info_list *tdmr_list,
> int ret;
>
> ret = tdmr_populate_rsvd_areas(tdmr_entry(tdmr_list, i),
> - tmb_list, tdmr_list, max_reserved_per_tdmr);
> + sysinfo_cmr, tdmr_list, max_reserved_per_tdmr);
> if (ret)
> return ret;
> }
> @@ -981,7 +1052,8 @@ static int tdmrs_populate_rsvd_areas_all(struct tdmr_info_list *tdmr_list,
> */
> static int construct_tdmrs(struct list_head *tmb_list,
> struct tdmr_info_list *tdmr_list,
> - struct tdx_sys_info_tdmr *sysinfo_tdmr)
> + struct tdx_sys_info_tdmr *sysinfo_tdmr,
> + struct tdx_sys_info_cmr *sysinfo_cmr)
> {
> int ret;
>
> @@ -994,7 +1066,7 @@ static int construct_tdmrs(struct list_head *tmb_list,
> if (ret)
> return ret;
>
> - ret = tdmrs_populate_rsvd_areas_all(tdmr_list, tmb_list,
> + ret = tdmrs_populate_rsvd_areas_all(tdmr_list, sysinfo_cmr,
> sysinfo_tdmr->max_reserved_per_tdmr);
> if (ret)
> tdmrs_free_pamt_all(tdmr_list);
> @@ -1185,7 +1257,8 @@ static int init_tdx_module(void)
> goto err_free_tdxmem;
>
> /* Cover all TDX-usable memory regions in TDMRs */
> - ret = construct_tdmrs(&tdx_memlist, &tdx_tdmr_list, &sysinfo.tdmr);
> + ret = construct_tdmrs(&tdx_memlist, &tdx_tdmr_list, &sysinfo.tdmr,
> + &sysinfo.cmr);
> if (ret)
> goto err_free_tdmrs;
>
> diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
> index b422e8517e01..e7bed9e717c7 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.h
> +++ b/arch/x86/virt/vmx/tdx/tdx.h
> @@ -40,6 +40,10 @@
> #define MD_FIELD_ID_UPDATE_VERSION 0x0800000100000005ULL
> #define MD_FIELD_ID_INTERNAL_VERSION 0x0800000100000006ULL
>
> +#define MD_FIELD_ID_NUM_CMRS 0x9000000100000000ULL
> +#define MD_FIELD_ID_CMR_BASE(_i) (0x9000000300000080ULL + (u16)_i)
> +#define MD_FIELD_ID_CMR_SIZE(_i) (0x9000000300000100ULL + (u16)_i)
> +
> #define MD_FIELD_ID_MAX_TDMRS 0x9100000100000008ULL
> #define MD_FIELD_ID_MAX_RESERVED_PER_TDMR 0x9100000100000009ULL
> #define MD_FIELD_ID_PAMT_4K_ENTRY_SIZE 0x9100000100000010ULL
> @@ -160,6 +164,14 @@ struct tdx_sys_info_version {
> u32 build_date;
> };
>
> +/* Class "CMR Info" */
> +#define TDX_MAX_CMRS 32 /* architectural */
> +struct tdx_sys_info_cmr {
> + u16 num_cmrs;
> + u64 cmr_base[TDX_MAX_CMRS];
> + u64 cmr_size[TDX_MAX_CMRS];
> +};
> +
> /* Class "TDMR info" */
> struct tdx_sys_info_tdmr {
> u16 max_tdmrs;
> @@ -170,6 +182,7 @@ struct tdx_sys_info_tdmr {
> struct tdx_sys_info {
> struct tdx_sys_info_features features;
> struct tdx_sys_info_version version;
> + struct tdx_sys_info_cmr cmr;
> struct tdx_sys_info_tdmr tdmr;
> };
>
Powered by blists - more mailing lists