Message-ID: <YE+0HjwRqmVDXyCU@kernel.org>
Date: Mon, 15 Mar 2021 21:23:10 +0200
From: Jarkko Sakkinen <jarkko@...nel.org>
To: Dave Hansen <dave.hansen@...el.com>
Cc: linux-sgx@...r.kernel.org, Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
x86@...nel.org, "H. Peter Anvin" <hpa@...or.com>,
Dave Hansen <dave.hansen@...ux.intel.com>,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v4 3/3] x86/sgx: Add a basic NUMA allocation scheme to
sgx_alloc_epc_page()
On Mon, Mar 15, 2021 at 09:35:03AM -0700, Dave Hansen wrote:
> On 3/13/21 8:01 AM, Jarkko Sakkinen wrote:
> > Background
> > ==========
> >
> > EPC section is covered by one or more SRAT entries that are associated with
> > one and only one PXM (NUMA node). The motivation behind this patch is to
> > provide basic elements of building allocation scheme based on this premise.
>
> A better set of background information here would also remind folks what
> an 'EPC section' is.
I don't actually know exactly what they are, because the SDM's definition
is not very rigorous ;-)
I.e. it is not very precise about the relationship between NUMA nodes and
sections.
>
> > Just like normal RAM, enclave memory (EPC) should be covered by entries
> > in the ACPI SRAT table. These entries allow each EPC section to be
> > associated with a NUMA node.
> >
> > Use this information to implement a simple NUMA-aware allocator for
> > enclave memory.
>
> SGX enclave memory is enumerated by the processor in contiguous physical
> ranges called "EPC sections". Currently, there is a free list per
> section, but allocations simply target the lowest-numbered sections.
> This is functional, but has no NUMA awareness.
>
> Fortunately, EPC sections are covered by entries in the ACPI SRAT table.
> These entries allow each EPC section to be associated with a NUMA node,
> just like normal RAM.
Thanks!
> > Solution
> > ========
> >
> > Use phys_to_target_node() to associate each NUMA node with the EPC
> > sections contained within its range. In sgx_alloc_epc_page(), first try
> > to allocate from the NUMA node, where the CPU is executing. If that
> > fails, allocate from other nodes, iterating them from the current node
> > in order.
>
> To me, this is just telling us what the code does. It's not very
> useful. I'd say:
>
> Implement a NUMA-aware enclave page allocator. Mirror the buddy
> allocator and maintain a list of enclave pages for each NUMA node.
> Attempt to allocate enclave memory first from local nodes, then fall
> back to other nodes.
>
> Note that the fallback is not as sophisticated as the buddy allocator
> and is itself not aware of NUMA distances. When a node's free list is
> empty, it searches for the next-highest node with enclave pages (and
> will wrap if necessary). This could be improved in the future.
Thanks.
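For reference, the per-node state the patch adds is just a free list plus
the lock serializing it. The struct goes to sgx.h, which is not part of the
hunks quoted here, so take this as a paraphrase rather than the exact
definition:

/* Per-NUMA-node EPC bookkeeping (paraphrased, not the exact hunk). */
struct sgx_numa_node {
	struct list_head free_page_list;
	spinlock_t lock;
};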
> > Other
> > =====
> >
> > NUMA_KEEP_MEMINFO dependency is required for phys_to_target_node().
>
> Reading the changelog, it's not obvious that you're talking about a
> Kconfig variable here.
>
> I was also thinking, this says:
>
> # Keep arch NUMA mapping infrastructure post-init.
> config NUMA_KEEP_MEMINFO
> bool
>
> But we only need it during SGX init. Could you explain a bit why it's
> needed here specifically? Superficially it seems like we only need this
> info *during* init.
Well, numa_meminfo is a static variable of mm/numa.c, so we cannot access it
directly. And phys_to_target_node(), which accesses its data, does not
exist unless NUMA_KEEP_MEMINFO is defined.
So, yes, theoretically we could pull the data without NUMA_KEEP_MEMINFO,
but that would require making numa_meminfo a global variable.
We do not care about the post-init part, so in that sense you are right.
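Just to illustrate what the lookup amounts to (hypothetical names, not the
kernel's implementation): phys_to_target_node() conceptually walks the
recorded memory ranges and returns the nid of the one containing the
physical address, roughly like this:

/* Illustration only: sgx_meminfo_blk and sgx_phys_to_node are made-up names. */
struct sgx_meminfo_blk {
	u64 start;
	u64 end;
	int nid;
};

static int sgx_phys_to_node(const struct sgx_meminfo_blk *blks, int nr_blks,
			    u64 pa)
{
	int i;

	/* Return the node of the range that contains 'pa'. */
	for (i = 0; i < nr_blks; i++) {
		if (blks[i].start <= pa && pa < blks[i].end)
			return blks[i].nid;
	}

	return NUMA_NO_NODE;
}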
>
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index 513895af8ee7..3e6152a8dd2b 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -1930,6 +1930,7 @@ config X86_SGX
> > depends on CRYPTO_SHA256=y
> > select SRCU
> > select MMU_NOTIFIER
> > + select NUMA_KEEP_MEMINFO if NUMA
> > help
> > Intel(R) Software Guard eXtensions (SGX) is a set of CPU instructions
> > that can be used by applications to set aside private regions of code
> > diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> > index cb4561444b96..3b524a1361d6 100644
> > --- a/arch/x86/kernel/cpu/sgx/main.c
> > +++ b/arch/x86/kernel/cpu/sgx/main.c
> > @@ -18,14 +18,23 @@ static int sgx_nr_epc_sections;
> > static struct task_struct *ksgxd_tsk;
> > static DECLARE_WAIT_QUEUE_HEAD(ksgxd_waitq);
> >
> > -/*
> > - * These variables are part of the state of the reclaimer, and must be accessed
> > - * with sgx_reclaimer_lock acquired.
> > - */
> > +/* The reclaimer lock protected variables prepend the lock. */
> > static LIST_HEAD(sgx_active_page_list);
> > -
> > static DEFINE_SPINLOCK(sgx_reclaimer_lock);
> >
> > +/* The free page list lock protected variables prepend the lock. */
> > +static unsigned long sgx_nr_free_pages;
> > +
> > +/* Nodes with one or more EPC sections. */
> > +static nodemask_t sgx_numa_mask;
> > +
> > +/*
> > + * Array with one list_head for each possible NUMA node. Each
> > + * list contains all the sgx_epc_section's which are on that
> > + * node.
> > + */
> > +static struct sgx_numa_node *sgx_numa_nodes;
> > +
> > /*
> > * When the driver initialized, EPC pages go first here, as they could be
> > * initialized to an active enclave, on kexec entry.
> > @@ -352,21 +361,9 @@ static void sgx_reclaim_pages(void)
> > }
> > }
> >
> > -static unsigned long sgx_nr_free_pages(void)
> > -{
> > - unsigned long cnt = 0;
> > - int i;
> > -
> > - for (i = 0; i < sgx_nr_epc_sections; i++)
> > - cnt += sgx_epc_sections[i].free_cnt;
> > -
> > - return cnt;
> > -}
> > -
> > static bool sgx_should_reclaim(unsigned long watermark)
> > {
> > - return sgx_nr_free_pages() < watermark &&
> > - !list_empty(&sgx_active_page_list);
> > + return sgx_nr_free_pages < watermark && !list_empty(&sgx_active_page_list);
> > }
> >
> > static int ksgxd(void *p)
> > @@ -443,50 +440,63 @@ static bool __init sgx_page_reclaimer_init(void)
> > return true;
> > }
> >
> > -static struct sgx_epc_page *__sgx_alloc_epc_page_from_section(struct sgx_epc_section *section)
> > +static struct sgx_epc_page *__sgx_alloc_epc_page_from_node(int nid)
> > {
> > - struct sgx_epc_page *page;
> > + struct sgx_numa_node *node = &sgx_numa_nodes[nid];
> > + struct sgx_epc_page *page = NULL;
> > +
> > + if (!node_isset(nid, sgx_numa_mask))
> > + return NULL;
>
> I don't think this mask check is super necessary. It won't ever trigger
> for the "fallback" path. For the node-local allocation, I guess it
> saves a cacheline (node->free_page_list).
>
> > - spin_lock(&section->lock);
> > + spin_lock(&node->lock);
> >
> > - if (list_empty(&section->page_list)) {
> > - spin_unlock(&section->lock);
> > + if (list_empty(&node->free_page_list)) {
> > + spin_unlock(&node->lock);
> > return NULL;
> > }
> >
> > - page = list_first_entry(&section->page_list, struct sgx_epc_page, list);
> > + page = list_first_entry(&node->free_page_list, struct sgx_epc_page, list);
> > list_del_init(&page->list);
> > - section->free_cnt--;
> > + sgx_nr_free_pages--;
> > +
> > + spin_unlock(&node->lock);
> >
> > - spin_unlock(&section->lock);
> > return page;
> > }
> >
> > /**
> > * __sgx_alloc_epc_page() - Allocate an EPC page
> > *
> > - * Iterate through EPC sections and borrow a free EPC page to the caller. When a
> > - * page is no longer needed it must be released with sgx_free_epc_page().
> > + * Iterate through NUMA nodes and borrow a free EPC page to the caller. When a
>
> "borrow" is a weird word to use here.
>
> > + * page is no longer needed it must be released with sgx_free_epc_page(). Start
> > + * from the NUMA node, where the caller is executing.
>
> I don't think we need to tell folks how allocators work: that "free" and
> "alloc" are opposites.
>
> > * Return:
> > - * an EPC page,
> > - * -errno on error
> > + * - an EPC page: Free EPC pages were available.
> > + * - ERR_PTR(-ENOMEM): Run out of EPC pages.
> > */
>
> I think returning NULL from a allocator is probably OK. Folks
> understand that NULL means ENOMEM.
>
> > struct sgx_epc_page *__sgx_alloc_epc_page(void)
> > {
> > - struct sgx_epc_section *section;
> > struct sgx_epc_page *page;
> > - int i;
> > + int nid = numa_node_id();
> >
> > - for (i = 0; i < sgx_nr_epc_sections; i++) {
> > - section = &sgx_epc_sections[i];
> > + /* Try to allocate EPC from the current node, first: */
> > + page = __sgx_alloc_epc_page_from_node(nid);
> > + if (page)
> > + return page;
> >
> > - page = __sgx_alloc_epc_page_from_section(section);
> > + /* Then, go through the other nodes: */
> > + while (true) {
> > + nid = next_node_in(nid, sgx_numa_mask);
> > + if (nid == numa_node_id())
> > + break;
>
> One nit: This can _theoretically_ livelock. Preemption is enabled here
> and numa_node_id() can change. It's theoretically possible that
> numa_node_id()!=nid forever.
>
> The way to prevent this is to read numa_node_id() once and then compare
> 'nid' against that single read.
>
> > + page = __sgx_alloc_epc_page_from_node(nid);
> > if (page)
> > - return page;
> > + break;
> > }
> >
> > - return ERR_PTR(-ENOMEM);
> > + return page;
> > }
>
> I guess you probably wrote it this way because you prefer it. But, this
> can be written with a single call to __sgx_alloc_epc_page_from_node().
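Right. Something like this would address both points, i.e. read
numa_node_id() exactly once and funnel everything through a single call
site (just a sketch against this version of the patch, not the final code):

struct sgx_epc_page *__sgx_alloc_epc_page(void)
{
	struct sgx_epc_page *page;
	int nid = numa_node_id();	/* read numa_node_id() exactly once */
	int i;

	/*
	 * One attempt per possible node, starting from the current node and
	 * then walking sgx_numa_mask. Bounding the loop (instead of comparing
	 * against numa_node_id() again) keeps it finite even if the task
	 * migrates; the node_isset() check in __sgx_alloc_epc_page_from_node()
	 * covers the case where the current node has no EPC section.
	 */
	for (i = 0; i < num_possible_nodes(); i++) {
		page = __sgx_alloc_epc_page_from_node(nid);
		if (page)
			return page;

		nid = next_node_in(nid, sgx_numa_mask);
		if (nid >= MAX_NUMNODES)	/* empty mask */
			break;
	}

	return NULL;
}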
>
> > /**
> > @@ -592,6 +602,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
> > void sgx_free_epc_page(struct sgx_epc_page *page)
> > {
> > struct sgx_epc_section *section = &sgx_epc_sections[page->section];
> > + struct sgx_numa_node *node = section->node;
> > int ret;
> >
> > WARN_ON_ONCE(page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED);
> > @@ -600,10 +611,12 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
> > if (WARN_ONCE(ret, "EREMOVE returned %d (0x%x)", ret, ret))
> > return;
> >
> > - spin_lock(&section->lock);
> > - list_add_tail(&page->list, &section->page_list);
> > - section->free_cnt++;
> > - spin_unlock(&section->lock);
> > + spin_lock(&node->lock);
> > +
> > + list_add_tail(&page->list, &node->free_page_list);
> > + sgx_nr_free_pages++;
> > +
> > + spin_unlock(&node->lock);
> > }
> >
> > static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
> > @@ -624,8 +637,6 @@ static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
> > }
> >
> > section->phys_addr = phys_addr;
> > - spin_lock_init(&section->lock);
> > - INIT_LIST_HEAD(&section->page_list);
> >
> > for (i = 0; i < nr_pages; i++) {
> > section->pages[i].section = index;
> > @@ -634,7 +645,7 @@ static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
> > list_add_tail(&section->pages[i].list, &sgx_dirty_page_list);
> > }
> >
> > - section->free_cnt = nr_pages;
> > + sgx_nr_free_pages += nr_pages;
> > return true;
> > }
> >
> > @@ -653,8 +664,11 @@ static bool __init sgx_page_cache_init(void)
> > {
> > u32 eax, ebx, ecx, edx, type;
> > u64 pa, size;
> > + int nid;
> > int i;
> >
> > + sgx_numa_nodes = kmalloc_array(num_possible_nodes(), sizeof(*sgx_numa_nodes), GFP_KERNEL);
>
> Needs a NULL check.
Ugh, lol, I did spot this but forgot to fix it. Darn, well, I'll fix it for
the next version :-)
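Something along these lines, assuming bailing out of sgx_page_cache_init()
with false is the right failure path here (sketch):

	sgx_numa_nodes = kmalloc_array(num_possible_nodes(),
				       sizeof(*sgx_numa_nodes), GFP_KERNEL);
	if (!sgx_numa_nodes)
		return false;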
>
> > for (i = 0; i < ARRAY_SIZE(sgx_epc_sections); i++) {
> > cpuid_count(SGX_CPUID, i + SGX_CPUID_EPC, &eax, &ebx, &ecx, &edx);
> >
> > @@ -677,6 +691,21 @@ static bool __init sgx_page_cache_init(void)
> > break;
> > }
> >
> > + nid = numa_map_to_online_node(phys_to_target_node(pa));
> > + if (nid == NUMA_NO_NODE) {
> > + /* The physical address is already printed above. */
> > + pr_warn(FW_BUG "Unable to map EPC section to online node. Fallback to the NUMA node 0.\n");
> > + nid = 0;
> > + }
> > +
> > + if (!node_isset(nid, sgx_numa_mask)) {
> > + spin_lock_init(&sgx_numa_nodes[nid].lock);
> > + INIT_LIST_HEAD(&sgx_numa_nodes[nid].free_page_list);
> > + node_set(nid, sgx_numa_mask);
> > + }
> > +
> > + sgx_epc_sections[i].node = &sgx_numa_nodes[nid];
> > +
> > sgx_nr_epc_sections++;
> > }
>
> The rest looks OK.
>
/Jarkko