linux-kernel - Re: [PATCH v4 3/3] x86/sgx: Add a basic NUMA allocation scheme to sgx_alloc_epc

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <93a2e333-f604-fc44-ee35-3da06b17392a@intel.com>
Date:   Mon, 15 Mar 2021 09:35:03 -0700
From:   Dave Hansen <dave.hansen@...el.com>
To:     Jarkko Sakkinen <jarkko@...nel.org>, linux-sgx@...r.kernel.org
Cc:     Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
        x86@...nel.org, "H. Peter Anvin" <hpa@...or.com>,
        Dave Hansen <dave.hansen@...ux.intel.com>,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH v4 3/3] x86/sgx: Add a basic NUMA allocation scheme to
 sgx_alloc_epc_page()

On 3/13/21 8:01 AM, Jarkko Sakkinen wrote:
> Background
> ==========
> 
> EPC section is covered by one or more SRAT entries that are associated with
> one and only one PXM (NUMA node). The motivation behind this patch is to
> provide basic elements of building allocation scheme based on this premise.

A better set of background information here would also remind folks what
an 'EPC section' is.

> Just like normal RAM, enclave memory (EPC) should be covered by entries
> in the ACPI SRAT table.  These entries allow each EPC section to be
> associated with a NUMA node.
> 
> Use this information to implement a simple NUMA-aware allocator for
> enclave memory.

SGX enclave memory is enumerated by the processor in contiguous physical
ranges called "EPC sections".  Currently, there is a free list per
section, but allocations simply target the lowest-numbered sections.
This is functional, but has no NUMA awareness.

Fortunately, EPC sections are covered by entries in the ACPI SRAT table.
 These entries allow each EPC section to be associated with a NUMA node,
just like normal RAM.


> Solution
> ========
> 
> Use phys_to_target_node() to associate each NUMA node with the EPC
> sections contained within its range. In sgx_alloc_epc_page(), first try
> to allocate from the NUMA node, where the CPU is executing. If that
> fails, allocate from other nodes, iterating them from the current node
> in order.

To me, this is just telling us what the code does.  It's not very
useful.  I'd say:

Implement a NUMA-aware enclave page allocator.  Mirror the buddy
allocator and maintain a list of enclave pages for each NUMA node.
Attempt to allocate enclave memory first from local nodes, then fall
back to other nodes.

Note that the fallback is not as sophisticated as the buddy allocator
and is itself not aware of NUMA distances.  When a node's free list is
empty, it searches for the next-highest node with enclave pages (and
will wrap if necessary).  This could be improved in the future.

> Other
> =====
> 
> NUMA_KEEP_MEMINFO dependency is required for phys_to_target_node().

Reading the changelog, it's not obvious that you're talking about a
Kconfig variable here.

I was also thinking, this says:

	# Keep arch NUMA mapping infrastructure post-init.
	config NUMA_KEEP_MEMINFO
	        bool

But we only need it during SGX init.  Could you explain a bit why it's
needed here specifically?  Superficially it seems like we only need this
info *during* init.

> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 513895af8ee7..3e6152a8dd2b 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1930,6 +1930,7 @@ config X86_SGX
>  	depends on CRYPTO_SHA256=y
>  	select SRCU
>  	select MMU_NOTIFIER
> +	select NUMA_KEEP_MEMINFO if NUMA
>  	help
>  	  Intel(R) Software Guard eXtensions (SGX) is a set of CPU instructions
>  	  that can be used by applications to set aside private regions of code
> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index cb4561444b96..3b524a1361d6 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -18,14 +18,23 @@ static int sgx_nr_epc_sections;
>  static struct task_struct *ksgxd_tsk;
>  static DECLARE_WAIT_QUEUE_HEAD(ksgxd_waitq);
>  
> -/*
> - * These variables are part of the state of the reclaimer, and must be accessed
> - * with sgx_reclaimer_lock acquired.
> - */
> +/* The reclaimer lock protected variables prepend the lock. */
>  static LIST_HEAD(sgx_active_page_list);
> -
>  static DEFINE_SPINLOCK(sgx_reclaimer_lock);
>  
> +/* The free page list lock protected variables prepend the lock. */
> +static unsigned long sgx_nr_free_pages;
> +
> +/* Nodes with one or more EPC sections. */
> +static nodemask_t sgx_numa_mask;
> +
> +/*
> + * Array with one list_head for each possible NUMA node.  Each
> + * list contains all the sgx_epc_section's which are on that
> + * node.
> + */
> +static struct sgx_numa_node *sgx_numa_nodes;
> +
>  /*
>   * When the driver initialized, EPC pages go first here, as they could be
>   * initialized to an active enclave, on kexec entry.
> @@ -352,21 +361,9 @@ static void sgx_reclaim_pages(void)
>  	}
>  }
>  
> -static unsigned long sgx_nr_free_pages(void)
> -{
> -	unsigned long cnt = 0;
> -	int i;
> -
> -	for (i = 0; i < sgx_nr_epc_sections; i++)
> -		cnt += sgx_epc_sections[i].free_cnt;
> -
> -	return cnt;
> -}
> -
>  static bool sgx_should_reclaim(unsigned long watermark)
>  {
> -	return sgx_nr_free_pages() < watermark &&
> -	       !list_empty(&sgx_active_page_list);
> +	return sgx_nr_free_pages < watermark && !list_empty(&sgx_active_page_list);
>  }
>  
>  static int ksgxd(void *p)
> @@ -443,50 +440,63 @@ static bool __init sgx_page_reclaimer_init(void)
>  	return true;
>  }
>  
> -static struct sgx_epc_page *__sgx_alloc_epc_page_from_section(struct sgx_epc_section *section)
> +static struct sgx_epc_page *__sgx_alloc_epc_page_from_node(int nid)
>  {
> -	struct sgx_epc_page *page;
> +	struct sgx_numa_node *node = &sgx_numa_nodes[nid];
> +	struct sgx_epc_page *page = NULL;
> +
> +	if (!node_isset(nid, sgx_numa_mask))
> +		return NULL;

I don't think this mask check is super necessary.  It won't ever trigger
for the "fallback" path.  For the node-local allocation, I guess it
saves a cacheline (node->free_page_list).

> -	spin_lock(&section->lock);
> +	spin_lock(&node->lock);
>  
> -	if (list_empty(&section->page_list)) {
> -		spin_unlock(&section->lock);
> +	if (list_empty(&node->free_page_list)) {
> +		spin_unlock(&node->lock);
>  		return NULL;
>  	}
>  
> -	page = list_first_entry(&section->page_list, struct sgx_epc_page, list);
> +	page = list_first_entry(&node->free_page_list, struct sgx_epc_page, list);
>  	list_del_init(&page->list);
> -	section->free_cnt--;
> +	sgx_nr_free_pages--;
> +
> +	spin_unlock(&node->lock);
>  
> -	spin_unlock(&section->lock);
>  	return page;
>  }
>  
>  /**
>   * __sgx_alloc_epc_page() - Allocate an EPC page
>   *
> - * Iterate through EPC sections and borrow a free EPC page to the caller. When a
> - * page is no longer needed it must be released with sgx_free_epc_page().
> + * Iterate through NUMA nodes and borrow a free EPC page to the caller. When a

"borrow" is a weird word to use here.

> + * page is no longer needed it must be released with sgx_free_epc_page(). Start
> + * from the NUMA node, where the caller is executing.

I don't think we need to tell folks how allocators work: that "free" and
"alloc" are opposites.

>   * Return:
> - *   an EPC page,
> - *   -errno on error
> + * - an EPC page:	Free EPC pages were available.
> + * - ERR_PTR(-ENOMEM):	Run out of EPC pages.
>   */

I think returning NULL from a allocator is probably OK.  Folks
understand that NULL means ENOMEM.

>  struct sgx_epc_page *__sgx_alloc_epc_page(void)
>  {
> -	struct sgx_epc_section *section;
>  	struct sgx_epc_page *page;
> -	int i;
> +	int nid = numa_node_id();
>  
> -	for (i = 0; i < sgx_nr_epc_sections; i++) {
> -		section = &sgx_epc_sections[i];
> +	/* Try to allocate EPC from the current node, first: */
> +	page = __sgx_alloc_epc_page_from_node(nid);
> +	if (page)
> +		return page;
>  
> -		page = __sgx_alloc_epc_page_from_section(section);
> +	/* Then, go through the other nodes: */
> +	while (true) {
> +		nid = next_node_in(nid, sgx_numa_mask);
> +		if (nid == numa_node_id())
> +			break;

One nit: This can _theoretically_ livelock.  Preemption is enabled here
and numa_node_id() can change.  It's theoretically possible that
numa_node_id()!=nid forever.

The way to prevent this is to read numa_node_id() once and then compare
'nid' against that single read.

> +		page = __sgx_alloc_epc_page_from_node(nid);
>  		if (page)
> -			return page;
> +			break;
>  	}
>  
> -	return ERR_PTR(-ENOMEM);
> +	return page;
>  }

I guess you probably wrote it this way because you prefer it.  But, this
can be written with a single call to __sgx_alloc_epc_page_from_node().

>  /**
> @@ -592,6 +602,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
>  void sgx_free_epc_page(struct sgx_epc_page *page)
>  {
>  	struct sgx_epc_section *section = &sgx_epc_sections[page->section];
> +	struct sgx_numa_node *node = section->node;
>  	int ret;
>  
>  	WARN_ON_ONCE(page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED);
> @@ -600,10 +611,12 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
>  	if (WARN_ONCE(ret, "EREMOVE returned %d (0x%x)", ret, ret))
>  		return;
>  
> -	spin_lock(&section->lock);
> -	list_add_tail(&page->list, &section->page_list);
> -	section->free_cnt++;
> -	spin_unlock(&section->lock);
> +	spin_lock(&node->lock);
> +
> +	list_add_tail(&page->list, &node->free_page_list);
> +	sgx_nr_free_pages++;
> +
> +	spin_unlock(&node->lock);
>  }
>  
>  static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
> @@ -624,8 +637,6 @@ static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
>  	}
>  
>  	section->phys_addr = phys_addr;
> -	spin_lock_init(&section->lock);
> -	INIT_LIST_HEAD(&section->page_list);
>  
>  	for (i = 0; i < nr_pages; i++) {
>  		section->pages[i].section = index;
> @@ -634,7 +645,7 @@ static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
>  		list_add_tail(&section->pages[i].list, &sgx_dirty_page_list);
>  	}
>  
> -	section->free_cnt = nr_pages;
> +	sgx_nr_free_pages += nr_pages;
>  	return true;
>  }
>  
> @@ -653,8 +664,11 @@ static bool __init sgx_page_cache_init(void)
>  {
>  	u32 eax, ebx, ecx, edx, type;
>  	u64 pa, size;
> +	int nid;
>  	int i;
>  
> +	sgx_numa_nodes = kmalloc_array(num_possible_nodes(), sizeof(*sgx_numa_nodes), GFP_KERNEL);

Needs a NULL check.

>  	for (i = 0; i < ARRAY_SIZE(sgx_epc_sections); i++) {
>  		cpuid_count(SGX_CPUID, i + SGX_CPUID_EPC, &eax, &ebx, &ecx, &edx);
>  
> @@ -677,6 +691,21 @@ static bool __init sgx_page_cache_init(void)
>  			break;
>  		}
>  
> +		nid = numa_map_to_online_node(phys_to_target_node(pa));
> +		if (nid == NUMA_NO_NODE) {
> +			/* The physical address is already printed above. */
> +			pr_warn(FW_BUG "Unable to map EPC section to online node. Fallback to the NUMA node 0.\n");
> +			nid = 0;
> +		}
> +
> +		if (!node_isset(nid, sgx_numa_mask)) {
> +			spin_lock_init(&sgx_numa_nodes[nid].lock);
> +			INIT_LIST_HEAD(&sgx_numa_nodes[nid].free_page_list);
> +			node_set(nid, sgx_numa_mask);
> +		}
> +
> +		sgx_epc_sections[i].node =  &sgx_numa_nodes[nid];
> +
>  		sgx_nr_epc_sections++;
>  	}

The rest looks OK.