linux-kernel - Re: [PATCH v6 04/12] x86/sgx: Implement basic EPC misc cgroup functionality

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ad7aafb88e45e5176d15eedea60695e104d24751.camel@intel.com>
Date:   Mon, 6 Nov 2023 12:09:45 +0000
From:   "Huang, Kai" <kai.huang@...el.com>
To:     "hpa@...or.com" <hpa@...or.com>,
        "linux-sgx@...r.kernel.org" <linux-sgx@...r.kernel.org>,
        "x86@...nel.org" <x86@...nel.org>,
        "dave.hansen@...ux.intel.com" <dave.hansen@...ux.intel.com>,
        "jarkko@...nel.org" <jarkko@...nel.org>,
        "cgroups@...r.kernel.org" <cgroups@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "mkoutny@...e.com" <mkoutny@...e.com>,
        "tglx@...utronix.de" <tglx@...utronix.de>,
        "haitao.huang@...ux.intel.com" <haitao.huang@...ux.intel.com>,
        "Mehta, Sohil" <sohil.mehta@...el.com>,
        "tj@...nel.org" <tj@...nel.org>,
        "mingo@...hat.com" <mingo@...hat.com>,
        "bp@...en8.de" <bp@...en8.de>
CC:     "mikko.ylinen@...ux.intel.com" <mikko.ylinen@...ux.intel.com>,
        "Christopherson,, Sean" <seanjc@...gle.com>,
        "Zhang, Bo" <zhanb@...rosoft.com>,
        "kristen@...ux.intel.com" <kristen@...ux.intel.com>,
        "yangjie@...rosoft.com" <yangjie@...rosoft.com>,
        "sean.j.christopherson@...el.com" <sean.j.christopherson@...el.com>,
        "Li, Zhiquan1" <zhiquan1.li@...el.com>,
        "anakrish@...rosoft.com" <anakrish@...rosoft.com>
Subject: Re: [PATCH v6 04/12] x86/sgx: Implement basic EPC misc cgroup
 functionality

On Mon, 2023-10-30 at 11:20 -0700, Haitao Huang wrote:
> From: Kristen Carlson Accardi <kristen@...ux.intel.com>
> 
> Implement support for cgroup control of SGX Enclave Page Cache (EPC)
> memory using the misc cgroup controller. EPC memory is independent
> from normal system memory, e.g. must be reserved at boot from RAM and
> cannot be converted between EPC and normal memory while the system is
> running. EPC is managed by the SGX subsystem and is not accounted by
> the memory controller.
> 
> Much like normal system memory, EPC memory can be overcommitted via
> virtual memory techniques and pages can be swapped out of the EPC to
> their backing store (normal system memory, e.g. shmem).  The SGX EPC
> subsystem is analogous to the memory subsystem and the SGX EPC controller
> is in turn analogous to the memory controller; it implements limit and
> protection models for EPC memory.

Nit:

The above two paragraphs talk about what is EPC and EPC resource control needs
to be done separately, etc, but IMHO it lacks some background about "why" EPC
resource control is needed, e.g, from use case's perspective.

> 
> The misc controller provides a mechanism to set a hard limit of EPC
> usage via the "sgx_epc" resource in "misc.max". The total EPC memory
> available on the system is reported via the "sgx_epc" resource in
> "misc.capacity".

Please separate what the current misc cgroup provides, and how this patch is
going to utilize.

Please describe the changes in imperative mood. E.g, "report total EPC memory
via ...", instead of "... is reported via ...".

> 
> This patch was modified from the previous version to only add basic EPC
> cgroup structure, accounting allocations for cgroup usage
> (charge/uncharge), setup misc cgroup callbacks, set total EPC capacity.

This isn't changelog material.  Please focus on describing the high level design
and why you chose such design.

> 
> For now, the EPC cgroup simply blocks additional EPC allocation in
> sgx_alloc_epc_page() when the limit is reached. Reclaimable pages are
> still tracked in the global active list, only reclaimed by the global
> reclaimer when the total free page count is lower than a threshold.
> 
> Later patches will reorganize the tracking and reclamation code in the
> globale reclaimer and implement per-cgroup tracking and reclaiming.
> 
> Co-developed-by: Sean Christopherson <sean.j.christopherson@...el.com>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@...el.com>
> Signed-off-by: Kristen Carlson Accardi <kristen@...ux.intel.com>
> Co-developed-by: Haitao Huang <haitao.huang@...ux.intel.com>
> Signed-off-by: Haitao Huang <haitao.huang@...ux.intel.com>
> ---
> V6:
> - Split the original large patch"Limit process EPC usage with misc
> cgroup controller"  and restructure it (Kai)
> ---
>  arch/x86/Kconfig                     |  13 ++++
>  arch/x86/kernel/cpu/sgx/Makefile     |   1 +
>  arch/x86/kernel/cpu/sgx/epc_cgroup.c | 103 +++++++++++++++++++++++++++
>  arch/x86/kernel/cpu/sgx/epc_cgroup.h |  36 ++++++++++
>  arch/x86/kernel/cpu/sgx/main.c       |  28 ++++++++
>  arch/x86/kernel/cpu/sgx/sgx.h        |   3 +
>  6 files changed, 184 insertions(+)
>  create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
>  create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 66bfabae8814..e17c5dc3aea4 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1921,6 +1921,19 @@ config X86_SGX
>  
>  	  If unsure, say N.
>  
> +config CGROUP_SGX_EPC
> +	bool "Miscellaneous Cgroup Controller for Enclave Page Cache (EPC) for Intel SGX"
> +	depends on X86_SGX && CGROUP_MISC
> +	help
> +	  Provides control over the EPC footprint of tasks in a cgroup via
> +	  the Miscellaneous cgroup controller.
> +
> +	  EPC is a subset of regular memory that is usable only by SGX
> +	  enclaves and is very limited in quantity, e.g. less than 1%
> +	  of total DRAM.
> +
> +	  Say N if unsure.
> +
>  config X86_USER_SHADOW_STACK
>  	bool "X86 userspace shadow stack"
>  	depends on AS_WRUSS
> diff --git a/arch/x86/kernel/cpu/sgx/Makefile b/arch/x86/kernel/cpu/sgx/Makefile
> index 9c1656779b2a..12901a488da7 100644
> --- a/arch/x86/kernel/cpu/sgx/Makefile
> +++ b/arch/x86/kernel/cpu/sgx/Makefile
> @@ -4,3 +4,4 @@ obj-y += \
>  	ioctl.o \
>  	main.o
>  obj-$(CONFIG_X86_SGX_KVM)	+= virt.o
> +obj-$(CONFIG_CGROUP_SGX_EPC)	       += epc_cgroup.o
> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> new file mode 100644
> index 000000000000..500627d0563f
> --- /dev/null
> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> @@ -0,0 +1,103 @@
> +// SPDX-License-Identifier: GPL-2.0
> +// Copyright(c) 2022 Intel Corporation.
> +
> +#include <linux/atomic.h>
> +#include <linux/kernel.h>
> +#include "epc_cgroup.h"
> +
> +static inline struct sgx_epc_cgroup *sgx_epc_cgroup_from_misc_cg(struct misc_cg *cg)
> +{
> +	return (struct sgx_epc_cgroup *)(cg->res[MISC_CG_RES_SGX_EPC].priv);
> +}
> +
> +static inline bool sgx_epc_cgroup_disabled(void)
> +{
> +	return !cgroup_subsys_enabled(misc_cgrp_subsys);

From below, the root EPC cgroup is dynamically allocated.  Shouldn't it also
check whether the root EPC cgroup is valid?

> +}
> +
> +/**
> + * sgx_epc_cgroup_try_charge() - hierarchically try to charge a single EPC page
> + *
> + * Returns EPC cgroup or NULL on success, -errno on failure.
> + */
> +struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(void)
> +{
> +	struct sgx_epc_cgroup *epc_cg;
> +	int ret;
> +
> +	if (sgx_epc_cgroup_disabled())
> +		return NULL;
> +
> +	epc_cg = sgx_epc_cgroup_from_misc_cg(get_current_misc_cg());
> +	ret = misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE);
> +
> +	if (!ret) {
> +		/* No epc_cg returned, release ref from get_current_misc_cg() */
> +		put_misc_cg(epc_cg->cg);
> +		return ERR_PTR(-ENOMEM);

misc_cg_try_charge() returns 0 when successfully charged, no?

> +	}
> +
> +	/* Ref released in sgx_epc_cgroup_uncharge() */
> +	return epc_cg;
> +}

IMHO the above _try_charge() returning a pointer of EPC cgroup is a little bit
odd, because it doesn't match the existing misc_cg_try_charge() which returns
whether the charge is successful or not.  sev_misc_cg_try_charge() matches
misc_cg_try_charge() too. 

I think it's better to split "getting EPC cgroup" part out as a separate helper,
and make this _try_charge() match existing pattern:

	struct sgx_epc_cgroup *sgx_get_current_epc_cg(void)
	{
		if (sgx_epc_cgroup_disabled())
			return NULL;
	
		return sgx_epc_cgroup_from_misc_cg(get_current_misc_cg());
	}

	int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg)
	{
		if (!epc_cg)
			return -EINVAL;
	
		return misc_cg_try_charge(epc_cg->cg);
	}

Having sgx_get_current_epc_cg() also makes the caller easier to read, because we
can immediately know we are going to charge the *current* EPC cgroup, but not
some cgroup hidden within sgx_epc_cgroup_try_charge().

> +
> +/**
> + * sgx_epc_cgroup_uncharge() - hierarchically uncharge EPC pages
> + * @epc_cg:	the charged epc cgroup
> + */
> +void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg)
> +{
> +	if (sgx_epc_cgroup_disabled())
> +		return;

If with above change, check !epc_cg instead.

> +
> +	misc_cg_uncharge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE);
> +
> +	/* Ref got from sgx_epc_cgroup_try_charge() */
> +	put_misc_cg(epc_cg->cg);
> +}
> 	
> +
> +static void sgx_epc_cgroup_free(struct misc_cg *cg)
> +{
> +	struct sgx_epc_cgroup *epc_cg;
> +
> +	epc_cg = sgx_epc_cgroup_from_misc_cg(cg);
> +	if (!epc_cg)
> +		return;
> +
> +	kfree(epc_cg);
> +}
> +
> +static int sgx_epc_cgroup_alloc(struct misc_cg *cg);
> +
> +const struct misc_operations_struct sgx_epc_cgroup_ops = {
> +	.alloc = sgx_epc_cgroup_alloc,
> +	.free = sgx_epc_cgroup_free,
> +};
> +
> +static int sgx_epc_cgroup_alloc(struct misc_cg *cg)
> +{
> +	struct sgx_epc_cgroup *epc_cg;
> +
> +	epc_cg = kzalloc(sizeof(*epc_cg), GFP_KERNEL);
> +	if (!epc_cg)
> +		return -ENOMEM;
> +
> +	cg->res[MISC_CG_RES_SGX_EPC].misc_ops = &sgx_epc_cgroup_ops;
> +	cg->res[MISC_CG_RES_SGX_EPC].priv = epc_cg;
> +	epc_cg->cg = cg;
> +	return 0;
> +}
> +
> +static int __init sgx_epc_cgroup_init(void)
> +{
> +	struct misc_cg *cg;
> +
> +	if (!boot_cpu_has(X86_FEATURE_SGX))
> +		return 0;
> +
> +	cg = misc_cg_root();
> +	BUG_ON(!cg);

BUG_ON() will catch some eyeball, but it cannot be NULL in practice IIUC.

I am not sure whether you can just make misc @root_cg visible (instead of having
the misc_cg_root() helper) and directly use @root_cg here to avoid using the
BUG().  No opinion here.

> +
> +	return sgx_epc_cgroup_alloc(cg);

As mentioned above the memory allocation can fail, in which case EPC cgroup is
effectively disabled IIUC?

One way is to manually check whether root EPC cgroup is valid in
sgx_epc_cgroup_disabled().  Alternatively, you can have a static root EPC cgroup
here:

	static struct sgx_epc_cgroup root_epc_cg;

In this way you can have a sgx_epc_cgroup_init(&epc_cg), and call it from
sgx_epc_cgroup_alloc().

> +}
> +subsys_initcall(sgx_epc_cgroup_init);
> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
> new file mode 100644
> index 000000000000..c3abfe82be15
> --- /dev/null
> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
> @@ -0,0 +1,36 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/* Copyright(c) 2022 Intel Corporation. */
> +#ifndef _INTEL_SGX_EPC_CGROUP_H_
> +#define _INTEL_SGX_EPC_CGROUP_H_
> +
> +#include <asm/sgx.h>
> +#include <linux/cgroup.h>
> +#include <linux/list.h>
> +#include <linux/misc_cgroup.h>
> +#include <linux/page_counter.h>
> +#include <linux/workqueue.h>
> +
> +#include "sgx.h"
> +
> +#ifndef CONFIG_CGROUP_SGX_EPC
> +#define MISC_CG_RES_SGX_EPC MISC_CG_RES_TYPES

Do you need this macro?

> +struct sgx_epc_cgroup;
> +
> +static inline struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(void)
> +{
> +	return NULL;
> +}
> +
> +static inline void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg) { }
> +#else
> +struct sgx_epc_cgroup {
> +	struct misc_cg *cg;
> +};
> +
> +struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(void);
> +void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg);
> +bool sgx_epc_cgroup_lru_empty(struct misc_cg *root);

Why do you need sgx_epc_cgroup_lru_empty() here? 

> +
> +#endif
> +
> +#endif /* _INTEL_SGX_EPC_CGROUP_H_ */
> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index 166692f2d501..07606f391540 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -6,6 +6,7 @@
>  #include <linux/highmem.h>
>  #include <linux/kthread.h>
>  #include <linux/miscdevice.h>
> +#include <linux/misc_cgroup.h>
>  #include <linux/node.h>
>  #include <linux/pagemap.h>
>  #include <linux/ratelimit.h>
> @@ -17,6 +18,7 @@
>  #include "driver.h"
>  #include "encl.h"
>  #include "encls.h"
> +#include "epc_cgroup.h"
>  
>  struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
>  static int sgx_nr_epc_sections;
> @@ -559,6 +561,11 @@ int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
>  struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
>  {
>  	struct sgx_epc_page *page;
> +	struct sgx_epc_cgroup *epc_cg;
> +
> +	epc_cg = sgx_epc_cgroup_try_charge();
> +	if (IS_ERR(epc_cg))
> +		return ERR_CAST(epc_cg);
>  
>  	for ( ; ; ) {
>  		page = __sgx_alloc_epc_page();
> @@ -580,10 +587,21 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
>  			break;
>  		}
>  
> +		/*
> +		 * Need to do a global reclamation if cgroup was not full but free
> +		 * physical pages run out, causing __sgx_alloc_epc_page() to fail.
> +		 */
>  		sgx_reclaim_pages();

What's the final behaviour?  IIUC it should be reclaiming from the *current* EPC
cgroup?  If so shouldn't we just pass the @epc_cg to it here?

I think we can make this patch as "structure" patch w/o actually having EPC
cgroup enabled, i.e., sgx_get_current_epc_cg() always return NULL.

And we can have one patch to change sgx_reclaim_pages() to take the 'struct
sgx_epc_lru_list *' as argument:

	void sgx_reclaim_pages_lru(struct sgx_epc_lru_list * lru)
	{
		...
	}

Then here we can have something like:

	void sgx_reclaim_pages(struct sgx_epc_cg *epc_cg)
	{
		struct sgx_epc_lru_list *lru = 
			epc_cg ? &epc_cg->lru : &sgx_global_lru;

		sgx_reclaim_pages_lru(lru);
	}

Makes sense?

>  		cond_resched();
>  	}
>  
> +	if (!IS_ERR(page)) {
> +		WARN_ON_ONCE(page->epc_cg);
> +		page->epc_cg = epc_cg;
> +	} else {
> +		sgx_epc_cgroup_uncharge(epc_cg);
> +	}
> +
>  	if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
>  		wake_up(&ksgxd_waitq);
>  
> @@ -604,6 +622,11 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
>  	struct sgx_epc_section *section = &sgx_epc_sections[page->section];
>  	struct sgx_numa_node *node = section->node;
>  
> +	if (page->epc_cg) {
> +		sgx_epc_cgroup_uncharge(page->epc_cg);
> +		page->epc_cg = NULL;
> +	}
> +
>  	spin_lock(&node->lock);
>  
>  	page->owner = NULL;
> @@ -643,6 +666,7 @@ static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
>  		section->pages[i].flags = 0;
>  		section->pages[i].owner = NULL;
>  		section->pages[i].poison = 0;
> +		section->pages[i].epc_cg = NULL;
>  		list_add_tail(&section->pages[i].list, &sgx_dirty_page_list);
>  	}
>  
> @@ -787,6 +811,7 @@ static void __init arch_update_sysfs_visibility(int nid) {}
>  static bool __init sgx_page_cache_init(void)
>  {
>  	u32 eax, ebx, ecx, edx, type;
> +	u64 capacity = 0;
>  	u64 pa, size;
>  	int nid;
>  	int i;
> @@ -837,6 +862,7 @@ static bool __init sgx_page_cache_init(void)
>  
>  		sgx_epc_sections[i].node =  &sgx_numa_nodes[nid];
>  		sgx_numa_nodes[nid].size += size;
> +		capacity += size;
>  
>  		sgx_nr_epc_sections++;
>  	}
> @@ -846,6 +872,8 @@ static bool __init sgx_page_cache_init(void)
>  		return false;
>  	}
>  
> +	misc_cg_set_capacity(MISC_CG_RES_SGX_EPC, capacity);
> +
>  	return true;
>  }

I would separate setting up capacity as a separate patch.

>  
> diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
> index d2dad21259a8..b1786774b8d2 100644
> --- a/arch/x86/kernel/cpu/sgx/sgx.h
> +++ b/arch/x86/kernel/cpu/sgx/sgx.h
> @@ -29,12 +29,15 @@
>  /* Pages on free list */
>  #define SGX_EPC_PAGE_IS_FREE		BIT(1)
>  
> +struct sgx_epc_cgroup;
> +
>  struct sgx_epc_page {
>  	unsigned int section;
>  	u16 flags;
>  	u16 poison;
>  	struct sgx_encl_page *owner;
>  	struct list_head list;
> +	struct sgx_epc_cgroup *epc_cg;
>  };
> 

Adding @epc_cg unconditionally means even with !CONFIG_CGROUP_SGX_EPC the memory
is still occupied.  IMHO that would bring non-trivial memory waste as it's 8-
bytes for each EPC page.

If it's not good to have "#ifdef CONFIG_CGROUP_SGX_EPC" in the .c file, then
perhaps we can have some helper here, e.g.,


static inline sgx_epc_page_set_cg(struct sgx_epc_page *epc_page,
				  struct sgx_epc_cgroup *epc_cg)
{
#ifdef CONFIG_CGROUP_SGX_EPC
	epc_page->epc_cg = epc_cg;
#endif
}