[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <a5e009636c5144622e0a910a459cd9d05976715e.camel@intel.com>
Date: Tue, 16 Apr 2024 13:22:06 +0000
From: "Huang, Kai" <kai.huang@...el.com>
To: "hpa@...or.com" <hpa@...or.com>, "tim.c.chen@...ux.intel.com"
<tim.c.chen@...ux.intel.com>, "linux-sgx@...r.kernel.org"
<linux-sgx@...r.kernel.org>, "x86@...nel.org" <x86@...nel.org>,
"dave.hansen@...ux.intel.com" <dave.hansen@...ux.intel.com>,
"jarkko@...nel.org" <jarkko@...nel.org>, "cgroups@...r.kernel.org"
<cgroups@...r.kernel.org>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>, "mkoutny@...e.com" <mkoutny@...e.com>,
"tglx@...utronix.de" <tglx@...utronix.de>, "haitao.huang@...ux.intel.com"
<haitao.huang@...ux.intel.com>, "Mehta, Sohil" <sohil.mehta@...el.com>,
"tj@...nel.org" <tj@...nel.org>, "mingo@...hat.com" <mingo@...hat.com>,
"bp@...en8.de" <bp@...en8.de>
CC: "mikko.ylinen@...ux.intel.com" <mikko.ylinen@...ux.intel.com>,
"seanjc@...gle.com" <seanjc@...gle.com>, "anakrish@...rosoft.com"
<anakrish@...rosoft.com>, "Zhang, Bo" <zhanb@...rosoft.com>,
"kristen@...ux.intel.com" <kristen@...ux.intel.com>, "yangjie@...rosoft.com"
<yangjie@...rosoft.com>, "Li, Zhiquan1" <zhiquan1.li@...el.com>,
"chrisyan@...rosoft.com" <chrisyan@...rosoft.com>
Subject: Re: [PATCH v12 05/14] x86/sgx: Implement basic EPC misc cgroup
functionality
On Mon, 2024-04-15 at 20:20 -0700, Haitao Huang wrote:
> From: Kristen Carlson Accardi <kristen@...ux.intel.com>
>
> SGX Enclave Page Cache (EPC) memory allocations are separate from normal
> RAM allocations, and are managed solely by the SGX subsystem. The
> existing cgroup memory controller cannot be used to limit or account for
> SGX EPC memory, which is a desirable feature in some environments. For
> instance, within a Kubernetes environment, while a user may specify a
> particular EPC quota for a pod, the orchestrator requires a mechanism to
> enforce that the pod's actual runtime EPC usage does not exceed the
> allocated quota.
>
> Utilize the misc controller [admin-guide/cgroup-v2.rst, 5-9. Misc] to
> limit and track EPC allocations per cgroup. Earlier patches have added
> the "sgx_epc" resource type in the misc cgroup subsystem. Add basic
> support in SGX driver as the "sgx_epc" resource provider:
>
> - Set "capacity" of EPC by calling misc_cg_set_capacity()
> - Update EPC usage counter, "current", by calling charge and uncharge
> APIs for EPC allocation and deallocation, respectively.
> - Setup sgx_epc resource type specific callbacks, which perform
> initialization and cleanup during cgroup allocation and deallocation,
> respectively.
>
> With these changes, the misc cgroup controller enables users to set a hard
> limit for EPC usage in the "misc.max" interface file. It reports current
> usage in "misc.current", the total EPC memory available in
> "misc.capacity", and the number of times EPC usage reached the max limit
> in "misc.events".
>
> For now, the EPC cgroup simply blocks additional EPC allocation in
> sgx_alloc_epc_page() when the limit is reached. Reclaimable pages are
> still tracked in the global active list, only reclaimed by the global
> reclaimer when the total free page count is lower than a threshold.
>
> Later patches will reorganize the tracking and reclamation code in the
> global reclaimer and implement per-cgroup tracking and reclaiming.
>
> Co-developed-by: Sean Christopherson <sean.j.christopherson@...el.com>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@...el.com>
> Signed-off-by: Kristen Carlson Accardi <kristen@...ux.intel.com>
> Co-developed-by: Haitao Huang <haitao.huang@...ux.intel.com>
> Signed-off-by: Haitao Huang <haitao.huang@...ux.intel.com>
> Reviewed-by: Jarkko Sakkinen <jarkko@...nel.org>
> Reviewed-by: Tejun Heo <tj@...nel.org>
> Tested-by: Jarkko Sakkinen <jarkko@...nel.org>
I don't see any big issue, so feel free to add:
Reviewed-by: Kai Huang <kai.huang@...el.com>
Nitpickings below:
[...]
> --- /dev/null
> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> @@ -0,0 +1,72 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright(c) 2022-2024 Intel Corporation. */
> +
> +#include <linux/atomic.h>
> +#include <linux/kernel.h>
It doesn't seem you need the above two here.
Probably they are needed in later patches, in that case we can move to the
relevant patch(es) that they got used.
However I think it's better to explicitly include <linux/slab.h> since
kzalloc()/kfree() are used.
Btw, I am not sure whether you want to use <linux/kernel.h> because looks
it contains a lot of unrelated staff. Anyway I guess nobody cares.
> +#include "epc_cgroup.h"
> +
> +/* The root SGX EPC cgroup */
> +static struct sgx_cgroup sgx_cg_root;
The comment isn't necessary (sorry didn't notice before), because the code
is pretty clear saying that IMHO.
[...]
>
> --- /dev/null
> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
> @@ -0,0 +1,72 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _SGX_EPC_CGROUP_H_
> +#define _SGX_EPC_CGROUP_H_
> +
> +#include <asm/sgx.h>
I don't see why you need <asm/sgx.h> here. Also, ...
> +#include <linux/cgroup.h>
> +#include <linux/misc_cgroup.h>
> +
> +#include "sgx.h"
... "sgx.h" already includes <asm/sgx.h>
[...]
>
> +static inline struct sgx_cgroup *sgx_get_current_cg(void)
> +{
> + /* get_current_misc_cg() never returns NULL when Kconfig enabled */
> + return sgx_cgroup_from_misc_cg(get_current_misc_cg());
> +}
I spent some time looking into this. And yes if I was reading code
correctly the get_current_misc_cg() should never return NULL when Kconfig
is on.
I typed my analysis below in [*]. And it would be helpful if any cgroup
expert can have a second eye on this.
[...]
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -6,6 +6,7 @@
> #include <linux/highmem.h>
> #include <linux/kthread.h>
> #include <linux/miscdevice.h>
> +#include <linux/misc_cgroup.h>
Is this needed? I believe SGX variants in "epc_cgroup.h" should be enough
for sgx/main.c?
[...]
[*] IIUC get_current_misc_cg() should never return NULL when Kconfig is on
(code indent slight adjusted for text wrap).
Firstly, during kernel boot there's always a valid @css allocated for MISC
cgroup, regardless whether it is disabled in kernel command line.
int __init cgroup_init(void)
{
...
for_each_subsys(ss, ssid) {
if (ss->early_init) {
...
} else {
cgroup_init_subsys(ss, false);
}
...
if (!cgroup_ssid_enabled(ssid))
continue;
...
}
...
}
cgroup_init_subsys() makes a valid @css is allocated for MISC cgroup and
set the pointer to the @init_css_set.
static void __init cgroup_init_subsys(struct cgroup_subsys *ss,
...)
{
struct cgroup_subsys_state *css;
...
css = ss->css_alloc(NULL);
/* We don't handle early failures gracefully */
BUG_ON(IS_ERR(css));
...
init_css_set.subsys[ss->id] = css;
...
}
All processes are by default associated to the @init_css_set:
void cgroup_fork(struct task_struct *child)
{
RCU_INIT_POINTER(child->cgroups, &init_css_set);
INIT_LIST_HEAD(&child->cg_list);
}
At runtime, when a new cgroup is created in the hierarchy, the "cgroup"
can have a NULL @css if some subsystem is not enabled in it:
static int cgroup_apply_control_enable(struct cgroup *cgrp)
{
struct cgroup *dsct;
struct cgroup_subsys_state *d_css;
struct cgroup_subsys *ss;
int ssid, ret;
cgroup_for_each_live_descendant_pre(dsct, d_css, cgrp) {
for_each_subsys(ss, ssid) {
struct cgroup_subsys_state *css =
cgroup_css(dsct, ss);
if (!(cgroup_ss_mask(dsct) &
(1 << ss->id)))
continue;
if (!css) {
css = css_create(dsct, ss);
if (IS_ERR(css))
return PTR_ERR(css);
}
...
}
}
}
We can see if cgroup_ss_mask(dsct) doesn't have subsystem enabled, the
css_create() won't be invoked, and cgroup->subsys[ssid] will remain NULL.
However, when a process is bound to a specific cgroup, the kernel tries to
get the cgorup's "effective css", and it seems this "effective css" cannot
be NULL if the subsys has a valid 'struct cgroup_subsys' provided, which
the MISC cgroup does.
There are couple of code paths can lead to this, but they all reach to
static struct css_set *find_existing_css_set(...)
{
struct cgroup_root *root = cgrp->root;
struct cgroup_subsys *ss;
struct css_set *cset;
unsigned long key;
int i;
for_each_subsys(ss, i) {
if (root->subsys_mask & (1UL << i)) {
/*
* @ss is in this hierarchy, so we want
* the effective css from @cgrp.
*/
template[i] = cgroup_e_css_by_mask(cgrp,
ss);
} else {
/*
* @ss is not in this hierarchy, so we
* don't want to change the css.
*/
template[i] = old_cset->subsys[i];
}
}
...
}
Which calls cgroup_e_css_by_mask() to get the "effective css" when subsys
is enabled in the root cgroup (which means MISC cgroup is not disabled by
kernel command line), or get the default css, which is @init_css_set-
>subsys[ssid], which is always valid for MISC cgroup.
And more specifically, the "effective css" in the cgroup_e_css_by_mask()
is done by searching the entire hierarchy, so MISC cgroup will always have
a valid "effective css".
static struct cgroup_subsys_state *cgroup_e_css_by_mask(
struct cgroup *cgrp,
struct cgroup_subsys *ss)
{
lockdep_assert_held(&cgroup_mutex);
if (!ss)
return &cgrp->self;
...
while (!(cgroup_ss_mask(cgrp) & (1 << ss->id))) {
cgrp = cgroup_parent(cgrp);
if (!cgrp)
return NULL;
}
return cgroup_css(cgrp, ss);
}
The comment of cgroup_e_css_by_mask() says:
* Similar to cgroup_css() but returns the effective css, which is defined
* as the matching css of the nearest ancestor including self which has @ss
* enabled. If @ss is associated with the hierarchy @cgrp is on, this
* function is guaranteed to return non-NULL css.
It's hard for me to interpret the second sentence, specifically, what does
"@ss is associated with the hierarchy @cgrp is on" mean. I interpret it
as "subsys is enabled in root and/or any descendants".
But again, in the find_existing_css_set() it is called when the root
cgroup has enabled the subsys, so it should always return a non-NULL css.
And that means for any process, get_current_misc_cg() cannot be NULL.
Powered by blists - more mailing lists