[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <a5e009636c5144622e0a910a459cd9d05976715e.camel@intel.com>
Date: Tue, 16 Apr 2024 13:22:06 +0000
From: "Huang, Kai" <kai.huang@...el.com>
To: "hpa@...or.com" <hpa@...or.com>, "tim.c.chen@...ux.intel.com"
	<tim.c.chen@...ux.intel.com>, "linux-sgx@...r.kernel.org"
	<linux-sgx@...r.kernel.org>, "x86@...nel.org" <x86@...nel.org>,
	"dave.hansen@...ux.intel.com" <dave.hansen@...ux.intel.com>,
	"jarkko@...nel.org" <jarkko@...nel.org>, "cgroups@...r.kernel.org"
	<cgroups@...r.kernel.org>, "linux-kernel@...r.kernel.org"
	<linux-kernel@...r.kernel.org>, "mkoutny@...e.com" <mkoutny@...e.com>,
	"tglx@...utronix.de" <tglx@...utronix.de>, "haitao.huang@...ux.intel.com"
	<haitao.huang@...ux.intel.com>, "Mehta, Sohil" <sohil.mehta@...el.com>,
	"tj@...nel.org" <tj@...nel.org>, "mingo@...hat.com" <mingo@...hat.com>,
	"bp@...en8.de" <bp@...en8.de>
CC: "mikko.ylinen@...ux.intel.com" <mikko.ylinen@...ux.intel.com>,
	"seanjc@...gle.com" <seanjc@...gle.com>, "anakrish@...rosoft.com"
	<anakrish@...rosoft.com>, "Zhang, Bo" <zhanb@...rosoft.com>,
	"kristen@...ux.intel.com" <kristen@...ux.intel.com>, "yangjie@...rosoft.com"
	<yangjie@...rosoft.com>, "Li, Zhiquan1" <zhiquan1.li@...el.com>,
	"chrisyan@...rosoft.com" <chrisyan@...rosoft.com>
Subject: Re: [PATCH v12 05/14] x86/sgx: Implement basic EPC misc cgroup
 functionality
On Mon, 2024-04-15 at 20:20 -0700, Haitao Huang wrote:
> From: Kristen Carlson Accardi <kristen@...ux.intel.com>
> 
> SGX Enclave Page Cache (EPC) memory allocations are separate from normal
> RAM allocations, and are managed solely by the SGX subsystem. The
> existing cgroup memory controller cannot be used to limit or account for
> SGX EPC memory, which is a desirable feature in some environments. For
> instance, within a Kubernetes environment, while a user may specify a
> particular EPC quota for a pod, the orchestrator requires a mechanism to
> enforce that the pod's actual runtime EPC usage does not exceed the
> allocated quota.
> 
> Utilize the misc controller [admin-guide/cgroup-v2.rst, 5-9. Misc] to
> limit and track EPC allocations per cgroup. Earlier patches have added
> the "sgx_epc" resource type in the misc cgroup subsystem. Add basic
> support in SGX driver as the "sgx_epc" resource provider:
> 
> - Set "capacity" of EPC by calling misc_cg_set_capacity()
> - Update EPC usage counter, "current", by calling charge and uncharge
> APIs for EPC allocation and deallocation, respectively.
> - Setup sgx_epc resource type specific callbacks, which perform
> initialization and cleanup during cgroup allocation and deallocation,
> respectively.
> 
> With these changes, the misc cgroup controller enables users to set a hard
> limit for EPC usage in the "misc.max" interface file. It reports current
> usage in "misc.current", the total EPC memory available in
> "misc.capacity", and the number of times EPC usage reached the max limit
> in "misc.events".
> 
> For now, the EPC cgroup simply blocks additional EPC allocation in
> sgx_alloc_epc_page() when the limit is reached. Reclaimable pages are
> still tracked in the global active list, only reclaimed by the global
> reclaimer when the total free page count is lower than a threshold.
> 
> Later patches will reorganize the tracking and reclamation code in the
> global reclaimer and implement per-cgroup tracking and reclaiming.
> 
> Co-developed-by: Sean Christopherson <sean.j.christopherson@...el.com>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@...el.com>
> Signed-off-by: Kristen Carlson Accardi <kristen@...ux.intel.com>
> Co-developed-by: Haitao Huang <haitao.huang@...ux.intel.com>
> Signed-off-by: Haitao Huang <haitao.huang@...ux.intel.com>
> Reviewed-by: Jarkko Sakkinen <jarkko@...nel.org>
> Reviewed-by: Tejun Heo <tj@...nel.org>
> Tested-by: Jarkko Sakkinen <jarkko@...nel.org>
I don't see any big issue, so feel free to add:
Reviewed-by: Kai Huang <kai.huang@...el.com>
Nitpickings below:
[...]
> --- /dev/null
> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> @@ -0,0 +1,72 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright(c) 2022-2024 Intel Corporation. */
> +
> +#include <linux/atomic.h>
> +#include <linux/kernel.h>
It doesn't seem you need the above two here.
Probably they are needed in later patches, in that case we can move to the
relevant patch(es) that they got used.
However I think it's better to explicitly include <linux/slab.h> since
kzalloc()/kfree() are used.
Btw, I am not sure whether you want to use <linux/kernel.h> because looks
it contains a lot of unrelated staff.  Anyway I guess nobody cares.
> +#include "epc_cgroup.h"
> +
> +/* The root SGX EPC cgroup */
> +static struct sgx_cgroup sgx_cg_root;
The comment isn't necessary (sorry didn't notice before), because the code
is pretty clear saying that IMHO.
[...]
> 
> --- /dev/null
> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
> @@ -0,0 +1,72 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _SGX_EPC_CGROUP_H_
> +#define _SGX_EPC_CGROUP_H_
> +
> +#include <asm/sgx.h>
I don't see why you need <asm/sgx.h> here.  Also, ...
> +#include <linux/cgroup.h>
> +#include <linux/misc_cgroup.h>
> +
> +#include "sgx.h"
... "sgx.h" already includes <asm/sgx.h>
[...]
> 
> +static inline struct sgx_cgroup *sgx_get_current_cg(void)
> +{
> +	/* get_current_misc_cg() never returns NULL when Kconfig enabled */
> +	return sgx_cgroup_from_misc_cg(get_current_misc_cg());
> +}
I spent some time looking into this.  And yes if I was reading code
correctly the get_current_misc_cg() should never return NULL when Kconfig
is on.
I typed my analysis below in [*].  And it would be helpful if any cgroup
expert can have a second eye on this.
[...]
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -6,6 +6,7 @@
>  #include <linux/highmem.h>
>  #include <linux/kthread.h>
>  #include <linux/miscdevice.h>
> +#include <linux/misc_cgroup.h>
Is this needed?  I believe SGX variants in "epc_cgroup.h" should be enough
for sgx/main.c?
[...]
[*] IIUC get_current_misc_cg() should never return NULL when Kconfig is on
(code indent slight adjusted for text wrap).
Firstly, during kernel boot there's always a valid @css allocated for MISC
cgroup, regardless whether it is disabled in kernel command line.
	int __init cgroup_init(void)
	{
		...
 	        for_each_subsys(ss, ssid) {                                    
			if (ss->early_init) {                                                 
				...
                	} else {                                                
                		cgroup_init_subsys(ss, false);                         
                	}
			...
                	if (!cgroup_ssid_enabled(ssid))
                        	continue;
			...
		}
		...
	}
cgroup_init_subsys() makes a valid @css is allocated for MISC cgroup and
set the pointer to the @init_css_set.
	static void __init cgroup_init_subsys(struct cgroup_subsys *ss, 
						...)
	{
        	struct cgroup_subsys_state *css;
		...
        	css = ss->css_alloc(NULL);
        	/* We don't handle early failures gracefully */
        	BUG_ON(IS_ERR(css));
        	
		...
        	init_css_set.subsys[ss->id] = css;
	
		...
	}
All processes are by default associated to the @init_css_set:
	void cgroup_fork(struct task_struct *child)
	{
        	RCU_INIT_POINTER(child->cgroups, &init_css_set);
        	INIT_LIST_HEAD(&child->cg_list);
	}
At runtime, when a new cgroup is created in the hierarchy, the "cgroup"
can have a NULL @css if some subsystem is not enabled in it:
	static int cgroup_apply_control_enable(struct cgroup *cgrp)             
	{       
        	struct cgroup *dsct;                                            
        	struct cgroup_subsys_state *d_css;                              
        	struct cgroup_subsys *ss;
        	int ssid, ret;
        
        	cgroup_for_each_live_descendant_pre(dsct, d_css, cgrp) {        
                	for_each_subsys(ss, ssid) {
                        	struct cgroup_subsys_state *css = 
						cgroup_css(dsct, ss);                                              
                
                        	if (!(cgroup_ss_mask(dsct) & 
					(1 << ss->id)))                                                     
                                	continue;                               
        
	                        if (!css) {                                     
	                                css = css_create(dsct, ss);             
	                                if (IS_ERR(css))                        
	                                        return PTR_ERR(css);            
	                        }
				...
			}
		}
	}
We can see if cgroup_ss_mask(dsct) doesn't have subsystem enabled, the
css_create() won't be invoked, and cgroup->subsys[ssid] will remain NULL.
However, when a process is bound to a specific cgroup, the kernel tries to
get the cgorup's "effective css", and it seems this "effective css" cannot
be NULL if the subsys has a valid 'struct cgroup_subsys' provided, which
the MISC cgroup does.
There are couple of code paths can lead to this, but they all reach to
	static struct css_set *find_existing_css_set(...)
	{
	        struct cgroup_root *root = cgrp->root;
	        struct cgroup_subsys *ss;
	        struct css_set *cset;
	        unsigned long key;
	        int i;  
		
	        for_each_subsys(ss, i) {
	                if (root->subsys_mask & (1UL << i)) {
	                        /*
	                         * @ss is in this hierarchy, so we want
	                         * the effective css from @cgrp.
	                         */
	                        template[i] = cgroup_e_css_by_mask(cgrp,
						ss); 
	                } else {        
	                        /*
	                         * @ss is not in this hierarchy, so we
	                         * don't want to change the css.
	                         */
	                        template[i] = old_cset->subsys[i];
	                }
	        }
		...
	}
Which calls cgroup_e_css_by_mask() to get the "effective css" when subsys
is enabled in the root cgroup (which means MISC cgroup is not disabled by
kernel command line), or get the default css, which is @init_css_set-
>subsys[ssid], which is always valid for MISC cgroup.
 
And more specifically, the "effective css" in the cgroup_e_css_by_mask()
is done by searching the entire hierarchy, so MISC cgroup will always have
a valid "effective css".
	static struct cgroup_subsys_state *cgroup_e_css_by_mask(
				struct cgroup *cgrp,
				struct cgroup_subsys *ss)
	{               
        	lockdep_assert_held(&cgroup_mutex);
        	if (!ss)
        	        return &cgrp->self; 
		...
        	while (!(cgroup_ss_mask(cgrp) & (1 << ss->id))) {
        	        cgrp = cgroup_parent(cgrp);
       	        	if (!cgrp)
                        	return NULL;
        	}
		return cgroup_css(cgrp, ss);
	}
The comment of cgroup_e_css_by_mask() says:
* Similar to cgroup_css() but returns the effective css, which is defined
* as the matching css of the nearest ancestor including self which has @ss
* enabled.  If @ss is associated with the hierarchy @cgrp is on, this
* function is guaranteed to return non-NULL css.
It's hard for me to interpret the second sentence, specifically, what does
"@ss is associated with the hierarchy @cgrp is on" mean.  I interpret it
as "subsys is enabled in root and/or any descendants".
But again, in the find_existing_css_set() it is called when the root
cgroup has enabled the subsys, so it should always return a non-NULL css.
And that means for any process, get_current_misc_cg() cannot be NULL.
Powered by blists - more mailing lists
 
