linux-kernel - Re: [PATCH v15 12/14] x86/sgx: Turn on per-cgroup EPC reclamation

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <op.2pn6wbdwwjvjmi@hhuan26-mobl.amr.corp.intel.com>
Date: Thu, 20 Jun 2024 10:06:01 -0500
From: "Haitao Huang" <haitao.huang@...ux.intel.com>
To: "chenridong@...wei.com" <chenridong@...wei.com>,
 "linux-sgx@...r.kernel.org" <linux-sgx@...r.kernel.org>,
 "cgroups@...r.kernel.org" <cgroups@...r.kernel.org>, "mkoutny@...e.com"
 <mkoutny@...e.com>, "dave.hansen@...ux.intel.com"
 <dave.hansen@...ux.intel.com>, "tim.c.chen@...ux.intel.com"
 <tim.c.chen@...ux.intel.com>, "linux-kernel@...r.kernel.org"
 <linux-kernel@...r.kernel.org>, "mingo@...hat.com" <mingo@...hat.com>,
 "tglx@...utronix.de" <tglx@...utronix.de>, "tj@...nel.org" <tj@...nel.org>,
 "jarkko@...nel.org" <jarkko@...nel.org>, "Mehta, Sohil"
 <sohil.mehta@...el.com>, "hpa@...or.com" <hpa@...or.com>, "bp@...en8.de"
 <bp@...en8.de>, "x86@...nel.org" <x86@...nel.org>, "Huang, Kai"
 <kai.huang@...el.com>
Cc: "mikko.ylinen@...ux.intel.com" <mikko.ylinen@...ux.intel.com>,
 "seanjc@...gle.com" <seanjc@...gle.com>, "anakrish@...rosoft.com"
 <anakrish@...rosoft.com>, "Zhang, Bo" <zhanb@...rosoft.com>,
 "kristen@...ux.intel.com" <kristen@...ux.intel.com>, "yangjie@...rosoft.com"
 <yangjie@...rosoft.com>, "Li, Zhiquan1" <zhiquan1.li@...el.com>,
 "chrisyan@...rosoft.com" <chrisyan@...rosoft.com>
Subject: Re: [PATCH v15 12/14] x86/sgx: Turn on per-cgroup EPC reclamation

Hi Kai

On Thu, 20 Jun 2024 05:30:16 -0500, Huang, Kai <kai.huang@...el.com> wrote:

>
> On 18/06/2024 12:53 am, Huang, Haitao wrote:
>> From: Kristen Carlson Accardi <kristen@...ux.intel.com>
>>
>> Previous patches have implemented all infrastructure needed for
>> per-cgroup EPC page tracking and reclaiming. But all reclaimable EPC
>> pages are still tracked in the global LRU as sgx_epc_page_lru() always
>> returns reference to the global LRU.
>>
>> Change sgx_epc_page_lru() to return the LRU of the cgroup in which the
>> given EPC page is allocated.
>>
>> This makes all EPC pages tracked in per-cgroup LRUs and the global
>> reclaimer (ksgxd) will not be able to reclaim any pages from the global
>> LRU. However, in cases of over-committing, i.e., the sum of cgroup
>> limits greater than the total capacity, cgroups may never reclaim but
>> the total usage can still be near the capacity. Therefore a global
>> reclamation is still needed in those cases and it should be performed
>> from the root cgroup.
>>
>> Modify sgx_reclaim_pages_global(), to reclaim from the root EPC cgroup
>> when cgroup is enabled. Similar to sgx_cgroup_reclaim_pages(), return
>> the next cgroup so callers can use it as the new starting node for next
>> round of reclamation if needed.
>>
>> Also update sgx_can_reclaim_global(), to check emptiness of LRUs of all
>> cgroups when EPC cgroup is enabled, otherwise only check the global LRU.
>>
>> Finally, change sgx_reclaim_direct(), to check and ensure there are free
>> pages at cgroup level so forward progress can be made by the caller.
>
> Reading above, it's not clear how the _new_ global reclaim works with
> multiple LRUs.
>
> E.g., the current global reclaim essentially treats all EPC pages equally
> when scanning those pages.  From the above, I don't see how this is
> achieved in the new global reclaim.
>
> The changelog should:
>
> 1) describe the how does existing global reclaim work, and then describe
> how to achieve the same beahviour in the new global reclaim which works
> with multiple LRUs;
>
> 2) If there's any behaviour difference between the "existing" vs the  
> "new"
> global reclaim, the changelog should point out the difference, and  
> explain
> why such difference is OK.

Sure I can explain. here is what I plan to add for v16:

Note the original global reclaimer has
only one LRU and always scans and reclaims from the head of this global
LRU. The new global reclaimer always starts the scanning from the root
node, only moves down to its descendants if more reclamation is needed
or the root node does not have SGX_NR_TO_SCAN (16) pages in the LRU.
This makes the enclave pages in the root node more likely being
reclaimed if they are not frequently used (not 'young'). Unless we track
pages in one LRU again, we can not really match exactly the same
behavior of the original global reclaimer. And one-LRU approach would
make per-cgroup reclamation scanning and reclaiming too complex.  On the
other hand, this design is acceptable for following reasons:

1) For all purposes of using cgroups, enclaves will need live in
      non-root (leaf for cgroup v2) nodes where limits can be enforced
      per-cgroup.
2) Global reclamation now only happens in situation mentioned above when
      a lower level cgroup not at its limit can't allocate due to over
      commit at global level.
3) The pages in root being slightly penalized are not busily used
      anyway.
4) In cases that multiple rounds of reclamation is needed, the caller of
      sgx_reclaim_page_global() can still recall for reclaiming in 'next'
      descendant in round robin way, each round scans for SGX_NR_SCAN pages
      from the head of 'next' cgroup's LRU.


>
>>
>> With these changes, the global reclamation and per-cgroup reclamation
>> both work properly with all pages tracked in per-cgroup LRUs.
>>
>
> [...]
>
>>
>> -static void sgx_reclaim_pages_global(struct mm_struct *charge_mm)
>> +static struct misc_cg *sgx_reclaim_pages_global(struct misc_cg  
>> *next_cg,
>> +						struct mm_struct *charge_mm)
>>   {
>> +	if (IS_ENABLED(CONFIG_CGROUP_MISC))
>> +		return sgx_cgroup_reclaim_pages(misc_cg_root(), next_cg, charge_mm);
>> +
>>   	sgx_reclaim_pages(&sgx_global_lru, charge_mm);
>> +	return NULL;
>>   }
>>
>>   /*
>> @@ -414,12 +443,35 @@ static void sgx_reclaim_pages_global(struct  
>> mm_struct *charge_mm)
>>    */
>>   void sgx_reclaim_direct(void)
>>   {
>> +	struct sgx_cgroup *sgx_cg = sgx_get_current_cg();
>> +	struct misc_cg *cg = misc_from_sgx(sgx_cg);
>
> From below @sgx_cg could be NULL.  It's not immediately clear whether  
> calling
> misc_from_sgx(sgx_cg) unconditionally is safe here.
>
> Leave the initiaization of @cg to a later phase where @sgx_cg is
> guaranteed not being NULL, or initialize @cg to NULL here and update  
> later.
>

Ok

>> +	struct misc_cg *next_cg = NULL;
>> +
>> +	/*
>> +	 * Make sure there are some free pages at both cgroup and global  
>> levels.
>> +	 * In both cases, only make one attempt of reclamation to avoid  
>> lengthy
>> +	 * block on the caller.
>> +	 */
>> +	if (sgx_cg && sgx_cgroup_should_reclaim(sgx_cg))
>> +		next_cg = sgx_cgroup_reclaim_pages(cg, next_cg, current->mm);
>
> I don't quite follow the logic.
>
> First of all, sgx_cgroup_reclaim_pages() isn't called in a loop, so why
> not just do:
>
> 	next_cg = sgx_cgroup_reclaim_pages(cg, NULL, current->mm);
>

Okay, I can replace it with NULL. I thought it was convenient to grep all  
such calls to see if they are used consistently.

> And what is the point of set @next_cg here, since ...
>
>
>> +
>> +	if (next_cg != cg)
>> +		put_misc_cg(next_cg);
>> +
>> +	next_cg = NULL;
>
> ... here @next_cg is reset to NULL ?
>
> Looks the only reason is you need to do ...
>
> 	put_misc_cg(next_cg);
>
> ... above?
>

yes

> This piece of code appears repeatedly in this file.  Is there any way we
> can get rid of it?
>

sgx_cgroup_reclaim_pages() returns 'next' in case caller needs to run it  
in loop. In this case, no.

I could create a thin wrapper for this case, but since there is only one  
such case, I did not fee it's worth the extra layer.

>>   	if (sgx_should_reclaim_global(SGX_NR_LOW_PAGES))
>> -		sgx_reclaim_pages_global(current->mm);
>> +		next_cg = sgx_reclaim_pages_global(next_cg, current->mm);
>
> And this doesn't seems "global reclaim" at all?
>
> Because it essentially equals to:
>
> 	next_cg = sgx_reclaim_pages_global(NULL, current->mm);
>
> which always reclaims from the ROOT.
>
> So each call to sgx_reclaim_direct() will always reclaim from the ROOT --
> any other LRUs in the hierarchy will unlikely to get any chance to be
> reclaimed.

No, the current cgroup is always checked and called to reclaim if needed.
Global reclaim only happens when system-wide water mark SGX_NR_LOW_PAGES  
is violated.

>
>> +
>> +	if (next_cg != misc_cg_root())
>> +		put_misc_cg(next_cg);
>> +
>> +	sgx_put_cg(sgx_cg);
>>   }
>>
>>   static int ksgxd(void *p)
>>   {
>> +	struct misc_cg *next_cg = NULL;
>> +
>>   	set_freezable();
>>
>>   	/*
>> @@ -437,11 +489,15 @@ static int ksgxd(void *p)
>>   				     kthread_should_stop() ||
>>   				     sgx_should_reclaim_global(SGX_NR_HIGH_PAGES));
>>
>> -		if (sgx_should_reclaim_global(SGX_NR_HIGH_PAGES))
>> +		while (!kthread_should_stop() &&  
>> sgx_should_reclaim_global(SGX_NR_HIGH_PAGES)) {
>>   			/* Indirect reclaim, no mm to charge, so NULL: */
>> -			sgx_reclaim_pages_global(NULL);
>> +			next_cg = sgx_reclaim_pages_global(next_cg, NULL);
>> +			cond_resched();
>
> Should the 'put_misc_cg(next_cg)' be done within the while() loop but not
> below?

No, whenever there is a loop, we want to reclaim in round robin way.

>> +		}
>>
>> -		cond_resched();
>> +		if (next_cg != misc_cg_root())
>> +			put_misc_cg(next_cg);
>> +		next_cg = NULL;
>
> Again, it doesn't seems "global reclaim" here, since you always restart
> from the ROOT once the target pages have been reclaimed.
>
> AFAICT it's completely different from the existing global reclaim.
>
Hopefully adding comments mentioned above clarifies this.

>>   	}
>>
>>   	return 0;
>> @@ -583,6 +639,7 @@ int sgx_unmark_page_reclaimable(struct sgx_epc_page  
>> *page)
>>    */
>>   struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim  
>> reclaim)
>>   {
>> +	struct misc_cg *next_cg = NULL;
>>   	struct sgx_cgroup *sgx_cg;
>>   	struct sgx_epc_page *page;
>>   	int ret;
>> @@ -616,10 +673,19 @@ struct sgx_epc_page *sgx_alloc_epc_page(void  
>> *owner, enum sgx_reclaim reclaim)
>>   			break;
>>   		}
>>
>> -		sgx_reclaim_pages_global(current->mm);
>> +		/*
>> +		 * At this point, the usage within this cgroup is under its
>> +		 * limit but there is no physical page left for allocation.
>> +		 * Perform a global reclaim to get some pages released from any
>> +		 * cgroup with reclaimable pages.
>> +		 */
>> +		next_cg = sgx_reclaim_pages_global(next_cg, current->mm);
>>   		cond_resched();
>>   	}
>
> Ditto IIUC.
>
>