Message-ID: <22eaec04-a950-413e-b9a0-885a077475e8@intel.com>
Date: Fri, 10 May 2024 16:47:50 -0700
From: Reinette Chatre <reinette.chatre@...el.com>
To: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@...el.com>, <jarkko@...nel.org>
CC: <dave.hansen@...ux.intel.com>, <haitao.huang@...ux.intel.com>,
<kai.huang@...el.com>, <kailun.qin@...el.com>,
<linux-kernel@...r.kernel.org>, <linux-sgx@...r.kernel.org>,
<mona.vij@...el.com>, <stable@...r.kernel.org>
Subject: Re: [PATCH 2/2] x86/sgx: Resolve EREMOVE page vs EAUG page data race

Hi Dmitrii,

Thank you very much for uncovering and fixing this issue.

On 4/30/2024 7:38 AM, Dmitrii Kuvaiskii wrote:
> On Mon, Apr 29, 2024 at 04:11:03PM +0300, Jarkko Sakkinen wrote:
>> On Mon Apr 29, 2024 at 1:43 PM EEST, Dmitrii Kuvaiskii wrote:
>>> Two enclave threads may try to add and remove the same enclave page
>>> simultaneously (e.g., if the SGX runtime supports both lazy allocation
>>> and `MADV_DONTNEED` semantics). Consider this race:
>>>
>>> 1. T1 performs page removal in sgx_encl_remove_pages() and stops right
>>> after removing the page table entry and right before re-acquiring the
>>> enclave lock to EREMOVE and xa_erase(&encl->page_array) the page.
>>> 2. T2 tries to access the page, and #PF[not_present] is raised. The
>>> condition to EAUG in sgx_vma_fault() is not satisfied because the
>>> page is still present in encl->page_array, thus the SGX driver
>>> assumes that the fault happened because the page was swapped out. The
>>> driver continues on a code path that installs a page table entry
>>> *without* performing EAUG.
>>> 3. The enclave page metadata is in an inconsistent state: the PTE is
>>> installed but there was no EAUG. Thus, T2 in user space endlessly
>>> receives SIGSEGV on this page (and EACCEPT always fails).
>>>
>>> Fix this by making sure that T1 (the page-removing thread) always wins
>>> this data race. In particular, the page being removed is marked as such,
>>> and T2 retries until the page is fully removed.
>>>
>>> Fixes: 9849bb27152c ("x86/sgx: Support complete page removal")
>>> Cc: stable@...r.kernel.org
>>> Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@...el.com>
>>> ---
>>> arch/x86/kernel/cpu/sgx/encl.c | 3 ++-
>>> arch/x86/kernel/cpu/sgx/encl.h | 3 +++
>>> arch/x86/kernel/cpu/sgx/ioctl.c | 1 +
>>> 3 files changed, 6 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
>>> index 41f14b1a3025..7ccd8b2fce5f 100644
>>> --- a/arch/x86/kernel/cpu/sgx/encl.c
>>> +++ b/arch/x86/kernel/cpu/sgx/encl.c
>>> @@ -257,7 +257,8 @@ static struct sgx_encl_page *__sgx_encl_load_page(struct sgx_encl *encl,
>>>
>>> /* Entry successfully located. */
>>> if (entry->epc_page) {
>>> - if (entry->desc & SGX_ENCL_PAGE_BEING_RECLAIMED)
>>> + if (entry->desc & (SGX_ENCL_PAGE_BEING_RECLAIMED |
>>> + SGX_ENCL_PAGE_BEING_REMOVED))
>>> return ERR_PTR(-EBUSY);
>>>
>>> return entry;
>>> diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
>>> index f94ff14c9486..fff5f2293ae7 100644
>>> --- a/arch/x86/kernel/cpu/sgx/encl.h
>>> +++ b/arch/x86/kernel/cpu/sgx/encl.h
>>> @@ -25,6 +25,9 @@
>>> /* 'desc' bit marking that the page is being reclaimed. */
>>> #define SGX_ENCL_PAGE_BEING_RECLAIMED BIT(3)
>>>
>>> +/* 'desc' bit marking that the page is being removed. */
>>> +#define SGX_ENCL_PAGE_BEING_REMOVED BIT(2)
>>> +
>>> struct sgx_encl_page {
>>> unsigned long desc;
>>> unsigned long vm_max_prot_bits:8;
>>> diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
>>> index b65ab214bdf5..c542d4dd3e64 100644
>>> --- a/arch/x86/kernel/cpu/sgx/ioctl.c
>>> +++ b/arch/x86/kernel/cpu/sgx/ioctl.c
>>> @@ -1142,6 +1142,7 @@ static long sgx_encl_remove_pages(struct sgx_encl *encl,
>>> * Do not keep encl->lock because of dependency on
>>> * mmap_lock acquired in sgx_zap_enclave_ptes().
>>> */
>>> + entry->desc |= SGX_ENCL_PAGE_BEING_REMOVED;
>>> mutex_unlock(&encl->lock);
>>>
>>> sgx_zap_enclave_ptes(encl, addr);
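To make the new check concrete, the decision the patch adds to
__sgx_encl_load_page() can be lifted into a stand-alone function. This is
a sketch only: toy_check_resident_page() is a hypothetical name, BIT() is
redefined locally, and only the two flag values come from the patch.

```c
#include <assert.h>
#include <errno.h>

#define BIT(n) (1UL << (n))

/* Flag values as defined in the patch (arch/x86/kernel/cpu/sgx/encl.h). */
#define SGX_ENCL_PAGE_BEING_RECLAIMED BIT(3)
#define SGX_ENCL_PAGE_BEING_REMOVED   BIT(2)

/* The two 'desc' flags must be distinct bits for the combined test. */
static_assert((SGX_ENCL_PAGE_BEING_RECLAIMED & SGX_ENCL_PAGE_BEING_REMOVED) == 0,
	      "desc flags must not overlap");

/*
 * Hypothetical stand-in for the "entry successfully located" branch of
 * __sgx_encl_load_page(), for a page that is resident in the EPC.
 * Returns 0 if the caller may use the page, or -EBUSY if the page is
 * being reclaimed or removed and the fault should be retried.
 */
static int toy_check_resident_page(unsigned long desc)
{
	if (desc & (SGX_ENCL_PAGE_BEING_RECLAIMED | SGX_ENCL_PAGE_BEING_REMOVED))
		return -EBUSY;
	return 0;
}
```

My reading of sgx_vma_fault() is that -EBUSY from this path becomes
VM_FAULT_NOPAGE, so the faulting access is simply retried; that is how T2
ends up waiting for T1 rather than failing.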
>>
>> It is somewhat trivial to NAK this, as the commit message makes no
>> effort to describe the new flag. At least by default, I am strongly
>> opposed to any new flags related to reclaiming, even if avoiding them
>> means a bit of extra synchronization work in user space.
>>
>> One way to describe concurrency scenarios would be to follow the
>> examples in https://www.kernel.org/doc/Documentation/memory-barriers.txt
>>
>> I.e., see the examples with CPU 1 and CPU 2.
>
> Thank you for the suggestion. Here is my new attempt at describing the racy
> scenario:
>
> Consider some enclave page added to the enclave. User space decides to
> temporarily remove this page (e.g., emulating the MADV_DONTNEED semantics)
> on CPU1. At the same time, user space performs a memory access on the same
> page on CPU2, which results in a #PF and ultimately in sgx_vma_fault().
> The scenario proceeds as follows:
>
> /*
>  * CPU1: User space performs
>  * ioctl(SGX_IOC_ENCLAVE_REMOVE_PAGES)
>  * on a single enclave page
>  */
> sgx_encl_remove_pages() {
>
>   mutex_lock(&encl->lock);
>
>   entry = sgx_encl_load_page(encl);
>   /*
>    * verify that page is
>    * trimmed and accepted
>    */
>
>   mutex_unlock(&encl->lock);
>
>   /*
>    * remove PTE entry; cannot
>    * be performed under lock
>    */
>   sgx_zap_enclave_ptes(encl);
>                                 /*
>                                  * Fault on CPU2
>                                  */

Please highlight that this fault is related to the page that
is in the process of being removed on CPU1.

>                                 sgx_vma_fault() {
>                                   /*
>                                    * PTE entry was removed, but the
>                                    * page is still in enclave's xarray
>                                    */
>                                   xa_load(&encl->page_array) != NULL ->
>                                   /*
>                                    * SGX driver thinks that this page
>                                    * was swapped out and loads it
>                                    */
>                                   mutex_lock(&encl->lock);
>                                   /*
>                                    * this is effectively a no-op
>                                    */
>                                   entry = sgx_encl_load_page_in_vma();
>                                   /*
>                                    * add PTE entry
>                                    */

It may be helpful to highlight that this is a problem: "BUG: A PTE
is installed for a page in the process of being removed." (please feel
free to expand)

>                                   vmf_insert_pfn(...);
>
>                                   mutex_unlock(&encl->lock);
>                                   return VM_FAULT_NOPAGE;
>                                 }
>   /*
>    * continue with page removal
>    */
>   mutex_lock(&encl->lock);
>
>   sgx_encl_free_epc_page(epc_page) {
>     /*
>      * remove page via EREMOVE
>      */
>     /*
>      * free EPC page
>      */
>     sgx_free_epc_page(epc_page);
>   }
>
>   xa_erase(&encl->page_array);
>
>   mutex_unlock(&encl->lock);
> }
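The interleaving in the diagram above can be replayed deterministically
with a toy model of a single page. This is only a sketch under stated
assumptions: struct toy_page and the helpers are hypothetical, they track
just three facts (entry in encl->page_array, page backed by an added EPC
page, PTE present), and no locking is modelled because the steps simply
run in the order the diagram shows.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one enclave page; not the driver's structures. */
struct toy_page {
	bool in_xarray;	/* entry present in encl->page_array */
	bool in_epc;	/* backed by an added (EADD/EAUG) EPC page */
	bool pte;	/* a PTE maps the page */
};

/* CPU2: sgx_vma_fault() finds the entry, treats it as swapped out, maps it. */
static void toy_fault_existing_entry(struct toy_page *p)
{
	assert(p->in_xarray);	/* only reached when xa_load() != NULL */
	p->pte = true;		/* vmf_insert_pfn(), but no EAUG */
}

/* Replays the diagram in order and returns the final page state. */
static struct toy_page toy_replay_diagram(void)
{
	/* The page was added earlier, then trimmed and accepted. */
	struct toy_page p = { .in_xarray = true, .in_epc = true, .pte = true };

	/* CPU1: verify under encl->lock, then mutex_unlock(). */
	/* CPU1: sgx_zap_enclave_ptes() removes the PTE. */
	p.pte = false;
	/* CPU2: #PF on the same page; xa_load() still finds the entry. */
	if (!p.pte && p.in_xarray)
		toy_fault_existing_entry(&p);
	/* CPU1: sgx_encl_free_epc_page() (EREMOVE) and xa_erase(). */
	p.in_epc = false;
	p.in_xarray = false;
	return p;
}
```

The final state has a PTE for a page that is no longer backed by an added
EPC page and is no longer tracked in the xarray, which is the "mapped but
not EAUGed" state the next paragraph describes.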
>
> CPU1 removed the page. However, CPU2 installed the PTE entry on the
> same page. This enclave page becomes perpetually inaccessible (until
> another SGX_IOC_ENCLAVE_REMOVE_PAGES ioctl). This is because the page is
> marked accessible in the PTE entry but is not EAUGed. Because of this
> combination, any subsequent access to this page raises a fault, and the #PF
> handler sees the SGX bit set in the #PF error code and does not call

Which #PF handler?

> sgx_vma_fault() but instead raises a SIGSEGV. The user space SIGSEGV handler
> cannot perform EACCEPT because the page was not EAUGed. Thus, user
> space is stuck with the inaccessible page.
>
> This race can be fixed by forcing the fault handler on CPU2 to back off if
> the page is currently being removed (on CPU1). A simple change is thus to
> introduce a new flag, SGX_ENCL_PAGE_BEING_REMOVED, which is unset by default
> and set only right before the first mutex_unlock() in
> sgx_encl_remove_pages(). Upon loading the page, CPU2 checks whether this
> page is being removed, and if so, CPU2 backs off and waits until the
> page is completely removed. After that, any memory access to this page
> results in a normal "allocate and EAUG a page on #PF" flow.
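In the spirit of the CPU 1 / CPU 2 examples Jarkko pointed to, one way to
sanity-check the proposed flag is to enumerate every point at which CPU2's
access can land relative to CPU1's lock-delimited steps. The sketch below
is a hypothetical toy model (struct toy_page, toy_access(), toy_run() and
toy_count_bad() are not driver code). It assumes each CPU1 step and each
CPU2 fault run atomically with respect to one another, and that a fault
that backs off is retried after CPU1's next step.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical single-page model; not the driver's data structures. */
struct toy_page {
	bool in_xarray;		/* entry present in encl->page_array */
	bool in_epc;		/* backed by an added (EADD/EAUG) EPC page */
	bool pte;		/* a PTE maps the page */
	bool being_removed;	/* models SGX_ENCL_PAGE_BEING_REMOVED */
};

/*
 * One CPU2 access. Returns false if the fault handler backed off
 * (-EBUSY -> VM_FAULT_NOPAGE), i.e. the access must be retried.
 */
static bool toy_access(struct toy_page *p, bool with_flag)
{
	if (p->pte)
		return true;		/* no #PF reaches the driver */
	if (!p->in_xarray) {
		/* Dynamic add: allocate, EAUG and map a fresh page. */
		p->in_xarray = p->in_epc = p->pte = true;
		return true;
	}
	if (with_flag && p->being_removed)
		return false;		/* back off and retry later */
	p->pte = true;			/* "swapped out" path, no EAUG */
	return true;
}

#define TOY_CPU1_STEPS 3

/* CPU1 steps of sgx_encl_remove_pages(), each atomic w.r.t. toy_access(). */
static void toy_cpu1_step(struct toy_page *p, int step, bool with_flag)
{
	switch (step) {
	case 0:		/* checks under encl->lock, then mutex_unlock() */
		if (with_flag)
			p->being_removed = true;
		break;
	case 1:		/* sgx_zap_enclave_ptes() */
		p->pte = false;
		break;
	case 2:		/* EREMOVE, free EPC page, xa_erase(); entry is gone */
		p->in_epc = p->in_xarray = p->being_removed = false;
		break;
	default:
		break;
	}
}

/*
 * CPU2 first accesses the page just before CPU1 step 'start' (or after
 * the last step when start == TOY_CPU1_STEPS), retrying after each later
 * step while it backs off. Returns true if the final state is consistent:
 * no PTE maps a page that is not backed by an added EPC page.
 */
static bool toy_run(int start, bool with_flag)
{
	struct toy_page p = { .in_xarray = true, .in_epc = true, .pte = true };
	bool done = false;

	assert(start >= 0 && start <= TOY_CPU1_STEPS);
	for (int step = 0; step <= TOY_CPU1_STEPS; step++) {
		if (!done && step >= start)
			done = toy_access(&p, with_flag);
		if (step < TOY_CPU1_STEPS)
			toy_cpu1_step(&p, step, with_flag);
	}
	return !p.pte || p.in_epc;
}

/* Counts the start positions that end in an inconsistent state. */
static int toy_count_bad(bool with_flag)
{
	int bad = 0;

	for (int start = 0; start <= TOY_CPU1_STEPS; start++)
		if (!toy_run(start, with_flag))
			bad++;
	return bad;
}
```

In this coarse model, without the flag exactly one placement (CPU2
faulting between the PTE zap and the final mutex_lock()) leaves a PTE over
a removed page; with the flag, the backed-off fault is retried after
xa_erase() and takes the EAUG path instead. The model deliberately ignores
what user space sees if it touches the page while the old PTE still exists.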

I have been tripped up by these page flags before, so I would appreciate
another opinion. From my side, this looks like an appropriate fix.

Reinette