[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <8e0bb87f05b79317a06ed2d8ab5e2f5cf6132b6a.camel@kernel.org>
Date: Wed, 10 Nov 2021 17:11:50 +0200
From: Jarkko Sakkinen <jarkko@...nel.org>
To: Reinette Chatre <reinette.chatre@...el.com>,
dave.hansen@...ux.intel.com, tglx@...utronix.de, bp@...en8.de,
mingo@...hat.com, linux-sgx@...r.kernel.org, x86@...nel.org
Cc: seanjc@...gle.com, tony.luck@...el.com, hpa@...or.com,
linux-kernel@...r.kernel.org, stable@...r.kernel.org
Subject: Re: [PATCH V2] x86/sgx: Fix free page accounting
On Tue, 2021-11-09 at 12:00 -0800, Reinette Chatre wrote:
> The SGX driver maintains a single global free page counter,
> sgx_nr_free_pages, that reflects the number of free pages available
> across all NUMA nodes. Correspondingly, a list of free pages is
> associated with each NUMA node and sgx_nr_free_pages is updated
> every time a page is added or removed from any of the free page
> lists. The main usage of sgx_nr_free_pages is by the reclaimer
> that will run when it (sgx_nr_free_pages) goes below a watermark
> to ensure that there are always some free pages available to, for
> example, support efficient page faults.
>
> With sgx_nr_free_pages accessed and modified from a few places
> it is essential to ensure that these accesses are done safely but
> this is not the case. sgx_nr_free_pages is read without any
> protection and updated with inconsistent protection by any one
> of the spin locks associated with the individual NUMA nodes.
> For example:
>
> CPU_A CPU_B
> ----- -----
> spin_lock(&nodeA->lock); spin_lock(&nodeB->lock);
> ... ...
> sgx_nr_free_pages--; /* NOT SAFE */ sgx_nr_free_pages--;
>
> spin_unlock(&nodeA->lock); spin_unlock(&nodeB->lock);
I don't understand the "NOT SAFE" part here. It is safe to access
the variable, even when it is not atomic...
I don't understand what the sequence above should tell me.
> The consequence of sgx_nr_free_pages not being protected is that
> its value may not accurately reflect the actual number of free
> pages on the system, impacting the availability of free pages in
> support of many flows. The problematic scenario isu when the
In non-atomicity is not a problem, when it is not a problem :-)
> reclaimer does not run because it believes there to be sufficient
> free pages while any attempt to allocate a page fails because there
> are no free pages available.
>
> The worst scenario observed was a user space hang because of
> repeated page faults caused by no free pages made available.
>
> The following flow was encountered:
> asm_exc_page_fault
> ...
> sgx_vma_fault()
> sgx_encl_load_page()
> sgx_encl_eldu() // Encrypted page needs to be loaded from backing
> // storage into newly allocated SGX memory page
> sgx_alloc_epc_page() // Allocate a page of SGX memory
> __sgx_alloc_epc_page() // Fails, no free SGX memory
> ...
> if (sgx_should_reclaim(SGX_NR_LOW_PAGES)) // Wake reclaimer
> wake_up(&ksgxd_waitq);
> return -EBUSY; // Return -EBUSY giving reclaimer time to run
> return -EBUSY;
> return -EBUSY;
> return VM_FAULT_NOPAGE;
>
> The reclaimer is triggered in above flow with the following code:
>
> static bool sgx_should_reclaim(unsigned long watermark)
> {
> return sgx_nr_free_pages < watermark &&
> !list_empty(&sgx_active_page_list);
> }
>
> In the problematic scenario there were no free pages available yet the
> value of sgx_nr_free_pages was above the watermark. The allocation of
> SGX memory thus always failed because of a lack of free pages while no
> free pages were made available because the reclaimer is never started
> because of sgx_nr_free_pages' incorrect value. The consequence was that
> user space kept encountering VM_FAULT_NOPAGE that caused the same
> address to be accessed repeatedly with the same result.
That causes sgx_should_reclaim() executed to be multiple times as the
fault is retried. Eventually it should be successful.
> Change the global free page counter to an atomic type that
> ensures simultaneous updates are done safely. While doing so, move
> the updating of the variable outside of the spin lock critical
> section to which it does not belong.
>
> Cc: stable@...r.kernel.org
> Fixes: 901ddbb9ecf5 ("x86/sgx: Add a basic NUMA allocation scheme to sgx_alloc_epc_page()")
> Suggested-by: Dave Hansen <dave.hansen@...ux.intel.com>
> Reviewed-by: Tony Luck <tony.luck@...el.com>
> Signed-off-by: Reinette Chatre <reinette.chatre@...el.com>
I'm not yet sure if this a bug, even if it'd be a improvement
to the performance.
/Jarkko
Powered by blists - more mailing lists