linux-kernel - Re: [PATCH V2] x86/sgx: Fix free page accounting

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <8e0bb87f05b79317a06ed2d8ab5e2f5cf6132b6a.camel@kernel.org>
Date:   Wed, 10 Nov 2021 17:11:50 +0200
From:   Jarkko Sakkinen <jarkko@...nel.org>
To:     Reinette Chatre <reinette.chatre@...el.com>,
        dave.hansen@...ux.intel.com, tglx@...utronix.de, bp@...en8.de,
        mingo@...hat.com, linux-sgx@...r.kernel.org, x86@...nel.org
Cc:     seanjc@...gle.com, tony.luck@...el.com, hpa@...or.com,
        linux-kernel@...r.kernel.org, stable@...r.kernel.org
Subject: Re: [PATCH V2] x86/sgx: Fix free page accounting

On Tue, 2021-11-09 at 12:00 -0800, Reinette Chatre wrote:
> The SGX driver maintains a single global free page counter,
> sgx_nr_free_pages, that reflects the number of free pages available
> across all NUMA nodes. Correspondingly, a list of free pages is
> associated with each NUMA node and sgx_nr_free_pages is updated
> every time a page is added or removed from any of the free page
> lists. The main usage of sgx_nr_free_pages is by the reclaimer
> that will run when it (sgx_nr_free_pages) goes below a watermark
> to ensure that there are always some free pages available to, for
> example, support efficient page faults.
> 
> With sgx_nr_free_pages accessed and modified from a few places
> it is essential to ensure that these accesses are done safely but
> this is not the case. sgx_nr_free_pages is read without any
> protection and updated with inconsistent protection by any one
> of the spin locks associated with the individual NUMA nodes.
> For example:
> 
>       CPU_A                                 CPU_B
>       -----                                 -----
>  spin_lock(&nodeA->lock);              spin_lock(&nodeB->lock);
>  ...                                   ...
>  sgx_nr_free_pages--;  /* NOT SAFE */  sgx_nr_free_pages--;
> 
>  spin_unlock(&nodeA->lock);            spin_unlock(&nodeB->lock);

I don't understand the "NOT SAFE" part here. It is safe to access
the variable, even when it is not atomic...

I don't understand what the sequence above should tell me.

> The consequence of sgx_nr_free_pages not being protected is that
> its value may not accurately reflect the actual number of free
> pages on the system, impacting the availability of free pages in
> support of many flows. The problematic scenario isu when the

In non-atomicity is not a problem, when it is not a problem :-)

> reclaimer does not run because it believes there to be sufficient
> free pages while any attempt to allocate a page fails because there
> are no free pages available.
> 
> The worst scenario observed was a user space hang because of
> repeated page faults caused by no free pages made available.
> 
> The following flow was encountered:
> asm_exc_page_fault
>  ...
>    sgx_vma_fault()
>      sgx_encl_load_page()
>        sgx_encl_eldu() // Encrypted page needs to be loaded from backing
>                        // storage into newly allocated SGX memory page
>          sgx_alloc_epc_page() // Allocate a page of SGX memory
>            __sgx_alloc_epc_page() // Fails, no free SGX memory
>            ...
>            if (sgx_should_reclaim(SGX_NR_LOW_PAGES)) // Wake reclaimer
>              wake_up(&ksgxd_waitq);
>            return -EBUSY; // Return -EBUSY giving reclaimer time to run
>        return -EBUSY;
>      return -EBUSY;
>    return VM_FAULT_NOPAGE;
> 
> The reclaimer is triggered in above flow with the following code:
> 
> static bool sgx_should_reclaim(unsigned long watermark)
> {
>         return sgx_nr_free_pages < watermark &&
>                !list_empty(&sgx_active_page_list);
> }
> 
> In the problematic scenario there were no free pages available yet the
> value of sgx_nr_free_pages was above the watermark. The allocation of
> SGX memory thus always failed because of a lack of free pages while no
> free pages were made available because the reclaimer is never started
> because of sgx_nr_free_pages' incorrect value. The consequence was that
> user space kept encountering VM_FAULT_NOPAGE that caused the same
> address to be accessed repeatedly with the same result.

That causes sgx_should_reclaim() executed to be multiple times as the
fault is retried. Eventually it should be successful.

> Change the global free page counter to an atomic type that
> ensures simultaneous updates are done safely. While doing so, move
> the updating of the variable outside of the spin lock critical
> section to which it does not belong.
> 
> Cc: stable@...r.kernel.org
> Fixes: 901ddbb9ecf5 ("x86/sgx: Add a basic NUMA allocation scheme to sgx_alloc_epc_page()")
> Suggested-by: Dave Hansen <dave.hansen@...ux.intel.com>
> Reviewed-by: Tony Luck <tony.luck@...el.com>
> Signed-off-by: Reinette Chatre <reinette.chatre@...el.com>

I'm not yet sure if this a bug, even if it'd be a improvement
to the performance.

/Jarkko