Message-ID: <Yw+KnRTrZ74qFUAA@xz-m1.local>
Date:   Wed, 31 Aug 2022 12:21:49 -0400
From:   Peter Xu <peterx@...hat.com>
To:     David Hildenbrand <david@...hat.com>
Cc:     John Hubbard <jhubbard@...dia.com>,
        Jason Gunthorpe <jgg@...dia.com>, linux-kernel@...r.kernel.org,
        linux-mm@...ck.org, Andrew Morton <akpm@...ux-foundation.org>,
        Mel Gorman <mgorman@...e.de>,
        "Matthew Wilcox (Oracle)" <willy@...radead.org>,
        Andrea Arcangeli <aarcange@...hat.com>,
        Hugh Dickins <hughd@...gle.com>
Subject: Re: [PATCH v1 2/3] mm/gup: use gup_can_follow_protnone() also in
 GUP-fast

On Tue, Aug 30, 2022 at 09:23:44PM +0200, David Hildenbrand wrote:
> On 30.08.22 21:18, John Hubbard wrote:
> > On 8/30/22 11:53, David Hildenbrand wrote:
> >> Good, I managed to attract the attention of someone who understands that machinery :)
> >>
> >> While validating whether GUP-fast and PageAnonExclusive code work correctly,
> >> I started looking at the whole RCU GUP-fast machinery. I do have a patch to
> >> improve PageAnonExclusive clearing (I think we're missing memory barriers to
> >> make it work as expected in all possible cases), but I also eventually stumbled
> >> over a more generic issue that might need memory barriers.
> >>
> >> Any thoughts whether I am missing something or this is actually missing
> >> memory barriers?
> >>
> > 
> > It's actually missing memory barriers.
> > 
> > In fact, others have had that same thought! [1] :) In that 2019 thread,
> > I recall that this got dismissed because of a focus on the IPI-based
> > aspect of gup fast synchronization (there was some hand waving, perhaps
> > accurate waving, about memory barriers vs. CPU interrupts). But now that the
> > RCU (non-IPI) implementation is more widely used than it used to be, the
> > issue is clearer.
> > 
> >>
> >> From ce8c941c11d1f60cea87a3e4d941041dc6b79900 Mon Sep 17 00:00:00 2001
> >> From: David Hildenbrand <david@...hat.com>
> >> Date: Mon, 29 Aug 2022 16:57:07 +0200
> >> Subject: [PATCH] mm/gup: update refcount+pincount before testing if the PTE
> >>  changed
> >>
> >> mm/ksm.c:write_protect_page() has to make sure that no unknown
> >> references to a mapped page exist and that no additional ones with write
> >> permissions are possible -- unknown references could have write permissions
> >> and modify the page afterwards.
> >>
> >> Conceptually, mm/ksm.c:write_protect_page() consists of:
> >>   (1) Clear/invalidate PTE
> >>   (2) Check if there are unknown references; back off if so.
> >>   (3) Update PTE (e.g., map it R/O)
> >>
> >> Conceptually, GUP-fast code consists of:
> >>   (1) Read the PTE
> >>   (2) Increment refcount/pincount of the mapped page
> >>   (3) Check if the PTE changed by re-reading it; back off if so.
> >>
> >> To make sure GUP-fast won't be able to grab additional references after
> >> clearing the PTE, but will properly detect the change and back off, we
> >> need a memory barrier between updating the refcount/pincount and checking
> >> whether the PTE changed.
> >>
> >> try_grab_folio() doesn't necessarily imply a memory barrier, so add an
> >> explicit smp_mb__after_atomic() after the atomic RMW operation to
> >> increment the refcount and pincount.
> >>
> >> ptep_clear_flush() used to clear the PTE and flush the TLB should imply
> >> a memory barrier for flushing the TLB, so don't add another one for now.
> >>
> >> PageAnonExclusive handling requires further care and will be handled
> >> separately.
> >>
> >> Fixes: 2667f50e8b81 ("mm: introduce a general RCU get_user_pages_fast()")
> >> Signed-off-by: David Hildenbrand <david@...hat.com>
> >> ---
> >>  mm/gup.c | 17 +++++++++++++++++
> >>  1 file changed, 17 insertions(+)
> >>
> >> diff --git a/mm/gup.c b/mm/gup.c
> >> index 5abdaf487460..0008b808f484 100644
> >> --- a/mm/gup.c
> >> +++ b/mm/gup.c
> >> @@ -2392,6 +2392,14 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
> >>  			goto pte_unmap;
> >>  		}
> >>  
> >> +		/*
> >> +		 * Update refcount/pincount before testing for changed PTE. This
> >> +		 * is required for code like mm/ksm.c:write_protect_page() that
> >> +		 * wants to make sure that a page has no unknown references
> >> +		 * after clearing the PTE.
> >> +		 */
> >> +		smp_mb__after_atomic();
> >> +
> >>  		if (unlikely(pte_val(pte) != pte_val(*ptep))) {
> >>  			gup_put_folio(folio, 1, flags);
> >>  			goto pte_unmap;
> >> @@ -2577,6 +2585,9 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
> >>  	if (!folio)
> >>  		return 0;
> >>  
> >> +	/* See gup_pte_range(). */
> > 
> > Don't we usually also identify what each mb pairs with, in the comments? That would help.
> 
> Yeah, if only I could locate them reliably (as documented, ptep_clear_flush()
> should imply one, I guess) ... but it will depend on the context.
> 
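
The way I'd describe the pairing for this patch, FWIW (hand-simplified from
mm/ksm.c:write_protect_page() and gup_pte_range() with your patch applied;
illustration only, not code from the tree):

  /* KSM side (write_protect_page(), simplified) */
  entry = ptep_clear_flush(vma, addr, ptep);        /* (1) clear PTE + flush TLB;
                                                            should imply a full barrier */
  if (page_mapcount(page) + 1 + swapped != page_count(page))
          goto back_off;                            /* (2) unknown references?   */
  set_pte_at(mm, addr, ptep, wrprotected_entry);    /* (3) map R/O               */

  /* GUP-fast side (gup_pte_range(), simplified) */
  pte = ptep_get_lockless(ptep);                    /* (1) read PTE              */
  folio = try_grab_folio(page, 1, flags);           /* (2) refcount/pincount     */
  smp_mb__after_atomic();                           /*     pairs with the barrier
                                                           implied by the flush   */
  if (unlikely(pte_val(pte) != pte_val(*ptep)))     /* (3) re-read PTE           */
          gup_put_folio(folio, 1, flags);           /*     back off              */

So either write_protect_page() observes the elevated refcount at its (2) and
backs off, or GUP-fast observes the cleared PTE at its (3) and backs off.
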
> 
> As I now have the attention of two people that understand that machinery,
> here goes the PageAnonExclusive thing I *think* should be correct.
> 
> The IPI-based mechanism really did make such synchronization with
> GUP-fast easier ...
> 
> 
> 
> From 8f91ef3555178149ad560b5424a9854b2ceee2d6 Mon Sep 17 00:00:00 2001
> From: David Hildenbrand <david@...hat.com>
> Date: Sat, 27 Aug 2022 10:44:13 +0200
> Subject: [PATCH] mm: rework PageAnonExclusive() interaction with GUP-fast
> 
> commit 6c287605fd56 ("mm: remember exclusively mapped anonymous pages with
> PG_anon_exclusive") made sure that when PageAnonExclusive() has to be
> cleared during temporary unmapping of a page, the PTE is
> cleared/invalidated and the TLB is flushed.
> 
> That handling was inspired by an outdated comment in
> mm/ksm.c:write_protect_page(), which similarly required the TLB flush in
> the past to synchronize with GUP-fast. However, ever since general RCU GUP
> fast was introduced in commit 2667f50e8b81 ("mm: introduce a general RCU
> get_user_pages_fast()"), a TLB flush is no longer sufficient, and no longer
> required, to synchronize with concurrent GUP-fast.
> 
> Peter pointed out that the TLB flush is not required, and looking into the
> details it turns out that he's right. To synchronize with GUP-fast, it's
> sufficient to clear the PTE only: GUP-fast will either detect that the PTE
> changed or that PageAnonExclusive is not set and back off. However, we
> rely on a given memory order and should make sure that that order is
> always respected.
> 
> Conceptually, GUP-fast pinning code of anon pages consists of:
>   (1) Read the PTE
>   (2) Pin the mapped page
>   (3) Check if the PTE changed by re-reading it; back off if so.
>   (4) Check if PageAnonExclusive is not set; back off if so.
> 
> Conceptually, PageAnonExclusive clearing code consists of:
>   (1) Clear PTE
>   (2) Check if the page is pinned; back off if so.
>   (3) Clear PageAnonExclusive
>   (4) Restore PTE (optional)
> 
> As GUP-fast temporarily pins the page before validating whether the PTE
> changed, and PageAnonExclusive clearing code clears the PTE before
> checking if the page is pinned, GUP-fast cannot end up pinning an anon
> page that is not exclusive.
> 
> One corner case to consider is when we restore the PTE to the same value
> after PageAnonExclusive was cleared, as it can happen in
> mm/ksm.c:write_protect_page(). In that case, GUP-fast might not detect
> a PTE change (because there was none). However, as restoring the PTE
> happens after clearing PageAnonExclusive, GUP-fast would detect that
> PageAnonExclusive was cleared in that case and would properly back off.
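
(Spelling that corner case out for myself, using the step numbers from above;
this is only my own interleaving sketch, not something from the patch:

  KSM/unmap side                        GUP-fast side
  (1) clear PTE
  (2) page not pinned -> continue
  (3) ClearPageAnonExclusive()
  (4) restore the original PTE value
                                        (1) read PTE (restored value)
                                        (2) pin the page
                                        (3) re-read PTE: unchanged, no back off
                                        (4) PageAnonExclusive cleared -> back off

so even without a visible PTE change, step (4) on the GUP-fast side catches it.)
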
> 
> Let's document that, avoid the TLB flush where possible and use proper
> explicit memory barriers where required. We shouldn't really care about the
> additional memory barriers here, as we're not on extremely hot paths.
> 
> The possible issues due to reordering are of a theoretical nature so far,
> but they had better be addressed.
> 
> Note that we don't need a memory barrier between checking if the page is
> pinned and clearing PageAnonExclusive, because stores are not
> speculated.
> 
> Fixes: 6c287605fd56 ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive")
> Signed-off-by: David Hildenbrand <david@...hat.com>
> ---
>  include/linux/mm.h   |  9 +++++--
>  include/linux/rmap.h | 58 ++++++++++++++++++++++++++++++++++++++++----
>  mm/huge_memory.c     |  3 +++
>  mm/ksm.c             |  1 +
>  mm/migrate_device.c  | 22 +++++++----------
>  mm/rmap.c            | 11 +++++----
>  6 files changed, 79 insertions(+), 25 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 21f8b27bd9fd..f7e8f4b34fb5 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2975,8 +2975,8 @@ static inline int vm_fault_to_errno(vm_fault_t vm_fault, int foll_flags)
>   * PageAnonExclusive() has to protect against concurrent GUP:
>   * * Ordinary GUP: Using the PT lock
>   * * GUP-fast and fork(): mm->write_protect_seq
> - * * GUP-fast and KSM or temporary unmapping (swap, migration):
> - *   clear/invalidate+flush of the page table entry
> + * * GUP-fast and KSM or temporary unmapping (swap, migration): see
> + *    page_try_share_anon_rmap()
>   *
>   * Must be called with the (sub)page that's actually referenced via the
>   * page table entry, which might not necessarily be the head page for a
> @@ -2997,6 +2997,11 @@ static inline bool gup_must_unshare(unsigned int flags, struct page *page)
>  	 */
>  	if (!PageAnon(page))
>  		return false;
> +
> +	/* See page_try_share_anon_rmap() for GUP-fast details. */
> +	if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) && irqs_disabled())
> +		smp_rmb();
> +
>  	/*
>  	 * Note that PageKsm() pages cannot be exclusive, and consequently,
>  	 * cannot get pinned.
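
If I read it right, this smp_rmb() is what orders the PageAnonExclusive test
after the pin and the PTE recheck on the GUP-fast path, pairing with the
smp_mb()/smp_mb__after_atomic() you add in page_try_share_anon_rmap() below.
Roughly (my own sketch, following the step numbering in your comment):

  /* GUP-fast with irqs disabled, simplified */
  folio = try_grab_folio(page, 1, flags);          /* (2) pin                    */
  if (unlikely(pte_val(pte) != pte_val(*ptep)))    /* (3) PTE recheck            */
          goto back_off;
  smp_rmb();                                       /* in gup_must_unshare()      */
  if (!PageAnonExclusive(page))                    /* (4) not exclusive anymore? */
          goto back_off;                           /*     back off               */
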
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index bf80adca980b..454c159f2aae 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -267,7 +267,7 @@ static inline int page_try_dup_anon_rmap(struct page *page, bool compound,
>   * @page: the exclusive anonymous page to try marking possibly shared
>   *
>   * The caller needs to hold the PT lock and has to have the page table entry
> - * cleared/invalidated+flushed, to properly sync against GUP-fast.
> + * cleared/invalidated.
>   *
>   * This is similar to page_try_dup_anon_rmap(), however, not used during fork()
>   * to duplicate a mapping, but instead to prepare for KSM or temporarily
> @@ -283,12 +283,60 @@ static inline int page_try_share_anon_rmap(struct page *page)
>  {
>  	VM_BUG_ON_PAGE(!PageAnon(page) || !PageAnonExclusive(page), page);
>  
> -	/* See page_try_dup_anon_rmap(). */
> -	if (likely(!is_device_private_page(page) &&
> -	    unlikely(page_maybe_dma_pinned(page))))
> -		return -EBUSY;
> +	/* device private pages cannot get pinned via GUP. */
> +	if (unlikely(is_device_private_page(page))) {
> +		ClearPageAnonExclusive(page);
> +		return 0;
> +	}
>  
> +	/*
> +	 * We have to make sure that while we clear PageAnonExclusive, that
> +	 * the page is not pinned and that concurrent GUP-fast won't succeed in
> +	 * concurrently pinning the page.
> +	 *
> +	 * Conceptually, GUP-fast pinning code of anon pages consists of:
> +	 * (1) Read the PTE
> +	 * (2) Pin the mapped page
> +	 * (3) Check if the PTE changed by re-reading it; back off if so.
> +	 * (4) Check if PageAnonExclusive is not set; back off if so.
> +	 *
> +	 * Conceptually, PageAnonExclusive clearing code consists of:
> +	 * (1) Clear PTE
> +	 * (2) Check if the page is pinned; back off if so.
> +	 * (3) Clear PageAnonExclusive
> +	 * (4) Restore PTE (optional)
> +	 *
> +	 * In GUP-fast, we have to make sure that (2),(3) and (4) happen in
> +	 * the right order. Memory order between (2) and (3) is handled by
> +	 * GUP-fast, independent of PageAnonExclusive.
> +	 *
> +	 * When clearing PageAnonExclusive(), we have to make sure that (1),
> +	 * (2), (3) and (4) happen in the right order.
> +	 *
> +	 * Note that (4) has to happen after (3) in both cases to handle the
> +	 * corner case whereby the PTE is restored to the original value after
> +	 * clearing PageAnonExclusive: while GUP-fast might not detect the
> +	 * PTE change, it will detect the PageAnonExclusive change.
> +	 *
> +	 * We assume that there might not be a memory barrier after
> +	 * clearing/invalidating the PTE (1) and before restoring the PTE (4),
> +	 * so we use explicit ones here.
> +	 *
> +	 * These memory barriers are paired with memory barriers in GUP-fast
> +	 * code, including gup_must_unshare().
> +	 */
> +
> +	/* Clear/invalidate the PTE before checking for PINs. */
> +	if (IS_ENABLED(CONFIG_HAVE_FAST_GUP))
> +		smp_mb();

Wondering whether this could be smp_mb__before_atomic().

> +
> +	if (unlikely(page_maybe_dma_pinned(page)))
> +		return -EBUSY;
>  	ClearPageAnonExclusive(page);
> +
> +	/* Clear PageAnonExclusive() before eventually restoring the PTE. */
> +	if (IS_ENABLED(CONFIG_HAVE_FAST_GUP))
> +		smp_mb__after_atomic();
>  	return 0;
>  }
>  
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index e9414ee57c5b..2aef8d76fcf2 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2140,6 +2140,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>  		 *
>  		 * In case we cannot clear PageAnonExclusive(), split the PMD
>  		 * only and let try_to_migrate_one() fail later.
> +		 *
> +		 * See page_try_share_anon_rmap(): invalidate PMD first.
>  		 */
>  		anon_exclusive = PageAnon(page) && PageAnonExclusive(page);
>  		if (freeze && anon_exclusive && page_try_share_anon_rmap(page))
> @@ -3177,6 +3179,7 @@ int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
>  	flush_cache_range(vma, address, address + HPAGE_PMD_SIZE);
>  	pmdval = pmdp_invalidate(vma, address, pvmw->pmd);
>  
> +	/* See page_try_share_anon_rmap(): invalidate PMD first. */
>  	anon_exclusive = PageAnon(page) && PageAnonExclusive(page);
>  	if (anon_exclusive && page_try_share_anon_rmap(page)) {
>  		set_pmd_at(mm, address, pvmw->pmd, pmdval);
> diff --git a/mm/ksm.c b/mm/ksm.c
> index d7526c705081..971cf923c0eb 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -1091,6 +1091,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
>  			goto out_unlock;
>  		}
>  
> +		/* See page_try_share_anon_rmap(): clear PTE first. */
>  		if (anon_exclusive && page_try_share_anon_rmap(page)) {
>  			set_pte_at(mm, pvmw.address, pvmw.pte, entry);
>  			goto out_unlock;
> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> index 27fb37d65476..47e955212f15 100644
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -193,20 +193,16 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>  			bool anon_exclusive;
>  			pte_t swp_pte;
>  

flush_cache_page() missing here?

Better copy Alistair too when posting formally, since this will have a slight
conflict with the other thread.
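
E.g. something like the below (untested, same as what the removed
anon_exclusive branch did), just to illustrate:

	flush_cache_page(vma, addr, pte_pfn(*ptep));
	ptep_get_and_clear(mm, addr, ptep);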

> +			ptep_get_and_clear(mm, addr, ptep);
> +
> +			/* See page_try_share_anon_rmap(): clear PTE first. */
>  			anon_exclusive = PageAnon(page) && PageAnonExclusive(page);
> -			if (anon_exclusive) {
> -				flush_cache_page(vma, addr, pte_pfn(*ptep));
> -				ptep_clear_flush(vma, addr, ptep);
> -
> -				if (page_try_share_anon_rmap(page)) {
> -					set_pte_at(mm, addr, ptep, pte);
> -					unlock_page(page);
> -					put_page(page);
> -					mpfn = 0;
> -					goto next;
> -				}
> -			} else {
> -				ptep_get_and_clear(mm, addr, ptep);
> +			if (anon_exclusive && page_try_share_anon_rmap(page)) {
> +				set_pte_at(mm, addr, ptep, pte);
> +				unlock_page(page);
> +				put_page(page);
> +				mpfn = 0;
> +				goto next;
>  			}
>  
>  			migrate->cpages++;

Thanks,

-- 
Peter Xu
