Message-ID: <Yw+KnRTrZ74qFUAA@xz-m1.local>
Date: Wed, 31 Aug 2022 12:21:49 -0400
From: Peter Xu <peterx@...hat.com>
To: David Hildenbrand <david@...hat.com>
Cc: John Hubbard <jhubbard@...dia.com>,
Jason Gunthorpe <jgg@...dia.com>, linux-kernel@...r.kernel.org,
linux-mm@...ck.org, Andrew Morton <akpm@...ux-foundation.org>,
Mel Gorman <mgorman@...e.de>,
"Matthew Wilcox (Oracle)" <willy@...radead.org>,
Andrea Arcangeli <aarcange@...hat.com>,
Hugh Dickins <hughd@...gle.com>
Subject: Re: [PATCH v1 2/3] mm/gup: use gup_can_follow_protnone() also in
GUP-fast
On Tue, Aug 30, 2022 at 09:23:44PM +0200, David Hildenbrand wrote:
> On 30.08.22 21:18, John Hubbard wrote:
> > On 8/30/22 11:53, David Hildenbrand wrote:
> >> Good, I managed to attract the attention of someone who understands that machinery :)
> >>
> >> While validating whether GUP-fast and PageAnonExclusive code work correctly,
> >> I started looking at the whole RCU GUP-fast machinery. I do have a patch to
> >> improve PageAnonExclusive clearing (I think we're missing memory barriers to
> >> make it work as expected in all possible cases), but I also eventually
> >> stumbled over a more generic issue that might need memory barriers.
> >>
> >> Any thoughts on whether I am missing something, or whether this is
> >> actually missing memory barriers?
> >>
> >
> > It's actually missing memory barriers.
> >
> > In fact, others have had that same thought! [1] :) In that 2019 thread,
> > I recall that this got dismissed because of a focus on the IPI-based
> > aspect of GUP-fast synchronization (there was some hand waving, perhaps
> > accurate waving, about memory barriers vs. CPU interrupts). But now that
> > the RCU (non-IPI) implementation is more widely used than it used to be,
> > the issue is clearer.
> >
> >>
> >> From ce8c941c11d1f60cea87a3e4d941041dc6b79900 Mon Sep 17 00:00:00 2001
> >> From: David Hildenbrand <david@...hat.com>
> >> Date: Mon, 29 Aug 2022 16:57:07 +0200
> >> Subject: [PATCH] mm/gup: update refcount+pincount before testing if the PTE
> >> changed
> >>
> >> mm/ksm.c:write_protect_page() has to make sure that no unknown
> >> references to a mapped page exist and that no additional ones with write
> >> permissions are possible -- unknown references could have write permissions
> >> and modify the page afterwards.
> >>
> >> Conceptually, mm/ksm.c:write_protect_page() consists of:
> >> (1) Clear/invalidate PTE
> >> (2) Check if there are unknown references; back off if so.
> >> (3) Update PTE (e.g., map it R/O)
> >>
> >> Conceptually, GUP-fast code consists of:
> >> (1) Read the PTE
> >> (2) Increment refcount/pincount of the mapped page
> >> (3) Check if the PTE changed by re-reading it; back off if so.
> >>
> >> To make sure GUP-fast won't be able to grab additional references after
> >> clearing the PTE, but will properly detect the change and back off, we
> >> need a memory barrier between updating the refcount/pincount and checking
> >> if it changed.
> >>
> >> try_grab_folio() doesn't necessarily imply a memory barrier, so add an
> >> explicit smp_mb__after_atomic() after the atomic RMW operation to
> >> increment the refcount and pincount.
> >>
> >> ptep_clear_flush(), used to clear the PTE and flush the TLB, should imply
> >> a memory barrier for flushing the TLB, so don't add another one for now.
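
To double-check the intended pairing: this is the classic store-buffering
pattern, right?  Below is a litmus-test sketch of my understanding --
hand-made, names invented, and assuming ptep_clear_flush() really does
imply a full barrier on the KSM side:

C ksm-vs-gup-fast-sb

(*
 * Bad outcome: write_protect_page() sees no extra reference (0:r0=0)
 * while GUP-fast's re-check still saw the old PTE (1:r2=1).
 *)

{
	pte=1;
	atomic_t refs = ATOMIC_INIT(0);
}

(* mm/ksm.c:write_protect_page(), simplified *)
P0(int *pte, atomic_t *refs)
{
	int r0;

	WRITE_ONCE(*pte, 0);		/* (1) clear/invalidate the PTE */
	smp_mb();			/* implied by ptep_clear_flush() */
	r0 = atomic_read(refs);		/* (2) check for unknown references */
}

(* gup_pte_range(), simplified *)
P1(int *pte, atomic_t *refs)
{
	int r1;
	int r2;

	r1 = READ_ONCE(*pte);		/* (1) read the PTE */
	atomic_inc(refs);		/* (2) grab a reference */
	smp_mb__after_atomic();		/* the barrier this patch adds */
	r2 = READ_ONCE(*pte);		/* (3) re-check the PTE */
}

exists (0:r0=0 /\ 1:r2=1)

If I got the model right, herd7 should report the exists clause as
forbidden with the new barrier in place and allowed without it.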
> >>
> >> PageAnonExclusive handling requires further care and will be handled
> >> separately.
> >>
> >> Fixes: 2667f50e8b81 ("mm: introduce a general RCU get_user_pages_fast()")
> >> Signed-off-by: David Hildenbrand <david@...hat.com>
> >> ---
> >> mm/gup.c | 17 +++++++++++++++++
> >> 1 file changed, 17 insertions(+)
> >>
> >> diff --git a/mm/gup.c b/mm/gup.c
> >> index 5abdaf487460..0008b808f484 100644
> >> --- a/mm/gup.c
> >> +++ b/mm/gup.c
> >> @@ -2392,6 +2392,14 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
> >> goto pte_unmap;
> >> }
> >>
> >> + /*
> >> + * Update refcount/pincount before testing for changed PTE. This
> >> + * is required for code like mm/ksm.c:write_protect_page() that
> >> + * wants to make sure that a page has no unknown references
> >> + * after clearing the PTE.
> >> + */
> >> + smp_mb__after_atomic();
> >> +
> >> if (unlikely(pte_val(pte) != pte_val(*ptep))) {
> >> gup_put_folio(folio, 1, flags);
> >> goto pte_unmap;
> >> @@ -2577,6 +2585,9 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
> >> if (!folio)
> >> return 0;
> >>
> >> + /* See gup_pte_range(). */
> >
> > Don't we usually also identify what each mb pairs with, in the comments? That would help.
>
> Yeah, if only I could locate them reliably (as documented, ptep_clear_flush()
> should imply one, I guess) ... but it will depend on the context.
>
>
> As I now have the attention of two people that understand that machinery,
> here goes the PageAnonExclusive thing I *think* should be correct.
>
> The IPI-based mechanism really did make such synchronization with
> GUP-fast easier ...
>
>
>
> From 8f91ef3555178149ad560b5424a9854b2ceee2d6 Mon Sep 17 00:00:00 2001
> From: David Hildenbrand <david@...hat.com>
> Date: Sat, 27 Aug 2022 10:44:13 +0200
> Subject: [PATCH] mm: rework PageAnonExclusive() interaction with GUP-fast
>
> commit 6c287605fd56 ("mm: remember exclusively mapped anonymous pages
> with PG_anon_exclusive") made sure that when PageAnonExclusive() has to be
> cleared during temporary unmapping of a page, the PTE is
> cleared/invalidated and the TLB is flushed.
>
> That handling was inspired by an outdated comment in
> mm/ksm.c:write_protect_page(), which similarly required the TLB flush in
> the past to synchronize with GUP-fast. However, ever since general RCU
> GUP-fast was introduced in commit 2667f50e8b81 ("mm: introduce a general
> RCU get_user_pages_fast()"), a TLB flush is no longer sufficient to
> synchronize with concurrent GUP-fast.
>
> Peter pointed out that the TLB flush is not required, and looking into the
> details it turns out that he's right. To synchronize with GUP-fast, it's
> sufficient to clear the PTE only: GUP-fast will either detect that the PTE
> changed or that PageAnonExclusive is not set and back off. However, we
> rely on a given memory order and should make sure that that order is
> always respected.
>
> Conceptually, GUP-fast pinning code of anon pages consists of:
> (1) Read the PTE
> (2) Pin the mapped page
> (3) Check if the PTE changed by re-reading it; back off if so.
> (4) Check if PageAnonExclusive is not set; back off if so.
>
> Conceptually, PageAnonExclusive clearing code consists of:
> (1) Clear PTE
> (2) Check if the page is pinned; back off if so.
> (3) Clear PageAnonExclusive
> (4) Restore PTE (optional)
>
> As GUP-fast temporarily pins the page before validating whether the PTE
> changed, and PageAnonExclusive clearing code clears the PTE before
> checking if the page is pinned, GUP-fast cannot end up pinning an anon
> page that is not exclusive.
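
Condensing the GUP-fast side as I read it with both patches applied (a
sketch only, not the literal gup_pte_range() code):

	pte = ptep_get_lockless(ptep);			/* (1) read the PTE */

	folio = try_grab_folio(page, 1, flags);		/* (2) pin the page */
	if (!folio)
		goto pte_unmap;

	/* Order (2) before (3); added by the previous patch. */
	smp_mb__after_atomic();

	if (unlikely(pte_val(pte) != pte_val(*ptep))) {	/* (3) re-check the PTE */
		gup_put_folio(folio, 1, flags);
		goto pte_unmap;
	}

	/*
	 * (4) back off if PageAnonExclusive() got cleared; the smp_rmb()
	 * in gup_must_unshare() orders (3) before (4).
	 */
	if (!pte_write(pte) && gup_must_unshare(flags, page)) {
		gup_put_folio(folio, 1, flags);
		goto pte_unmap;
	}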
>
> One corner case to consider is when we restore the PTE to the same value
> after PageAnonExclusive was cleared, as can happen in
> mm/ksm.c:write_protect_page(). In that case, GUP-fast might not detect
> a PTE change (because there was none). However, as restoring the PTE
> happens after clearing PageAnonExclusive, GUP-fast would detect that
> PageAnonExclusive was cleared in that case and would properly back off.
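
So for that corner case, the interleaving we rely on would be (sketch):

	CPU0: write_protect_page()		CPU1: GUP-fast
	--------------------------		--------------
	(1) clear PTE
	    smp_mb()
	(2) page_maybe_dma_pinned()? no
	(3) ClearPageAnonExclusive()
	    smp_mb__after_atomic()
	(4) restore the same PTE value
						(1) read PTE
						(2) pin the page
						(3) re-check PTE: unchanged!
						(4) PageAnonExclusive() clear
						    -> back off

And if the pin instead becomes visible before CPU0's check in (2), CPU0
backs off with -EBUSY. Either way we can't end up with a pin on a
non-exclusive anon page.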
>
> Let's document that, avoid the TLB flush where possible and use proper
> explicit memory barriers where required. We shouldn't really care about the
> additional memory barriers here, as we're not on extremely hot paths.
>
> The possible issues due to reordering are of a theoretical nature so far,
> but they had better be addressed.
>
> Note that we don't need a memory barrier between checking if the page is
> pinned and clearing PageAnonExclusive, because stores are not
> speculated.
>
> Fixes: 6c287605fd56 ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive")
> Signed-off-by: David Hildenbrand <david@...hat.com>
> ---
> include/linux/mm.h | 9 +++++--
> include/linux/rmap.h | 58 ++++++++++++++++++++++++++++++++++++++++----
> mm/huge_memory.c | 3 +++
> mm/ksm.c | 1 +
> mm/migrate_device.c | 22 +++++++----------
> mm/rmap.c | 11 +++++----
> 6 files changed, 79 insertions(+), 25 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 21f8b27bd9fd..f7e8f4b34fb5 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2975,8 +2975,8 @@ static inline int vm_fault_to_errno(vm_fault_t vm_fault, int foll_flags)
> * PageAnonExclusive() has to protect against concurrent GUP:
> * * Ordinary GUP: Using the PT lock
> * * GUP-fast and fork(): mm->write_protect_seq
> - * * GUP-fast and KSM or temporary unmapping (swap, migration):
> - * clear/invalidate+flush of the page table entry
> + * * GUP-fast and KSM or temporary unmapping (swap, migration): see
> + * page_try_share_anon_rmap()
> *
> * Must be called with the (sub)page that's actually referenced via the
> * page table entry, which might not necessarily be the head page for a
> @@ -2997,6 +2997,11 @@ static inline bool gup_must_unshare(unsigned int flags, struct page *page)
> */
> if (!PageAnon(page))
> return false;
> +
> + /* See page_try_share_anon_rmap() for GUP-fast details. */
> + if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) && irqs_disabled())
> + smp_rmb();
> +
> /*
> * Note that PageKsm() pages cannot be exclusive, and consequently,
> * cannot get pinned.
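
Per John's comment on the other patch, maybe spell out the pairing for
this smp_rmb() too?  My reading, as a comment sketch (please
double-check):

	/*
	 * Pairs with the smp_mb__after_atomic() after
	 * ClearPageAnonExclusive() in page_try_share_anon_rmap(): if we
	 * observe the restored PTE when re-checking it, we must also
	 * observe that PageAnonExclusive was cleared.  irqs_disabled()
	 * tells GUP-fast apart from ordinary GUP, which holds the PT
	 * lock instead.
	 */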
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index bf80adca980b..454c159f2aae 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -267,7 +267,7 @@ static inline int page_try_dup_anon_rmap(struct page *page, bool compound,
> * @page: the exclusive anonymous page to try marking possibly shared
> *
> * The caller needs to hold the PT lock and has to have the page table entry
> - * cleared/invalidated+flushed, to properly sync against GUP-fast.
> + * cleared/invalidated.
> *
> * This is similar to page_try_dup_anon_rmap(), however, not used during fork()
> * to duplicate a mapping, but instead to prepare for KSM or temporarily
> @@ -283,12 +283,60 @@ static inline int page_try_share_anon_rmap(struct page *page)
> {
> VM_BUG_ON_PAGE(!PageAnon(page) || !PageAnonExclusive(page), page);
>
> - /* See page_try_dup_anon_rmap(). */
> - if (likely(!is_device_private_page(page) &&
> - unlikely(page_maybe_dma_pinned(page))))
> - return -EBUSY;
> + /* device private pages cannot get pinned via GUP. */
> + if (unlikely(is_device_private_page(page))) {
> + ClearPageAnonExclusive(page);
> + return 0;
> + }
>
> + /*
> + * We have to make sure that while we clear PageAnonExclusive, that
> + * the page is not pinned and that concurrent GUP-fast won't succeed in
> + * concurrently pinning the page.
> + *
> + * Conceptually, GUP-fast pinning code of anon pages consists of:
> + * (1) Read the PTE
> + * (2) Pin the mapped page
> + * (3) Check if the PTE changed by re-reading it; back off if so.
> + * (4) Check if PageAnonExclusive is not set; back off if so.
> + *
> + * Conceptually, PageAnonExclusive clearing code consists of:
> + * (1) Clear PTE
> + * (2) Check if the page is pinned; back off if so.
> + * (3) Clear PageAnonExclusive
> + * (4) Restore PTE (optional)
> + *
> + * In GUP-fast, we have to make sure that (2),(3) and (4) happen in
> + * the right order. Memory order between (2) and (3) is handled by
> + * GUP-fast, independent of PageAnonExclusive.
> + *
> + * When clearing PageAnonExclusive(), we have to make sure that (1),
> + * (2), (3) and (4) happen in the right order.
> + *
> + * Note that (4) has to happen after (3) in both cases to handle the
> + * corner case whereby the PTE is restored to the original value after
> + * clearing PageAnonExclusive and while GUP-fast might not detect the
> + * PTE change, it will detect the PageAnonExclusive change.
> + *
> + * We assume that there might not be a memory barrier after
> + * clearing/invalidating the PTE (1) and before restoring the PTE (4),
> + * so we use explicit ones here.
> + *
> + * These memory barriers are paired with memory barriers in GUP-fast
> + * code, including gup_must_unshare().
> + */
> +
> +	/* Clear/invalidate the PTE before checking for pins. */
> + if (IS_ENABLED(CONFIG_HAVE_FAST_GUP))
> + smp_mb();
Wondering whether this could be smp_mb__before_atomic(). Though iiuc that
one is only defined for use around atomic RMW ops, and
page_maybe_dma_pinned() is a plain atomic_read(), so maybe not.
> +
> + if (unlikely(page_maybe_dma_pinned(page)))
> + return -EBUSY;
> ClearPageAnonExclusive(page);
> +
> + /* Clear PageAnonExclusive() before eventually restoring the PTE. */
> + if (IS_ENABLED(CONFIG_HAVE_FAST_GUP))
> + smp_mb__after_atomic();
> return 0;
> }
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index e9414ee57c5b..2aef8d76fcf2 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2140,6 +2140,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> *
> * In case we cannot clear PageAnonExclusive(), split the PMD
> * only and let try_to_migrate_one() fail later.
> + *
> + * See page_try_share_anon_rmap(): invalidate PMD first.
> */
> anon_exclusive = PageAnon(page) && PageAnonExclusive(page);
> if (freeze && anon_exclusive && page_try_share_anon_rmap(page))
> @@ -3177,6 +3179,7 @@ int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
> flush_cache_range(vma, address, address + HPAGE_PMD_SIZE);
> pmdval = pmdp_invalidate(vma, address, pvmw->pmd);
>
> + /* See page_try_share_anon_rmap(): invalidate PMD first. */
> anon_exclusive = PageAnon(page) && PageAnonExclusive(page);
> if (anon_exclusive && page_try_share_anon_rmap(page)) {
> set_pmd_at(mm, address, pvmw->pmd, pmdval);
> diff --git a/mm/ksm.c b/mm/ksm.c
> index d7526c705081..971cf923c0eb 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -1091,6 +1091,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
> goto out_unlock;
> }
>
> + /* See page_try_share_anon_rmap(): clear PTE first. */
> if (anon_exclusive && page_try_share_anon_rmap(page)) {
> set_pte_at(mm, pvmw.address, pvmw.pte, entry);
> goto out_unlock;
> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> index 27fb37d65476..47e955212f15 100644
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -193,20 +193,16 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> bool anon_exclusive;
> pte_t swp_pte;
>
flush_cache_page() missing here?

Better copy Alistair too when posting formally, since this will have a
slight conflict with the other thread.
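
I.e., I'd expect we still want something like this before the clear
(sketch, reusing the removed line):

	flush_cache_page(vma, addr, pte_pfn(*ptep));
	ptep_get_and_clear(mm, addr, ptep);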
> + ptep_get_and_clear(mm, addr, ptep);
> +
> + /* See page_try_share_anon_rmap(): clear PTE first. */
> anon_exclusive = PageAnon(page) && PageAnonExclusive(page);
> - if (anon_exclusive) {
> - flush_cache_page(vma, addr, pte_pfn(*ptep));
> - ptep_clear_flush(vma, addr, ptep);
> -
> - if (page_try_share_anon_rmap(page)) {
> - set_pte_at(mm, addr, ptep, pte);
> - unlock_page(page);
> - put_page(page);
> - mpfn = 0;
> - goto next;
> - }
> - } else {
> - ptep_get_and_clear(mm, addr, ptep);
> + if (anon_exclusive && page_try_share_anon_rmap(page)) {
> + set_pte_at(mm, addr, ptep, pte);
> + unlock_page(page);
> + put_page(page);
> + mpfn = 0;
> + goto next;
> }
>
> migrate->cpages++;
Thanks,
--
Peter Xu