[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ac02593b-9272-df8c-120f-611b5489458a@redhat.com>
Date: Wed, 31 Aug 2022 09:44:56 +0200
From: David Hildenbrand <david@...hat.com>
To: Jason Gunthorpe <jgg@...dia.com>
Cc: John Hubbard <jhubbard@...dia.com>, linux-kernel@...r.kernel.org,
linux-mm@...ck.org, Andrew Morton <akpm@...ux-foundation.org>,
Mel Gorman <mgorman@...e.de>,
"Matthew Wilcox (Oracle)" <willy@...radead.org>,
Andrea Arcangeli <aarcange@...hat.com>,
Hugh Dickins <hughd@...gle.com>, Peter Xu <peterx@...hat.com>
Subject: Re: [PATCH v1 2/3] mm/gup: use gup_can_follow_protnone() also in
GUP-fast
On 31.08.22 01:44, Jason Gunthorpe wrote:
> On Tue, Aug 30, 2022 at 09:23:44PM +0200, David Hildenbrand wrote:
>> @@ -2997,6 +2997,11 @@ static inline bool gup_must_unshare(unsigned int flags, struct page *page)
>> */
>> if (!PageAnon(page))
>> return false;
>> +
>> + /* See page_try_share_anon_rmap() for GUP-fast details. */
>> + if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) && irqs_disabled())
>> + smp_rmb();
>> +
>> /*
>> * Note that PageKsm() pages cannot be exclusive, and consequently,
>> * cannot get pinned.
>> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
>> index bf80adca980b..454c159f2aae 100644
>> --- a/include/linux/rmap.h
>> +++ b/include/linux/rmap.h
>> @@ -267,7 +267,7 @@ static inline int page_try_dup_anon_rmap(struct page *page, bool compound,
>> * @page: the exclusive anonymous page to try marking possibly shared
>> *
>> * The caller needs to hold the PT lock and has to have the page table entry
>> - * cleared/invalidated+flushed, to properly sync against GUP-fast.
>> + * cleared/invalidated.
>> *
>> * This is similar to page_try_dup_anon_rmap(), however, not used during fork()
>> * to duplicate a mapping, but instead to prepare for KSM or temporarily
>> @@ -283,12 +283,60 @@ static inline int page_try_share_anon_rmap(struct page *page)
>> {
>> VM_BUG_ON_PAGE(!PageAnon(page) || !PageAnonExclusive(page), page);
>>
>> - /* See page_try_dup_anon_rmap(). */
>> - if (likely(!is_device_private_page(page) &&
>> - unlikely(page_maybe_dma_pinned(page))))
>> - return -EBUSY;
>> + /* device private pages cannot get pinned via GUP. */
>> + if (unlikely(is_device_private_page(page))) {
>> + ClearPageAnonExclusive(page);
>> + return 0;
>> + }
>>
>> + /*
>> + * We have to make sure that while we clear PageAnonExclusive, that
>> + * the page is not pinned and that concurrent GUP-fast won't succeed in
>> + * concurrently pinning the page.
>> + *
>> + * Conceptually, GUP-fast pinning code of anon pages consists of:
>> + * (1) Read the PTE
>> + * (2) Pin the mapped page
>> + * (3) Check if the PTE changed by re-reading it; back off if so.
>> + * (4) Check if PageAnonExclusive is not set; back off if so.
>> + *
>> + * Conceptually, PageAnonExclusive clearing code consists of:
>> + * (1) Clear PTE
>> + * (2) Check if the page is pinned; back off if so.
>> + * (3) Clear PageAnonExclusive
>> + * (4) Restore PTE (optional)
>> + *
>> + * In GUP-fast, we have to make sure that (2),(3) and (4) happen in
>> + * the right order. Memory order between (2) and (3) is handled by
>> + * GUP-fast, independent of PageAnonExclusive.
>> + *
>> + * When clearing PageAnonExclusive(), we have to make sure that (1),
>> + * (2), (3) and (4) happen in the right order.
>> + *
>> + * Note that (4) has to happen after (3) in both cases to handle the
>> + * corner case whereby the PTE is restored to the original value after
>> + * clearing PageAnonExclusive and while GUP-fast might not detect the
>> + * PTE change, it will detect the PageAnonExclusive change.
>> + *
>> + * We assume that there might not be a memory barrier after
>> + * clearing/invalidating the PTE (1) and before restoring the PTE (4),
>> + * so we use explicit ones here.
>> + *
>> + * These memory barriers are paired with memory barriers in GUP-fast
>> + * code, including gup_must_unshare().
>> + */
>> +
>> + /* Clear/invalidate the PTE before checking for PINs. */
>> + if (IS_ENABLED(CONFIG_HAVE_FAST_GUP))
>> + smp_mb();
>> +
>> + if (unlikely(page_maybe_dma_pinned(page)))
>> + return -EBUSY;
>
> It is usually a bad sign to see an attempt to create a "read release"..
I still have to get used to the acquire/release semantics ... :)
>
>> ClearPageAnonExclusive(page);
>> +
>> + /* Clear PageAnonExclusive() before eventually restoring the PTE. */
>> + if (IS_ENABLED(CONFIG_HAVE_FAST_GUP))
>> + smp_mb__after_atomic();
>> return 0;
>> }
>
> I don't know enough about the memory model to say if this is OK..
I guess it's best to include some memory model folks once we have something
that looks reasonable.
>
> Generally, I've never seen an algorithm be successfull with these
> kinds of multi-atomic gyrations.
Yeah, I'm absolutely looking for a nicer alternative to sync with
RCU GUP-fast. So far I wasn't successful.
>
> If we break it down a bit, and replace the 'read release' with an
> actual atomic for discussion:
>
>
> CPU0 CPU1
> clear pte
> incr_return ref // release & acquire
> add_ref // acquire
>
> This seems OK, if CPU1 views !dma then CPU0 must view clear pte due to
> the atomic's release/acquire semantic
>
> If CPU1 views dma then it just exits
>
>
> Now the second phase:
>
> CPU0 CPU1
> clear anon_exclusive
> restore pte // release
>
> read_pte // acquire
> read anon_exclusive
>
> If CPU0 observes the restored PTE then it must observe the cleared
> anon_exclusive
>
> Otherwise CPU0 must observe the cleared PTE.
>
> So, maybe I could convince myself it is OK, but I think your placement
> of barriers is confusing as to what data the barrier is actually
> linked to.
>
> We are using a barrier around the ref - acquire on the CPU0 and full
> barier on the CPU1 (eg atomic_read(); smb_mb_after_atomic() )
When dropping the other patch, I think we still need something like
diff --git a/mm/gup.c b/mm/gup.c
index 5abdaf487460..8c5ff41d5e56 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -158,6 +158,13 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
else
folio_ref_add(folio,
refs * (GUP_PIN_COUNTING_BIAS - 1));
+ /*
+ * Adjust the pincount before re-checking the PTE for changes.
+ *
+ * Paired with a memory barrier in page_try_share_anon_rmap().
+ */
+ smb_mb__after_atomic();
+
node_stat_mod_folio(folio, NR_FOLL_PIN_ACQUIRED, refs);
return folio;
>
> The second phase uses a smp_store_release/load_acquire on the PTE.
>
> It is the same barriers you sketched but integrated with the data they
> are ordering.
Sorry for having to ask, but what exactly would be your suggestion?
Thanks for having a look!
--
Thanks,
David / dhildenb
Powered by blists - more mailing lists