Message-ID: <9a4fe603-950e-785b-6281-2e309256463f@nvidia.com>
Date: Tue, 30 Aug 2022 12:18:41 -0700
From: John Hubbard <jhubbard@...dia.com>
To: David Hildenbrand <david@...hat.com>,
Jason Gunthorpe <jgg@...dia.com>
Cc: linux-kernel@...r.kernel.org, linux-mm@...ck.org,
Andrew Morton <akpm@...ux-foundation.org>,
Mel Gorman <mgorman@...e.de>,
"Matthew Wilcox (Oracle)" <willy@...radead.org>,
Andrea Arcangeli <aarcange@...hat.com>,
Hugh Dickins <hughd@...gle.com>, Peter Xu <peterx@...hat.com>
Subject: Re: [PATCH v1 2/3] mm/gup: use gup_can_follow_protnone() also in
GUP-fast
On 8/30/22 11:53, David Hildenbrand wrote:
> Good, I managed to attract the attention of someone who understands that machinery :)
>
> While validating whether GUP-fast and PageAnonExclusive code work correctly,
> I started looking at the whole RCU GUP-fast machinery. I do have a patch to
> improve PageAnonExclusive clearing (I think we're missing memory barriers to
> make it work as expected in any possible case), but I also eventually stumbled
> over a more generic issue that might need memory barriers.
>
> Any thoughts whether I am missing something or this is actually missing
> memory barriers?
>
It's actually missing memory barriers.
In fact, others have had that same thought! [1] :) In that 2019 thread,
I recall that this got dismissed because of a focus on the IPI-based
aspect of GUP-fast synchronization (there was some hand waving, perhaps
accurate waving, about memory barriers vs. CPU interrupts). But now that
the RCU (non-IPI) implementation is more widely used than it used to be,
the issue is clearer.
>
> From ce8c941c11d1f60cea87a3e4d941041dc6b79900 Mon Sep 17 00:00:00 2001
> From: David Hildenbrand <david@...hat.com>
> Date: Mon, 29 Aug 2022 16:57:07 +0200
> Subject: [PATCH] mm/gup: update refcount+pincount before testing if the PTE
> changed
>
> mm/ksm.c:write_protect_page() has to make sure that no unknown
> references to a mapped page exist and that no additional ones with write
> permissions are possible -- unknown references could have write permissions
> and modify the page afterwards.
>
> Conceptually, mm/ksm.c:write_protect_page() consists of:
> (1) Clear/invalidate PTE
> (2) Check if there are unknown references; back off if so.
> (3) Update PTE (e.g., map it R/O)
>
> Conceptually, GUP-fast code consists of:
> (1) Read the PTE
> (2) Increment refcount/pincount of the mapped page
> (3) Check if the PTE changed by re-reading it; back off if so.
>
> To make sure GUP-fast won't be able to grab additional references after
> clearing the PTE, but will properly detect the change and back off, we
> need a memory barrier between updating the refcount/pincount and checking
> if the PTE changed.
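
To spell out the failure mode (as I read the two sequences above, this
is the classic store-buffering pattern: a store followed by a load on
each side, and no full barrier on the GUP-fast side):

    CPU 1 (GUP-fast)                    CPU 0 (write_protect_page)
    ----------------                    --------------------------
    (1) read PTE: present, writable
    (2) inc refcount/pincount
        [not yet visible to CPU 0]
    (3) re-read PTE: unchanged,
        so no back-off
                                        (1) clear PTE + flush TLB
                                        (2) check references: the
                                            increment from CPU 1 is
                                            still not visible, so
                                            "no unknown references"
                                        (3) map the PTE R/O
        [increment becomes visible
         only now]

Both sides proceed, and GUP-fast ends up holding a (possibly writable)
reference that write_protect_page() never saw. A full barrier between
(2) and (3) on the GUP-fast side, paired with the barrier implied by
the TLB flush on the other side, forbids that outcome.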
>
> try_grab_folio() doesn't necessarily imply a memory barrier, so add an
> explicit smp_mb__after_atomic() after the atomic RMW operation to
> increment the refcount and pincount.
>
> ptep_clear_flush(), which is used to clear the PTE and flush the TLB, should
> imply a memory barrier as part of the TLB flush, so don't add another one for
> now.
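
And in case it helps review, this is how I'd reduce the two sides to a
small userspace analogue (purely a sketch: the names are made up, C11
atomics stand in for the kernel primitives, and main() just runs the two
sides back to back rather than racing them):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static _Atomic unsigned long pte;	/* stand-in for the PTE */
static atomic_long refcount;		/* stand-in for the folio refcount */

/* GUP-fast side: take a reference, then verify the PTE did not change. */
static bool gup_fast_side(void)
{
	unsigned long old = atomic_load(&pte);		/* (1) read PTE */

	atomic_fetch_add(&refcount, 1);			/* (2) grab reference */
	atomic_thread_fence(memory_order_seq_cst);	/* the smp_mb__after_atomic() */

	if (atomic_load(&pte) != old) {			/* (3) recheck PTE */
		atomic_fetch_sub(&refcount, 1);		/* back off */
		return false;
	}
	return true;
}

/* write_protect_page() side: invalidate, then look for unknown references. */
static bool write_protect_side(long known_refs)
{
	atomic_store(&pte, 0);				/* (1) clear PTE */
	atomic_thread_fence(memory_order_seq_cst);	/* what the TLB flush in
							   ptep_clear_flush() is
							   relied on to provide */

	if (atomic_load(&refcount) != known_refs)	/* (2) unknown references? */
		return false;				/* back off */

	atomic_store(&pte, 1);				/* (3) remap "R/O" */
	return true;
}

int main(void)
{
	atomic_store(&pte, 2);			/* pretend a writable PTE exists */

	bool gup_ok = gup_fast_side();
	bool wp_ok  = write_protect_side(0);	/* backs off: GUP-fast holds a ref */

	printf("gup-fast: %d, write-protect: %d\n", gup_ok, wp_ok);
	return 0;
}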
>
> PageAnonExclusive handling requires further care and will be handled
> separately.
>
> Fixes: 2667f50e8b81 ("mm: introduce a general RCU get_user_pages_fast()")
> Signed-off-by: David Hildenbrand <david@...hat.com>
> ---
> mm/gup.c | 17 +++++++++++++++++
> 1 file changed, 17 insertions(+)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index 5abdaf487460..0008b808f484 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2392,6 +2392,14 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
> goto pte_unmap;
> }
>
> + /*
> + * Update refcount/pincount before testing for changed PTE. This
> + * is required for code like mm/ksm.c:write_protect_page() that
> + * wants to make sure that a page has no unknown references
> + * after clearing the PTE.
> + */
> + smp_mb__after_atomic();
> +
> if (unlikely(pte_val(pte) != pte_val(*ptep))) {
> gup_put_folio(folio, 1, flags);
> goto pte_unmap;
> @@ -2577,6 +2585,9 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
> if (!folio)
> return 0;
>
> + /* See gup_pte_range(). */
Don't we usually also identify what each mb pairs with, in the comments? That would help.
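Something along these lines, maybe (wording is only a suggestion), for
the one in gup_pte_range() that the other call sites point at:

	/*
	 * Pairs with the barrier implied by the TLB flush in
	 * ptep_clear_flush() on the mm/ksm.c:write_protect_page() side:
	 * the refcount/pincount update must be visible before the PTE is
	 * re-read below.
	 */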
> + smp_mb__after_atomic();
> +
> if (unlikely(pte_val(pte) != pte_val(*ptep))) {
> gup_put_folio(folio, refs, flags);
> return 0;
> @@ -2643,6 +2654,9 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
> if (!folio)
> return 0;
>
> + /* See gup_pte_range(). */
> + smp_mb__after_atomic();
> +
> if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
> gup_put_folio(folio, refs, flags);
> return 0;
> @@ -2683,6 +2697,9 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
> if (!folio)
> return 0;
>
> + /* See gup_pte_range(). */
> + smp_mb__after_atomic();
> +
> if (unlikely(pud_val(orig) != pud_val(*pudp))) {
> gup_put_folio(folio, refs, flags);
> return 0;
[1] https://lore.kernel.org/lkml/9465df76-0229-1b44-5646-5cced1bc1718@nvidia.com/
thanks,
--
John Hubbard
NVIDIA