Message-ID: <CAGsJ_4y98H-8aK9r_5YrSPV=SCU=-rZf7YPMz32K0C8oFnUCNA@mail.gmail.com>
Date: Wed, 13 Aug 2025 17:03:02 +0800
From: Barry Song <21cnbao@...il.com>
To: Lokesh Gidra <lokeshgidra@...gle.com>
Cc: Peter Xu <peterx@...hat.com>, akpm@...ux-foundation.org, aarcange@...hat.com,
linux-mm@...ck.org, linux-kernel@...r.kernel.org, ngeoffray@...gle.com,
Suren Baghdasaryan <surenb@...gle.com>, Kalesh Singh <kaleshsingh@...gle.com>,
Barry Song <v-songbaohua@...o.com>, David Hildenbrand <david@...hat.com>
Subject: Re: [PATCH v4] userfaultfd: opportunistic TLB-flush batching for
present pages in MOVE
On Tue, Aug 12, 2025 at 11:44 PM Lokesh Gidra <lokeshgidra@...gle.com> wrote:
>
> On Tue, Aug 12, 2025 at 7:44 AM Peter Xu <peterx@...hat.com> wrote:
> >
> > On Mon, Aug 11, 2025 at 11:55:36AM +0800, Barry Song wrote:
> > > Hi Lokesh,
[...]
> > > >
> > > > mm/userfaultfd.c | 178 +++++++++++++++++++++++++++++++++--------------
> > > > 1 file changed, 127 insertions(+), 51 deletions(-)
> > > >
> > > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > > > index cbed91b09640..39d81d2972db 100644
> > > > --- a/mm/userfaultfd.c
> > > > +++ b/mm/userfaultfd.c
> > > > @@ -1026,18 +1026,64 @@ static inline bool is_pte_pages_stable(pte_t *dst_pte, pte_t *src_pte,
> > > >                pmd_same(dst_pmdval, pmdp_get_lockless(dst_pmd));
> > > > }
> > > >
> > > > -static int move_present_pte(struct mm_struct *mm,
> > > > -                            struct vm_area_struct *dst_vma,
> > > > -                            struct vm_area_struct *src_vma,
> > > > -                            unsigned long dst_addr, unsigned long src_addr,
> > > > -                            pte_t *dst_pte, pte_t *src_pte,
> > > > -                            pte_t orig_dst_pte, pte_t orig_src_pte,
> > > > -                            pmd_t *dst_pmd, pmd_t dst_pmdval,
> > > > -                            spinlock_t *dst_ptl, spinlock_t *src_ptl,
> > > > -                            struct folio *src_folio)
> > > > +/*
> > > > + * Checks if the two ptes and the corresponding folio are eligible for batched
> > > > + * move. If so, then returns pointer to the locked folio. Otherwise, returns NULL.
> > > > + *
> > > > + * NOTE: folio's reference is not required as the whole operation is within
> > > > + * PTL's critical section.
> > > > + */
> > > > +static struct folio *check_ptes_for_batched_move(struct vm_area_struct *src_vma,
> > > > +                                                 unsigned long src_addr,
> > > > +                                                 pte_t *src_pte, pte_t *dst_pte,
> > > > +                                                 struct anon_vma *src_anon_vma)
> > > > +{
> > > > +        pte_t orig_dst_pte, orig_src_pte;
> > > > +        struct folio *folio;
> > > > +
> > > > +        orig_dst_pte = ptep_get(dst_pte);
> > > > +        if (!pte_none(orig_dst_pte))
> > > > +                return NULL;
> > > > +
> > > > +        orig_src_pte = ptep_get(src_pte);
> > > > +        if (!pte_present(orig_src_pte) || is_zero_pfn(pte_pfn(orig_src_pte)))
> > > > +                return NULL;
> > > > +
> > > > +        folio = vm_normal_folio(src_vma, src_addr, orig_src_pte);
> > > > +        if (!folio || !folio_trylock(folio))
> > > > +                return NULL;
> > > > +        if (!PageAnonExclusive(&folio->page) || folio_test_large(folio) ||
> > > > +            folio_anon_vma(folio) != src_anon_vma) {
> > > > +                folio_unlock(folio);
> > > > +                return NULL;
> > > > +        }
> > > > +        return folio;
> > > > +}
> > > > +
> > >
> > > I’m still quite confused by the code. Before move_present_ptes(), we’ve
> > > already performed all the checks—pte_same(), vm_normal_folio(),
> > > folio_trylock(), folio_test_large(), folio_get_anon_vma(),
> > > and anon_vma_lock_write()—at least for the first PTE. Now we’re
> > > duplicating them again for all PTEs. Does this mean we’re doing those
> > > operations for the first PTE twice? It feels like the old non-batch check
> > > code should be removed?
> >
> > This function should only start to work on the 2nd (or more) continuous
> > ptes to move within the same pgtable lock held. We'll still need the
> > original path because that was sleepable, this one isn't, and it's only
> > best-effort fast path only. E.g. if trylock() fails above, it would
> > fallback to the slow path.
> >
> Thanks Peter. I was about to give exactly the same reasoning :)

Apologies, I overlooked this part:

        src_addr += PAGE_SIZE;
        if (src_addr == addr_end)
                break;
        dst_addr += PAGE_SIZE;
        dst_pte++;
        src_pte++;
        folio_unlock(src_folio);
        src_folio = check_ptes_for_batched_move(src_vma,
                                src_addr, src_pte,
                                dst_pte, src_anon_vma);

I still find this a little tricky to follow. Couldn’t we just handle it
like the other batched cases:

        static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
                        struct page_vma_mapped_walk *pvmw,
                        enum ttu_flags flags, pte_t pte)

We pass the first PTE and use a function to determine how many PTEs we
can batch together. That way, we don’t need a special path for the first
PTE.

I guess the challenge is that the first PTE needs to handle
split_folio(), folio_trylock() with -EAGAIN, and
anon_vma_trylock_write(), while the other PTEs don’t?

If so, could we add a clear comment explaining that move_present_ptes()
only batches PTEs whose folios share the same anon_vma as the first PTE,
are not large folios, and can be locked with folio_trylock()? If any of
these conditions isn’t met, the batch stops.

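To illustrate the rule, here is a minimal userspace sketch; it is not
kernel code and not taken from the patch. struct mock_pte, its fields,
mock_pte_batchable() and count_batchable_ptes() are hypothetical names
that only model the checks in the quoted check_ptes_for_batched_move().
The sketch assumes the first entry was already validated by the regular
(sleepable) path, so only the following entries are checked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the state checked on each src/dst PTE pair. */
struct mock_pte {
	bool dst_none;       /* destination PTE is empty */
	bool src_present;    /* source PTE maps a present, non-zero page */
	bool anon_exclusive; /* page is exclusive to this mapping */
	bool large_folio;    /* page belongs to a large folio */
	int anon_vma_id;     /* stands in for the folio's anon_vma pointer */
	bool trylock_ok;     /* whether a trylock on the folio would succeed */
};

/* Models the per-entry eligibility checks of the quoted helper. */
static bool mock_pte_batchable(const struct mock_pte *p, int first_anon_vma_id)
{
	return p->dst_none && p->src_present && p->trylock_ok &&
	       p->anon_exclusive && !p->large_folio &&
	       p->anon_vma_id == first_anon_vma_id;
}

/*
 * Count how many consecutive entries, starting at ptes[0], can be moved
 * in one batch. ptes[0] is assumed to be validated by the caller, so it
 * always counts. Every later entry must share the first entry's anon_vma,
 * must not be a large folio, and must be lockable without sleeping.
 * The batch stops at the first entry that fails any of these checks.
 */
static size_t count_batchable_ptes(const struct mock_pte *ptes, size_t max_nr)
{
	size_t nr = 1;

	assert(ptes != NULL && max_nr > 0);

	while (nr < max_nr &&
	       mock_pte_batchable(&ptes[nr], ptes[0].anon_vma_id))
		nr++;
	return nr;
}
```

With four entries where the third one maps a large folio, this returns a
batch of two, and the third entry would go back through the regular
single-PTE path.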
Thanks
Barry