linux-kernel - Re: [PATCH v4] userfaultfd: opportunistic TLB-flush batching for present pages in MOVE

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAGsJ_4y98H-8aK9r_5YrSPV=SCU=-rZf7YPMz32K0C8oFnUCNA@mail.gmail.com>
Date: Wed, 13 Aug 2025 17:03:02 +0800
From: Barry Song <21cnbao@...il.com>
To: Lokesh Gidra <lokeshgidra@...gle.com>
Cc: Peter Xu <peterx@...hat.com>, akpm@...ux-foundation.org, aarcange@...hat.com, 
	linux-mm@...ck.org, linux-kernel@...r.kernel.org, ngeoffray@...gle.com, 
	Suren Baghdasaryan <surenb@...gle.com>, Kalesh Singh <kaleshsingh@...gle.com>, 
	Barry Song <v-songbaohua@...o.com>, David Hildenbrand <david@...hat.com>
Subject: Re: [PATCH v4] userfaultfd: opportunistic TLB-flush batching for
 present pages in MOVE

On Tue, Aug 12, 2025 at 11:44 PM Lokesh Gidra <lokeshgidra@...gle.com> wrote:
>
> On Tue, Aug 12, 2025 at 7:44 AM Peter Xu <peterx@...hat.com> wrote:
> >
> > On Mon, Aug 11, 2025 at 11:55:36AM +0800, Barry Song wrote:
> > > Hi Lokesh,
[...]
> > > >
> > > >  mm/userfaultfd.c | 178 +++++++++++++++++++++++++++++++++--------------
> > > >  1 file changed, 127 insertions(+), 51 deletions(-)
> > > >
> > > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > > > index cbed91b09640..39d81d2972db 100644
> > > > --- a/mm/userfaultfd.c
> > > > +++ b/mm/userfaultfd.c
> > > > @@ -1026,18 +1026,64 @@ static inline bool is_pte_pages_stable(pte_t *dst_pte, pte_t *src_pte,
> > > >                pmd_same(dst_pmdval, pmdp_get_lockless(dst_pmd));
> > > >  }
> > > >
> > > > -static int move_present_pte(struct mm_struct *mm,
> > > > -                           struct vm_area_struct *dst_vma,
> > > > -                           struct vm_area_struct *src_vma,
> > > > -                           unsigned long dst_addr, unsigned long src_addr,
> > > > -                           pte_t *dst_pte, pte_t *src_pte,
> > > > -                           pte_t orig_dst_pte, pte_t orig_src_pte,
> > > > -                           pmd_t *dst_pmd, pmd_t dst_pmdval,
> > > > -                           spinlock_t *dst_ptl, spinlock_t *src_ptl,
> > > > -                           struct folio *src_folio)
> > > > +/*
> > > > + * Checks if the two ptes and the corresponding folio are eligible for batched
> > > > + * move. If so, then returns pointer to the locked folio. Otherwise, returns NULL.
> > > > + *
> > > > + * NOTE: folio's reference is not required as the whole operation is within
> > > > + * PTL's critical section.
> > > > + */
> > > > +static struct folio *check_ptes_for_batched_move(struct vm_area_struct *src_vma,
> > > > +                                                unsigned long src_addr,
> > > > +                                                pte_t *src_pte, pte_t *dst_pte,
> > > > +                                                struct anon_vma *src_anon_vma)
> > > > +{
> > > > +       pte_t orig_dst_pte, orig_src_pte;
> > > > +       struct folio *folio;
> > > > +
> > > > +       orig_dst_pte = ptep_get(dst_pte);
> > > > +       if (!pte_none(orig_dst_pte))
> > > > +               return NULL;
> > > > +
> > > > +       orig_src_pte = ptep_get(src_pte);
> > > > +       if (!pte_present(orig_src_pte) || is_zero_pfn(pte_pfn(orig_src_pte)))
> > > > +               return NULL;
> > > > +
> > > > +       folio = vm_normal_folio(src_vma, src_addr, orig_src_pte);
> > > > +       if (!folio || !folio_trylock(folio))
> > > > +               return NULL;
> > > > +       if (!PageAnonExclusive(&folio->page) || folio_test_large(folio) ||
> > > > +           folio_anon_vma(folio) != src_anon_vma) {
> > > > +               folio_unlock(folio);
> > > > +               return NULL;
> > > > +       }
> > > > +       return folio;
> > > > +}
> > > > +
> > >
> > > I’m still quite confused by the code. Before move_present_ptes(), we’ve
> > > already performed all the checks—pte_same(), vm_normal_folio(),
> > > folio_trylock(), folio_test_large(), folio_get_anon_vma(),
> > > and anon_vma_lock_write()—at least for the first PTE. Now we’re
> > > duplicating them again for all PTEs. Does this mean we’re doing those
> > > operations for the first PTE twice? It feels like the old non-batch check
> > > code should be removed?
> >
> > This function should only start to work on the 2nd (or more) continuous
> > ptes to move within the same pgtable lock held.  We'll still need the
> > original path because that was sleepable, this one isn't, and it's only
> > best-effort fast path only. E.g. if trylock() fails above, it would
> > fallback to the slow path.
> >
> Thanks Peter. I was about to give exactly the same reasoning :)

Apologies, I overlooked this part:
                src_addr += PAGE_SIZE;
                if (src_addr == addr_end)
                        break;
                dst_addr += PAGE_SIZE;
                dst_pte++;
                src_pte++;
                folio_unlock(src_folio);
                src_folio = check_ptes_for_batched_move(src_vma,
src_addr, src_pte,
                                                        dst_pte, src_anon_vma);

I still find this a little tricky to follow — couldn’t we just handle it
like the other batched cases:

static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
                        struct page_vma_mapped_walk *pvmw,
                        enum ttu_flags flags, pte_t pte)

We pass the first PTE and use a function to determine how many PTEs we
can batch together. That way, we don’t need a special path for the first
PTE.

I guess the challenge is that the first PTE needs to handle
split_folio(), folio_trylock() with -EAGAIN, and
anon_vma_trylock_write(), while the other PTEs don’t?

If so, could we add a clear comment explaining that move_present_ptes()
moves PTEs that share the same anon_vma as the first PTE, are not large
folios, and can successfully take folio_trylock()?
If this condition isn’t met, the batch stops.

Thanks
Barry