Message-ID: <CA+EESO763JtY3jjmgGGbJjSQcnJJFNpDs2iPaUwjP44VpyyzdQ@mail.gmail.com>
Date: Tue, 12 Aug 2025 08:44:44 -0700
From: Lokesh Gidra <lokeshgidra@...gle.com>
To: Peter Xu <peterx@...hat.com>
Cc: Barry Song <21cnbao@...il.com>, akpm@...ux-foundation.org, aarcange@...hat.com,
linux-mm@...ck.org, linux-kernel@...r.kernel.org, ngeoffray@...gle.com,
Suren Baghdasaryan <surenb@...gle.com>, Kalesh Singh <kaleshsingh@...gle.com>,
Barry Song <v-songbaohua@...o.com>, David Hildenbrand <david@...hat.com>
Subject: Re: [PATCH v4] userfaultfd: opportunistic TLB-flush batching for
present pages in MOVE
On Tue, Aug 12, 2025 at 7:44 AM Peter Xu <peterx@...hat.com> wrote:
>
> On Mon, Aug 11, 2025 at 11:55:36AM +0800, Barry Song wrote:
> > Hi Lokesh,
> >
> >
> > On Sun, Aug 10, 2025 at 2:29 PM Lokesh Gidra <lokeshgidra@...gle.com> wrote:
> > >
> > > MOVE ioctl's runtime is dominated by TLB-flush cost, which is required
> > > for moving present pages. Mitigate this cost by opportunistically
> > > batching present contiguous pages for TLB flushing.
> > >
> > > Without batching, in our testing on an arm64 Android device with UFFD GC,
> > > which uses MOVE ioctl for compaction, we observed that out of the total
> > > time spent in move_pages_pte(), over 40% is in ptep_clear_flush(), and
> > > ~20% in vm_normal_folio().
> > >
> > > With batching, the proportion of vm_normal_folio() increases to over
> > > 70% of move_pages_pte()'s runtime, even though vm_normal_folio() itself
> > > is unchanged.
> > > Furthermore, time spent within move_pages_pte() is only ~20%, which
> > > includes TLB-flush overhead.
> > >
> > > Cc: Suren Baghdasaryan <surenb@...gle.com>
> > > Cc: Kalesh Singh <kaleshsingh@...gle.com>
> > > Cc: Barry Song <v-songbaohua@...o.com>
> > > Cc: David Hildenbrand <david@...hat.com>
> > > Cc: Peter Xu <peterx@...hat.com>
> > > Signed-off-by: Lokesh Gidra <lokeshgidra@...gle.com>
> > > ---
> > > Changes since v3 [1]
> > > - Fix uninitialized 'step_size' warning, per Dan Carpenter
> > > - Removed pmd_none() from check_ptes_for_batched_move(), per Peter Xu
> > > - Removed flush_cache_range() in zero-page case, per Peter Xu
> > > - Added comment to explain why folio reference for batched pages is not
> > > required, per Peter Xu
> > > - Use MIN() in calculation of largest extent that can be batched under
> > > the same src and dst PTLs, per Peter Xu
> > > - Release first folio's reference in move_present_ptes(), per Peter Xu
> > >
> > > Changes since v2 [2]
> > > - Addressed VM_WARN_ON failure, per Lorenzo Stoakes
> > > - Added check to ensure all batched pages share the same anon_vma
> > >
> > > Changes since v1 [3]
> > > - Removed flush_tlb_batched_pending(), per Barry Song
> > > - Unified single and multi page case, per Barry Song
> > >
> > > [1] https://lore.kernel.org/all/20250807103902.2242717-1-lokeshgidra@google.com/
> > > [2] https://lore.kernel.org/all/20250805121410.1658418-1-lokeshgidra@google.com/
> > > [3] https://lore.kernel.org/all/20250731104726.103071-1-lokeshgidra@google.com/
> > >
> > > mm/userfaultfd.c | 178 +++++++++++++++++++++++++++++++++--------------
> > > 1 file changed, 127 insertions(+), 51 deletions(-)
> > >
> > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > > index cbed91b09640..39d81d2972db 100644
> > > --- a/mm/userfaultfd.c
> > > +++ b/mm/userfaultfd.c
> > > @@ -1026,18 +1026,64 @@ static inline bool is_pte_pages_stable(pte_t *dst_pte, pte_t *src_pte,
> > > pmd_same(dst_pmdval, pmdp_get_lockless(dst_pmd));
> > > }
> > >
> > > -static int move_present_pte(struct mm_struct *mm,
> > > - struct vm_area_struct *dst_vma,
> > > - struct vm_area_struct *src_vma,
> > > - unsigned long dst_addr, unsigned long src_addr,
> > > - pte_t *dst_pte, pte_t *src_pte,
> > > - pte_t orig_dst_pte, pte_t orig_src_pte,
> > > - pmd_t *dst_pmd, pmd_t dst_pmdval,
> > > - spinlock_t *dst_ptl, spinlock_t *src_ptl,
> > > - struct folio *src_folio)
> > > +/*
> > > + * Checks if the two ptes and the corresponding folio are eligible for batched
> > > + * move. If so, returns a pointer to the locked folio; otherwise, returns NULL.
> > > + *
> > > + * NOTE: folio's reference is not required as the whole operation is within
> > > + * PTL's critical section.
> > > + */
> > > +static struct folio *check_ptes_for_batched_move(struct vm_area_struct *src_vma,
> > > + unsigned long src_addr,
> > > + pte_t *src_pte, pte_t *dst_pte,
> > > + struct anon_vma *src_anon_vma)
> > > +{
> > > + pte_t orig_dst_pte, orig_src_pte;
> > > + struct folio *folio;
> > > +
> > > + orig_dst_pte = ptep_get(dst_pte);
> > > + if (!pte_none(orig_dst_pte))
> > > + return NULL;
> > > +
> > > + orig_src_pte = ptep_get(src_pte);
> > > + if (!pte_present(orig_src_pte) || is_zero_pfn(pte_pfn(orig_src_pte)))
> > > + return NULL;
> > > +
> > > + folio = vm_normal_folio(src_vma, src_addr, orig_src_pte);
> > > + if (!folio || !folio_trylock(folio))
> > > + return NULL;
> > > + if (!PageAnonExclusive(&folio->page) || folio_test_large(folio) ||
> > > + folio_anon_vma(folio) != src_anon_vma) {
> > > + folio_unlock(folio);
> > > + return NULL;
> > > + }
> > > + return folio;
> > > +}
> > > +
> >
> > I’m still quite confused by the code. Before move_present_ptes(), we’ve
> > already performed all the checks—pte_same(), vm_normal_folio(),
> > folio_trylock(), folio_test_large(), folio_get_anon_vma(),
> > and anon_vma_lock_write()—at least for the first PTE. Now we’re
> > duplicating them again for all PTEs. Does this mean we’re doing those
> > operations for the first PTE twice? It feels like the old non-batch check
> > code should be removed?
>
> This function should only start to work on the 2nd (or later) contiguous
> ptes to move while the same pgtable lock is held. We'll still need the
> original path because that one is sleepable, this one isn't, and this is
> only a best-effort fast path. E.g. if trylock() fails above, it would
> fall back to the slow path.
>
Thanks Peter. I was about to give exactly the same reasoning :)
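To spell it out with a rough sketch (illustrative only, not the literal
patch code; the dst-PTE construction, folio unlocking on exit and all
error handling are elided): only the first PTE goes through the full,
sleepable checks in move_pages_pte(); once both PTLs are held, every
later PTE only gets the cheap non-sleeping checks, and any failure
simply ends the batch so the remainder is retried via the existing path.

	/* sketch: runs with both PTLs held; the first folio was already
	 * validated, locked and its anon_vma write-locked by the
	 * sleepable slow path */
	for (;;) {
		pte_t pte = ptep_get_and_clear(mm, src_addr, src_pte);

		folio_move_anon_rmap(src_folio, dst_vma);
		src_folio->index = linear_page_index(dst_vma, dst_addr);
		/* the real code rebuilds the dst PTE (write/dirty/
		 * soft-dirty bits) for dst_vma; reusing the cleared src
		 * value here is a simplification */
		set_pte_at(mm, dst_addr, dst_pte, pte);

		src_addr += PAGE_SIZE;
		if (src_addr == src_start + len)
			break;
		dst_addr += PAGE_SIZE;
		src_pte++;
		dst_pte++;

		folio_unlock(src_folio);
		src_folio = check_ptes_for_batched_move(src_vma, src_addr,
							src_pte, dst_pte,
							src_anon_vma);
		if (!src_folio)
			break;	/* remainder goes back to the slow path */
	}
	/* one ranged TLB flush for the whole batch, instead of one
	 * ptep_clear_flush() per page */
	flush_tlb_range(src_vma, src_start, src_addr);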
> >
> > > +static long move_present_ptes(struct mm_struct *mm,
> > > + struct vm_area_struct *dst_vma,
> > > + struct vm_area_struct *src_vma,
> > > + unsigned long dst_addr, unsigned long src_addr,
> > > + pte_t *dst_pte, pte_t *src_pte,
> > > + pte_t orig_dst_pte, pte_t orig_src_pte,
> > > + pmd_t *dst_pmd, pmd_t dst_pmdval,
> > > + spinlock_t *dst_ptl, spinlock_t *src_ptl,
> > > + struct folio **first_src_folio, unsigned long len,
> > > + struct anon_vma *src_anon_vma)
> > > {
> > > int err = 0;
> > > + struct folio *src_folio = *first_src_folio;
> > > + unsigned long src_start = src_addr;
> > > + unsigned long addr_end;
> > > +
> > > + if (len > PAGE_SIZE) {
> > > + addr_end = (dst_addr + PMD_SIZE) & PMD_MASK;
> > > + len = MIN(addr_end - dst_addr, len);
> > > +
> > > + addr_end = (src_addr + PMD_SIZE) & PMD_MASK;
> > > + len = MIN(addr_end - src_addr, len);
> > > + }
> >
> > We already have a pmd_addr_end() helper—can we reuse it?
>
> I agree with Barry; I wanted to say this version didn't use the ALIGN() that
> I suggested, but pmd_addr_end() looks better.
ALIGN() couldn't be used as we are calculating "how many bytes to the
next PMD boundary" rather than just aligning the address. Anyway,
pmd_addr_end() is definitely better. Will do it in the next version.
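Concretely, the clamping in move_present_ptes() would become something
like this (just a sketch of the idea, same variable names as in the
diff above):

	if (len > PAGE_SIZE) {
		/* pmd_addr_end(addr, end) returns the next PMD boundary
		 * after addr, or end if that comes first, so neither the
		 * dst nor the src range can cross a PMD boundary */
		len = MIN(len, pmd_addr_end(dst_addr, dst_addr + len) - dst_addr);
		len = MIN(len, pmd_addr_end(src_addr, src_addr + len) - src_addr);
	}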
>
> Other than that this version looks good to me (plus the higher level
> performance results added to the commit message, as requested in v3),
> thanks Lokesh.
Thanks Peter. I'll update the commit message in v5.
>
> --
> Peter Xu
>