Message-ID: <CA+EESO4JFCR5P9PFoY2zo+X1Y-qv+-yy8X887isoqXwfQBtn1Q@mail.gmail.com>
Date: Tue, 12 Aug 2025 08:50:34 -0700
From: Lokesh Gidra <lokeshgidra@...gle.com>
To: Barry Song <21cnbao@...il.com>
Cc: akpm@...ux-foundation.org, aarcange@...hat.com, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, ngeoffray@...gle.com,
Suren Baghdasaryan <surenb@...gle.com>, Kalesh Singh <kaleshsingh@...gle.com>,
Barry Song <v-songbaohua@...o.com>, David Hildenbrand <david@...hat.com>, Peter Xu <peterx@...hat.com>
Subject: Re: [PATCH v4] userfaultfd: opportunistic TLB-flush batching for
present pages in MOVE
On Sun, Aug 10, 2025 at 8:55 PM Barry Song <21cnbao@...il.com> wrote:
>
> Hi Lokesh,
>
>
> On Sun, Aug 10, 2025 at 2:29 PM Lokesh Gidra <lokeshgidra@...gle.com> wrote:
> >
> > MOVE ioctl's runtime is dominated by TLB-flush cost, which is required
> > for moving present pages. Mitigate this cost by opportunistically
> > batching present contiguous pages for TLB flushing.
> >
> > Without batching, in our testing on an arm64 Android device with UFFD GC,
> > which uses MOVE ioctl for compaction, we observed that out of the total
> > time spent in move_pages_pte(), over 40% is in ptep_clear_flush(), and
> > ~20% in vm_normal_folio().
> >
> > With batching, the proportion of vm_normal_folio() increases to over
> > 70% of move_pages_pte() without any changes to vm_normal_folio().
> > Furthermore, time spent within move_pages_pte() is only ~20%, which
> > includes TLB-flush overhead.
> >
> > Cc: Suren Baghdasaryan <surenb@...gle.com>
> > Cc: Kalesh Singh <kaleshsingh@...gle.com>
> > Cc: Barry Song <v-songbaohua@...o.com>
> > Cc: David Hildenbrand <david@...hat.com>
> > Cc: Peter Xu <peterx@...hat.com>
> > Signed-off-by: Lokesh Gidra <lokeshgidra@...gle.com>
> > ---
> > Changes since v3 [1]
> > - Fix uninitialized 'step_size' warning, per Dan Carpenter
> > - Removed pmd_none() from check_ptes_for_batched_move(), per Peter Xu
> > - Removed flush_cache_range() in zero-page case, per Peter Xu
> > - Added comment to explain why folio reference for batched pages is not
> > required, per Peter Xu
> > - Use MIN() in calculation of largest extent that can be batched under
> > the same src and dst PTLs, per Peter Xu
> > - Release first folio's reference in move_present_ptes(), per Peter Xu
> >
> > Changes since v2 [2]
> > - Addressed VM_WARN_ON failure, per Lorenzo Stoakes
> > - Added check to ensure all batched pages share the same anon_vma
> >
> > Changes since v1 [3]
> > - Removed flush_tlb_batched_pending(), per Barry Song
> > - Unified single and multi page case, per Barry Song
> >
> > [1] https://lore.kernel.org/all/20250807103902.2242717-1-lokeshgidra@google.com/
> > [2] https://lore.kernel.org/all/20250805121410.1658418-1-lokeshgidra@google.com/
> > [3] https://lore.kernel.org/all/20250731104726.103071-1-lokeshgidra@google.com/
> >
> > mm/userfaultfd.c | 178 +++++++++++++++++++++++++++++++++--------------
> > 1 file changed, 127 insertions(+), 51 deletions(-)
> >
> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > index cbed91b09640..39d81d2972db 100644
> > --- a/mm/userfaultfd.c
> > +++ b/mm/userfaultfd.c
> > @@ -1026,18 +1026,64 @@ static inline bool is_pte_pages_stable(pte_t *dst_pte, pte_t *src_pte,
> > pmd_same(dst_pmdval, pmdp_get_lockless(dst_pmd));
> > }
> >
> > -static int move_present_pte(struct mm_struct *mm,
> > - struct vm_area_struct *dst_vma,
> > - struct vm_area_struct *src_vma,
> > - unsigned long dst_addr, unsigned long src_addr,
> > - pte_t *dst_pte, pte_t *src_pte,
> > - pte_t orig_dst_pte, pte_t orig_src_pte,
> > - pmd_t *dst_pmd, pmd_t dst_pmdval,
> > - spinlock_t *dst_ptl, spinlock_t *src_ptl,
> > - struct folio *src_folio)
> > +/*
> > + * Checks if the two ptes and the corresponding folio are eligible for batched
> > + * move. If so, then returns pointer to the locked folio. Otherwise, returns NULL.
> > + *
> > + * NOTE: folio's reference is not required as the whole operation is within
> > + * PTL's critical section.
> > + */
> > +static struct folio *check_ptes_for_batched_move(struct vm_area_struct *src_vma,
> > + unsigned long src_addr,
> > + pte_t *src_pte, pte_t *dst_pte,
> > + struct anon_vma *src_anon_vma)
> > +{
> > + pte_t orig_dst_pte, orig_src_pte;
> > + struct folio *folio;
> > +
> > + orig_dst_pte = ptep_get(dst_pte);
> > + if (!pte_none(orig_dst_pte))
> > + return NULL;
> > +
> > + orig_src_pte = ptep_get(src_pte);
> > + if (!pte_present(orig_src_pte) || is_zero_pfn(pte_pfn(orig_src_pte)))
> > + return NULL;
> > +
> > + folio = vm_normal_folio(src_vma, src_addr, orig_src_pte);
> > + if (!folio || !folio_trylock(folio))
> > + return NULL;
> > + if (!PageAnonExclusive(&folio->page) || folio_test_large(folio) ||
> > + folio_anon_vma(folio) != src_anon_vma) {
> > + folio_unlock(folio);
> > + return NULL;
> > + }
> > + return folio;
> > +}
> > +
>
> I’m still quite confused by the code. Before move_present_ptes(), we’ve
> already performed all the checks—pte_same(), vm_normal_folio(),
> folio_trylock(), folio_test_large(), folio_get_anon_vma(),
> and anon_vma_lock_write()—at least for the first PTE. Now we’re
> duplicating them again for all PTEs. Does this mean we’re doing those
> operations for the first PTE twice? It feels like the old non-batch check
> code should be removed?
>
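For reference, one way move_present_ptes() could be structured so that the
first PTE is not re-validated (rough sketch only, untested, loop body elided;
it assumes the caller has already checked and locked *first_src_folio, as the
signature below suggests):

	unsigned long src_end = src_addr + len;

	for (;;) {
		/* ... move the current, already validated folio ... */

		src_addr += PAGE_SIZE;
		dst_addr += PAGE_SIZE;
		if (src_addr >= src_end)
			break;

		src_pte++;
		dst_pte++;
		/* only the PTEs after the first need the batched-move checks */
		src_folio = check_ptes_for_batched_move(src_vma, src_addr,
							src_pte, dst_pte,
							src_anon_vma);
		if (!src_folio)
			break;
	}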
> > +static long move_present_ptes(struct mm_struct *mm,
> > + struct vm_area_struct *dst_vma,
> > + struct vm_area_struct *src_vma,
> > + unsigned long dst_addr, unsigned long src_addr,
> > + pte_t *dst_pte, pte_t *src_pte,
> > + pte_t orig_dst_pte, pte_t orig_src_pte,
> > + pmd_t *dst_pmd, pmd_t dst_pmdval,
> > + spinlock_t *dst_ptl, spinlock_t *src_ptl,
> > + struct folio **first_src_folio, unsigned long len,
> > + struct anon_vma *src_anon_vma)
> > {
> > int err = 0;
> > + struct folio *src_folio = *first_src_folio;
> > + unsigned long src_start = src_addr;
> > + unsigned long addr_end;
> > +
> > + if (len > PAGE_SIZE) {
> > + addr_end = (dst_addr + PMD_SIZE) & PMD_MASK;
> > + len = MIN(addr_end - dst_addr, len);
> >
> > + addr_end = (src_addr + PMD_SIZE) & PMD_MASK;
> > + len = MIN(addr_end - src_addr, len);
> > + }
>
> We already have a pmd_addr_end() helper—can we reuse it?
>
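For illustration, pmd_addr_end() could replace the open-coded clamping
roughly like this (untested sketch, equivalent to the MIN() math above):

	if (len > PAGE_SIZE) {
		len = pmd_addr_end(dst_addr, dst_addr + len) - dst_addr;
		len = pmd_addr_end(src_addr, src_addr + len) - src_addr;
	}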
> [...]
>
> > /*
> > @@ -1257,7 +1327,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> > if (!(mode & UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES))
> > err = -ENOENT;
> > else /* nothing to do to move a hole */
> > - err = 0;
> > + err = PAGE_SIZE;
>
> To be honest, I find `err = PAGE_SIZE` quite odd :-) Could we refine the
> code to make it more readable?
>
Agreed! I'll rename 'err' to 'ret', since the function no longer returns
only an error code but also the number of bytes moved on success.
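Something along these lines (sketch only, final naming may differ):

		if (!(mode & UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES))
			ret = -ENOENT;
		else	/* nothing to do to move a hole */
			ret = PAGE_SIZE; /* one page handled, not an error */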
> [...]
>
> > @@ -1857,10 +1930,13 @@ ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start,
> > break;
> > }
> >
> > - err = move_pages_pte(mm, dst_pmd, src_pmd,
> > - dst_vma, src_vma,
> > - dst_addr, src_addr, mode);
> > - step_size = PAGE_SIZE;
> > + ret = move_pages_ptes(mm, dst_pmd, src_pmd,
> > + dst_vma, src_vma, dst_addr,
> > + src_addr, src_end - src_addr, mode);
> > + if (ret < 0)
> > + err = ret;
> > + else
> > + step_size = ret;
>
> also looks a bit strange :-)
Any suggestions on how to improve this? 'step_size' is expected to be
different in each iteration of the loop even without this patch.
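For what it's worth, one possible shape (rough sketch; it assumes the
surrounding loop bails out on error right after this anyway):

			ret = move_pages_ptes(mm, dst_pmd, src_pmd,
					      dst_vma, src_vma, dst_addr,
					      src_addr, src_end - src_addr, mode);
			if (ret < 0) {
				err = ret;
				break;
			}
			step_size = ret;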
>
> > }
> >
> > cond_resched();
> >
> > base-commit: 561c80369df0733ba0574882a1635287b20f9de2
> > --
> > 2.50.1.703.g449372360f-goog
> >
>
> Thanks
> Barry