Message-ID: <CAGsJ_4wbXkfaAn+79g20SfE-0Ak4QACVP+Mw2vAvMnxBCcLAsQ@mail.gmail.com>
Date: Mon, 11 Aug 2025 11:55:36 +0800
From: Barry Song <21cnbao@...il.com>
To: Lokesh Gidra <lokeshgidra@...gle.com>
Cc: akpm@...ux-foundation.org, aarcange@...hat.com, linux-mm@...ck.org, 
	linux-kernel@...r.kernel.org, ngeoffray@...gle.com, 
	Suren Baghdasaryan <surenb@...gle.com>, Kalesh Singh <kaleshsingh@...gle.com>, 
	Barry Song <v-songbaohua@...o.com>, David Hildenbrand <david@...hat.com>, Peter Xu <peterx@...hat.com>
Subject: Re: [PATCH v4] userfaultfd: opportunistic TLB-flush batching for
 present pages in MOVE

Hi Lokesh,


On Sun, Aug 10, 2025 at 2:29 PM Lokesh Gidra <lokeshgidra@...gle.com> wrote:
>
> MOVE ioctl's runtime is dominated by TLB-flush cost, which is required
> for moving present pages. Mitigate this cost by opportunistically
> batching present contiguous pages for TLB flushing.
>
> Without batching, in our testing on an arm64 Android device with UFFD GC,
> which uses MOVE ioctl for compaction, we observed that out of the total
> time spent in move_pages_pte(), over 40% is in ptep_clear_flush(), and
> ~20% in vm_normal_folio().
>
> With batching, the proportion of vm_normal_folio() increases to over
> 70% of move_pages_pte() without any changes to vm_normal_folio().
> Furthermore, time spent within move_pages_pte() is only ~20%, which
> includes TLB-flush overhead.
>
> Cc: Suren Baghdasaryan <surenb@...gle.com>
> Cc: Kalesh Singh <kaleshsingh@...gle.com>
> Cc: Barry Song <v-songbaohua@...o.com>
> Cc: David Hildenbrand <david@...hat.com>
> Cc: Peter Xu <peterx@...hat.com>
> Signed-off-by: Lokesh Gidra <lokeshgidra@...gle.com>
> ---
> Changes since v3 [1]
> - Fix uninitialized 'step_size' warning, per Dan Carpenter
> - Removed pmd_none() from check_ptes_for_batched_move(), per Peter Xu
> - Removed flush_cache_range() in zero-page case, per Peter Xu
> - Added comment to explain why folio reference for batched pages is not
>   required, per Peter Xu
> - Use MIN() in calculation of largest extent that can be batched under
>   the same src and dst PTLs, per Peter Xu
> - Release first folio's reference in move_present_ptes(), per Peter Xu
>
> Changes since v2 [2]
> - Addressed VM_WARN_ON failure, per Lorenzo Stoakes
> - Added check to ensure all batched pages share the same anon_vma
>
> Changes since v1 [3]
> - Removed flush_tlb_batched_pending(), per Barry Song
> - Unified single and multi page case, per Barry Song
>
> [1] https://lore.kernel.org/all/20250807103902.2242717-1-lokeshgidra@google.com/
> [2] https://lore.kernel.org/all/20250805121410.1658418-1-lokeshgidra@google.com/
> [3] https://lore.kernel.org/all/20250731104726.103071-1-lokeshgidra@google.com/
>
>  mm/userfaultfd.c | 178 +++++++++++++++++++++++++++++++++--------------
>  1 file changed, 127 insertions(+), 51 deletions(-)
>
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index cbed91b09640..39d81d2972db 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -1026,18 +1026,64 @@ static inline bool is_pte_pages_stable(pte_t *dst_pte, pte_t *src_pte,
>                pmd_same(dst_pmdval, pmdp_get_lockless(dst_pmd));
>  }
>
> -static int move_present_pte(struct mm_struct *mm,
> -                           struct vm_area_struct *dst_vma,
> -                           struct vm_area_struct *src_vma,
> -                           unsigned long dst_addr, unsigned long src_addr,
> -                           pte_t *dst_pte, pte_t *src_pte,
> -                           pte_t orig_dst_pte, pte_t orig_src_pte,
> -                           pmd_t *dst_pmd, pmd_t dst_pmdval,
> -                           spinlock_t *dst_ptl, spinlock_t *src_ptl,
> -                           struct folio *src_folio)
> +/*
> + * Checks if the two ptes and the corresponding folio are eligible for batched
> + * move. If so, then returns pointer to the locked folio. Otherwise, returns NULL.
> + *
> + * NOTE: folio's reference is not required as the whole operation is within
> + * PTL's critical section.
> + */
> +static struct folio *check_ptes_for_batched_move(struct vm_area_struct *src_vma,
> +                                                unsigned long src_addr,
> +                                                pte_t *src_pte, pte_t *dst_pte,
> +                                                struct anon_vma *src_anon_vma)
> +{
> +       pte_t orig_dst_pte, orig_src_pte;
> +       struct folio *folio;
> +
> +       orig_dst_pte = ptep_get(dst_pte);
> +       if (!pte_none(orig_dst_pte))
> +               return NULL;
> +
> +       orig_src_pte = ptep_get(src_pte);
> +       if (!pte_present(orig_src_pte) || is_zero_pfn(pte_pfn(orig_src_pte)))
> +               return NULL;
> +
> +       folio = vm_normal_folio(src_vma, src_addr, orig_src_pte);
> +       if (!folio || !folio_trylock(folio))
> +               return NULL;
> +       if (!PageAnonExclusive(&folio->page) || folio_test_large(folio) ||
> +           folio_anon_vma(folio) != src_anon_vma) {
> +               folio_unlock(folio);
> +               return NULL;
> +       }
> +       return folio;
> +}
> +

I’m still a bit confused by this code. Before move_present_ptes(), we’ve
already performed all of these checks (pte_same(), vm_normal_folio(),
folio_trylock(), folio_test_large(), folio_get_anon_vma(),
and anon_vma_lock_write()), at least for the first PTE. Now we’re
repeating them for every PTE in the batch. Doesn’t that mean the first
PTE is checked twice? It seems the old non-batch check code should be
removed?

> +static long move_present_ptes(struct mm_struct *mm,
> +                             struct vm_area_struct *dst_vma,
> +                             struct vm_area_struct *src_vma,
> +                             unsigned long dst_addr, unsigned long src_addr,
> +                             pte_t *dst_pte, pte_t *src_pte,
> +                             pte_t orig_dst_pte, pte_t orig_src_pte,
> +                             pmd_t *dst_pmd, pmd_t dst_pmdval,
> +                             spinlock_t *dst_ptl, spinlock_t *src_ptl,
> +                             struct folio **first_src_folio, unsigned long len,
> +                             struct anon_vma *src_anon_vma)
>  {
>         int err = 0;
> +       struct folio *src_folio = *first_src_folio;
> +       unsigned long src_start = src_addr;
> +       unsigned long addr_end;
> +
> +       if (len > PAGE_SIZE) {
> +               addr_end = (dst_addr + PMD_SIZE) & PMD_MASK;
> +               len = MIN(addr_end - dst_addr, len);
>
> +               addr_end = (src_addr + PMD_SIZE) & PMD_MASK;
> +               len = MIN(addr_end - src_addr, len);
> +       }

We already have a pmd_addr_end() helper—can we reuse it?

[...]

>         /*
> @@ -1257,7 +1327,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
>                 if (!(mode & UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES))
>                         err = -ENOENT;
>                 else /* nothing to do to move a hole */
> -                       err = 0;
> +                       err = PAGE_SIZE;

To be honest, I find `err = PAGE_SIZE` quite odd :-) Could we refine the
code to make it more readable?

[...]

> @@ -1857,10 +1930,13 @@ ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start,
>                                 break;
>                         }
>
> -                       err = move_pages_pte(mm, dst_pmd, src_pmd,
> -                                            dst_vma, src_vma,
> -                                            dst_addr, src_addr, mode);
> -                       step_size = PAGE_SIZE;
> +                       ret = move_pages_ptes(mm, dst_pmd, src_pmd,
> +                                             dst_vma, src_vma, dst_addr,
> +                                             src_addr, src_end - src_addr, mode);
> +                       if (ret < 0)
> +                               err = ret;
> +                       else
> +                               step_size = ret;

also looks a bit strange :-)

>                 }
>
>                 cond_resched();
>
> base-commit: 561c80369df0733ba0574882a1635287b20f9de2
> --
> 2.50.1.703.g449372360f-goog
>

Thanks
Barry
