linux-kernel - Re: [PATCH v5] userfaultfd: opportunistic TLB-flush batching for present pages in MOVE

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAGsJ_4wt_30C82B9_ZoOB2umBi-u_kE441ARvYoJVjkLtLqFCg@mail.gmail.com>
Date: Fri, 15 Aug 2025 22:11:24 +1200
From: Barry Song <21cnbao@...il.com>
To: Lokesh Gidra <lokeshgidra@...gle.com>
Cc: akpm@...ux-foundation.org, aarcange@...hat.com, linux-mm@...ck.org, 
	linux-kernel@...r.kernel.org, ngeoffray@...gle.com, 
	Suren Baghdasaryan <surenb@...gle.com>, Kalesh Singh <kaleshsingh@...gle.com>, 
	Barry Song <v-songbaohua@...o.com>, David Hildenbrand <david@...hat.com>, Peter Xu <peterx@...hat.com>
Subject: Re: [PATCH v5] userfaultfd: opportunistic TLB-flush batching for
 present pages in MOVE

On Fri, Aug 15, 2025 at 9:44 PM Barry Song <21cnbao@...il.com> wrote:
>
> On Thu, Aug 14, 2025 at 7:30 AM Lokesh Gidra <lokeshgidra@...gle.com> wrote:
> >
> > MOVE ioctl's runtime is dominated by TLB-flush cost, which is required
> > for moving present pages. Mitigate this cost by opportunistically
> > batching present contiguous pages for TLB flushing.
> >
> > Without batching, in our testing on an arm64 Android device with UFFD GC,
> > which uses MOVE ioctl for compaction, we observed that out of the total
> > time spent in move_pages_pte(), over 40% is in ptep_clear_flush(), and
> > ~20% in vm_normal_folio().
> >
> > With batching, the proportion of vm_normal_folio() increases to over
> > 70% of move_pages_pte() without any changes to vm_normal_folio().
> > Furthermore, time spent within move_pages_pte() is only ~20%, which
> > includes TLB-flush overhead.
> >
> > When the GC intensive benchmark, which was used to gather the above
> > numbers, is run on cuttlefish (qemu android instance on x86_64), the
> > completion time of the benchmark went down from ~45mins to ~20mins.
> >
> > Furthermore, system_server, one of the most performance critical system
> > processes on android, saw over 50% reduction in GC compaction time on an
> > arm64 android device.
> >
> > Cc: Suren Baghdasaryan <surenb@...gle.com>
> > Cc: Kalesh Singh <kaleshsingh@...gle.com>
> > Cc: Barry Song <v-songbaohua@...o.com>
> > Cc: David Hildenbrand <david@...hat.com>
> > Cc: Peter Xu <peterx@...hat.com>
> > Signed-off-by: Lokesh Gidra <lokeshgidra@...gle.com>
>
> Reviewed-by: Barry Song <baohua@...nel.org>
>
> [...]
> > +static long move_present_ptes(struct mm_struct *mm,
> > +                             struct vm_area_struct *dst_vma,
> > +                             struct vm_area_struct *src_vma,
> > +                             unsigned long dst_addr, unsigned long src_addr,
> > +                             pte_t *dst_pte, pte_t *src_pte,
> > +                             pte_t orig_dst_pte, pte_t orig_src_pte,
> > +                             pmd_t *dst_pmd, pmd_t dst_pmdval,
> > +                             spinlock_t *dst_ptl, spinlock_t *src_ptl,
> > +                             struct folio **first_src_folio, unsigned long len,
> > +                             struct anon_vma *src_anon_vma)
> > +{
> > +       int err = 0;
> > +       struct folio *src_folio = *first_src_folio;
> > +       unsigned long src_start = src_addr;
> > +       unsigned long src_end;
> > +
> > +       if (len > PAGE_SIZE) {
> > +               len = pmd_addr_end(dst_addr, dst_addr + len) - dst_addr;
> > +               src_end = pmd_addr_end(src_addr, src_addr + len);
> > +       } else
> > +               src_end = src_addr + len;
>
> Nit:
>
> Look at Documentation/process/coding-style.rst.
>
> This does not apply if only one branch of a conditional statement is a single
> statement; in the latter case use braces in both branches:
>
> .. code-block:: c
>
>     if (condition) {
>         do_this();
>         do_that();
>     } else {
>         otherwise();
>     }
>
> By the way, what about the following for both cases? Would it impact
> performance in the `PAGE_SIZE` cases?
>
> len = pmd_addr_end(dst_addr, dst_addr + len) - dst_addr;
> src_end = pmd_addr_end(src_addr, src_addr + len);

By the way, do src and dst always have the same offset within a
single PMD? I don’t think so. If not, how can we verify that if
src’s PMD is not overflowing, dst is safe as well?

Have you only checked src? And for src, since you are already using
pmd_addr_end(), is src_end = src_addr + len fine? Why are you calling
pmd_addr_end twice after your first pmd_addr_end has already limited
the range?

Thanks
Barry