linux-kernel - Re: [PATCH mm-new v7 4/5] mm: khugepaged: skip lazy-free folios

Message-ID: <CAGsJ_4yEfgipUe37_k5rArrYMPY_31JUKQGjRk+NNJTK9QhBWQ@mail.gmail.com>
Date: Sun, 8 Feb 2026 06:01:51 +0800
From: Barry Song <21cnbao@...il.com>
To: "David Hildenbrand (Arm)" <david@...nel.org>
Cc: Lance Yang <lance.yang@...ux.dev>, Vernon Yang <vernon2gm@...il.com>, akpm@...ux-foundation.org, 
	lorenzo.stoakes@...cle.com, ziy@...dia.com, dev.jain@....com, 
	linux-mm@...ck.org, linux-kernel@...r.kernel.org, 
	Vernon Yang <yanglincheng@...inos.cn>
Subject: Re: [PATCH mm-new v7 4/5] mm: khugepaged: skip lazy-free folios

On Sun, Feb 8, 2026 at 5:38 AM David Hildenbrand (Arm) <david@...nel.org> wrote:
>
> On 2/7/26 14:51, Lance Yang wrote:
> >
> >
> > On 2026/2/7 16:34, Barry Song wrote:
> >> On Sat, Feb 7, 2026 at 4:16 PM Vernon Yang <vernon2gm@...il.com> wrote:
> >>>
> >>> From: Vernon Yang <yanglincheng@...inos.cn>
> >>>
> >>> For example, create three tasks: hot1 -> cold -> hot2. After all three
> >>> tasks are created, each allocates 128 MB of memory. The hot1/hot2 tasks
> >>> continuously access their 128 MB of memory, while the cold task only
> >>> accesses its memory briefly and then calls madvise(MADV_FREE). However,
> >>> khugepaged still prioritizes scanning the cold task and only scans the
> >>> hot2 task after completing the scan of the cold task.
> >>>
> >>> And if we collapse a range that contains a lazyfree page, its content
> >>> will never be none and the deferred shrinker cannot reclaim it.
> >>>
> >>> So if the user has explicitly informed us via MADV_FREE that this memory
> >>> will be freed, it is appropriate for khugepaged to simply skip it, thereby
> >>> avoiding unnecessary scan and collapse operations and reducing wasted
> >>> CPU.
> >>>
> >>> Here are the performance test results:
> >>> (Throughput: higher is better; all other metrics: lower is better)
> >>>
> >>> Testing on x86_64 machine:
> >>>
> >>> | task hot2           | without patch | with patch    |  delta  |
> >>> |---------------------|---------------|---------------|---------|
> >>> | total accesses time |  3.14 sec     |  2.93 sec     | -6.69%  |
> >>> | cycles per access   |  4.96         |  2.21         | -55.44% |
> >>> | Throughput          |  104.38 M/sec |  111.89 M/sec | +7.19%  |
> >>> | dTLB-load-misses    |  284814532    |  69597236     | -75.56% |
> >>>
> >>> Testing on qemu-system-x86_64 -enable-kvm:
> >>>
> >>> | task hot2           | without patch | with patch    |  delta  |
> >>> |---------------------|---------------|---------------|---------|
> >>> | total accesses time |  3.35 sec     |  2.96 sec     | -11.64% |
> >>> | cycles per access   |  7.29         |  2.07         | -71.60% |
> >>> | Throughput          |  97.67 M/sec  |  110.77 M/sec | +13.41% |
> >>> | dTLB-load-misses    |  241600871    |  3216108      | -98.67% |
> >>>
> >>> Signed-off-by: Vernon Yang <yanglincheng@...inos.cn>
> >>> Acked-by: David Hildenbrand (arm) <david@...nel.org>
> >>> Reviewed-by: Lance Yang <lance.yang@...ux.dev>
> >>> ---
> >>>   include/trace/events/huge_memory.h |  1 +
> >>>   mm/khugepaged.c                    | 13 +++++++++++++
> >>>   2 files changed, 14 insertions(+)
> >>>
> >>> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> >>> index 384e29f6bef0..bcdc57eea270 100644
> >>> --- a/include/trace/events/huge_memory.h
> >>> +++ b/include/trace/events/huge_memory.h
> >>> @@ -25,6 +25,7 @@
> >>>          EM( SCAN_PAGE_LRU,              "page_not_in_lru")              \
> >>>          EM( SCAN_PAGE_LOCK,             "page_locked")                  \
> >>>          EM( SCAN_PAGE_ANON,             "page_not_anon")                \
> >>> +       EM( SCAN_PAGE_LAZYFREE,         "page_lazyfree")                \
> >>>          EM( SCAN_PAGE_COMPOUND,         "page_compound")                \
> >>>          EM( SCAN_ANY_PROCESS,           "no_process_for_page")          \
> >>>          EM( SCAN_VMA_NULL,              "vma_null")                     \
> >>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> >>> index 8b68ae3bc2c5..0d160e612e16 100644
> >>> --- a/mm/khugepaged.c
> >>> +++ b/mm/khugepaged.c
> >>> @@ -46,6 +46,7 @@ enum scan_result {
> >>>          SCAN_PAGE_LRU,
> >>>          SCAN_PAGE_LOCK,
> >>>          SCAN_PAGE_ANON,
> >>> +       SCAN_PAGE_LAZYFREE,
> >>>          SCAN_PAGE_COMPOUND,
> >>>          SCAN_ANY_PROCESS,
> >>>          SCAN_VMA_NULL,
> >>> @@ -583,6 +584,12 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
> >>>                  folio = page_folio(page);
> >>>                  VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
> >>>
> >>> +               if (cc->is_khugepaged && !pte_dirty(pteval) &&
> >>> +                   folio_test_lazyfree(folio)) {
> >>
> >> We have two corner cases here:
> >
> > Good catch!
> >
> >>
> >> 1. Even if a lazyfree folio is dirty, if the VMA has the VM_DROPPABLE flag,
> >> a lazyfree folio may still be dropped, even when its PTE is dirty.
>
> Good point!
>
> >
> > Right. When the VMA has VM_DROPPABLE, we would drop the lazyfree folio
> > regardless of whether it (or the PTE) is dirty in try_to_unmap_one().
> >
> > So, IMHO, we could go with:
> >
> > cc->is_khugepaged && folio_test_lazyfree(folio) &&
> >      (!pte_dirty(pteval) || (vma->vm_flags & VM_DROPPABLE))
>
> Hm. In a VM_DROPPABLE mapping all folios should be marked as lazy-free
> (see folio_add_new_anon_rmap()).
>
> The new (collapsed) folio will also be marked lazy-free (due to
> folio_add_new_anon_rmap()) and can just get dropped any time.
>
> So likely we should just not skip collapse for lazyfree folios in
> VM_DROPPABLE mappings?

Maybe change “just not skip” to “just skip”?

If the goal is to avoid the collapse overhead for folios that are
about to be dropped, we might consider skipping collapse for the
entire VMA?

>
> if (cc->is_khugepaged && !(vma->vm_flags & VM_DROPPABLE) &&
>      folio_test_lazyfree(folio) && !pte_dirty(pteval)) {
>         ...
> }

Thanks
Barry
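
For readers comparing the three skip conditions quoted in this thread (the v7 patch as posted, Lance's suggestion, and David's suggestion), the following is a minimal userspace sketch, not kernel code. The struct scan_state and the skip_*() helper names are hypothetical and exist only for illustration; each boolean field stands in for the kernel check named in its comment.

```c
#include <stdbool.h>

/* Hypothetical model: each field stands in for one kernel-side check. */
struct scan_state {
	bool is_khugepaged;  /* cc->is_khugepaged */
	bool folio_lazyfree; /* folio_test_lazyfree(folio) */
	bool pte_dirty;      /* pte_dirty(pteval) */
	bool vma_droppable;  /* vma->vm_flags & VM_DROPPABLE */
};

/* Condition in the v7 patch as posted: skip clean lazyfree folios. */
static bool skip_v7(struct scan_state s)
{
	return s.is_khugepaged && !s.pte_dirty && s.folio_lazyfree;
}

/* Lance's suggestion: also skip dirty lazyfree folios in VM_DROPPABLE VMAs. */
static bool skip_lance(struct scan_state s)
{
	return s.is_khugepaged && s.folio_lazyfree &&
	       (!s.pte_dirty || s.vma_droppable);
}

/* David's suggestion: never take the skip path for VM_DROPPABLE VMAs. */
static bool skip_david(struct scan_state s)
{
	return s.is_khugepaged && !s.vma_droppable &&
	       s.folio_lazyfree && !s.pte_dirty;
}
```

In this model the three conditions agree for ordinary (non-VM_DROPPABLE) mappings and differ only when vma_droppable is true: Lance's variant always skips a lazyfree folio there, David's variant never does, and the v7 condition skips only when the PTE is clean.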