Message-ID: <CAGsJ_4zera4+bWuXHKjdU3QdiR3sqcUQB3gF08DPD49OCT4S9w@mail.gmail.com>
Date: Mon, 26 Feb 2024 17:00:43 +1300
From: Barry Song <21cnbao@...il.com>
To: Lance Yang <ioworker0@...il.com>
Cc: akpm@...ux-foundation.org, zokeefe@...gle.com, shy828301@...il.com,
david@...hat.com, mhocko@...e.com, ryan.roberts@....com,
wangkefeng.wang@...wei.com, songmuchun@...edance.com, peterx@...hat.com,
minchan@...nel.org, linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 1/1] mm/madvise: enhance lazyfreeing with mTHP in madvise_free
Hi Lance,
On Mon, Feb 26, 2024 at 1:33 AM Lance Yang <ioworker0@...il.com> wrote:
>
> This patch improves madvise_free_pte_range() to correctly
> handle large folios that are smaller than PMD-size
> (for example, 16KiB to 1024KiB[1]). It’s probably part of
> the preparation to support anonymous multi-size THP.
>
> Additionally, when the consecutive PTEs are mapped to
> consecutive pages of the same large folio (mTHP), if the
> folio is locked before madvise(MADV_FREE) or cannot be
> split, then all subsequent PTEs within the same PMD will
> be skipped. However, they should have been MADV_FREEed.
>
> Moreover, this patch also optimizes lazyfreeing with
> PTE-mapped mTHP (Inspired by David Hildenbrand[2]). We
> aim to avoid unnecessary folio splitting if the large
> folio is entirely within the given range.
>
We did something similar for MADV_PAGEOUT[1]:
[1] https://lore.kernel.org/linux-mm/20240118111036.72641-7-21cnbao@gmail.com/
> On an Intel I5 CPU, lazyfreeing a 1GiB VMA backed by
> PTE-mapped folios of the same size results in the following
> runtimes for madvise(MADV_FREE) in seconds (shorter is better):
>
> Folio Size | Old | New | Change
> ----------------------------------------------
> 4KiB | 0.590251 | 0.590264 | 0%
> 16KiB | 2.990447 | 0.182167 | -94%
> 32KiB | 2.547831 | 0.101622 | -96%
> 64KiB | 2.457796 | 0.049726 | -98%
> 128KiB | 2.281034 | 0.030109 | -99%
> 256KiB | 2.230387 | 0.015838 | -99%
> 512KiB | 2.189106 | 0.009149 | -99%
> 1024KiB | 2.183949 | 0.006620 | -99%
> 2048KiB | 0.002799 | 0.002795 | 0%
>
> [1] https://lkml.kernel.org/r/20231207161211.2374093-5-ryan.roberts@arm.com
> [2] https://lore.kernel.org/linux-mm/20240214204435.167852-1-david@redhat.com/
>
> Signed-off-by: Lance Yang <ioworker0@...il.com>
> ---
> mm/madvise.c | 69 +++++++++++++++++++++++++++++++++++++++++++---------
> 1 file changed, 58 insertions(+), 11 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index cfa5e7288261..bcbf56595a2e 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -676,11 +676,43 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> */
> if (folio_test_large(folio)) {
> int err;
> + unsigned long next_addr, align;
>
> - if (folio_estimated_sharers(folio) != 1)
> - break;
> - if (!folio_trylock(folio))
> - break;
> + if (folio_estimated_sharers(folio) != 1 ||
> + !folio_trylock(folio))
> + goto skip_large_folio;
> +
> + align = folio_nr_pages(folio) * PAGE_SIZE;
> + next_addr = ALIGN_DOWN(addr + align, align);
> +
> + /*
> + * If we mark only the subpages as lazyfree,
> + * split the large folio.
> + */
> + if (next_addr > end || next_addr - addr != align)
> + goto split_large_folio;
> +
> + /*
> + * Avoid unnecessary folio splitting if the large
> + * folio is entirely within the given range.
> + */
> + folio_test_clear_dirty(folio);
> + folio_unlock(folio);
> + for (; addr != next_addr; pte++, addr += PAGE_SIZE) {
> + ptent = ptep_get(pte);
> + if (pte_young(ptent) || pte_dirty(ptent)) {
> + ptent = ptep_get_and_clear_full(
> + mm, addr, pte, tlb->fullmm);
> + ptent = pte_mkold(ptent);
> + ptent = pte_mkclean(ptent);
> + set_pte_at(mm, addr, pte, ptent);
> + tlb_remove_tlb_entry(tlb, pte, addr);
> + }
The code works under the assumption that the large folio is
entirely mapped by all the PTEs in the range. This is not always
true: some PTEs might map pages of the large folio, while others
might have been unmapped or remapped to different folios.
That is why, in MADV_PAGEOUT, we have a function to check that
the folio is really entirely mapped:
+static inline bool pte_range_cont_mapped(unsigned long start_pfn,
+ pte_t *start_pte, unsigned long start_addr, int nr)
+{
+ int i;
+ pte_t pte_val;
+
+ for (i = 0; i < nr; i++) {
+ pte_val = ptep_get(start_pte + i);
+
+ if (pte_none(pte_val))
+ return false;
+
+ if (pte_pfn(pte_val) != (start_pfn + i))
+ return false;
+ }
+
+ return true;
+}
> + }
> + folio_mark_lazyfree(folio);
> + goto next_folio;
> +
> +split_large_folio:
> folio_get(folio);
> arch_leave_lazy_mmu_mode();
> pte_unmap_unlock(start_pte, ptl);
> @@ -688,13 +720,28 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> err = split_folio(folio);
> folio_unlock(folio);
> folio_put(folio);
> - if (err)
> - break;
> - start_pte = pte =
> - pte_offset_map_lock(mm, pmd, addr, &ptl);
> - if (!start_pte)
> - break;
> - arch_enter_lazy_mmu_mode();
> +
> + /*
> + * If the large folio is locked before madvise(MADV_FREE)
> + * or cannot be split, we just skip it.
> + */
> + if (err) {
> +skip_large_folio:
> + if (next_addr >= end)
> + break;
> + pte += (next_addr - addr) / PAGE_SIZE;
> + addr = next_addr;
> + }
> +
> + if (!start_pte) {
> + start_pte = pte = pte_offset_map_lock(
> + mm, pmd, addr, &ptl);
> + if (!start_pte)
> + break;
> + arch_enter_lazy_mmu_mode();
> + }
> +
> +next_folio:
> pte--;
> addr -= PAGE_SIZE;
> continue;
> --
> 2.33.1
>
>
Thanks
Barry