linux-kernel - [PATCH v2 1/1] mm/madvise: enhance lazyfreeing with mTHP in madvise_free

Message-Id: <20240307061425.21013-1-ioworker0@gmail.com>
Date: Thu,  7 Mar 2024 14:14:25 +0800
From: Lance Yang <ioworker0@...il.com>
To: akpm@...ux-foundation.org
Cc: zokeefe@...gle.com,
	ryan.roberts@....com,
	21cnbao@...il.com,
	shy828301@...il.com,
	david@...hat.com,
	mhocko@...e.com,
	fengwei.yin@...el.com,
	xiehuan09@...il.com,
	wangkefeng.wang@...wei.com,
	songmuchun@...edance.com,
	peterx@...hat.com,
	minchan@...nel.org,
	linux-mm@...ck.org,
	linux-kernel@...r.kernel.org,
	Lance Yang <ioworker0@...il.com>
Subject: [PATCH v2 1/1] mm/madvise: enhance lazyfreeing with mTHP in madvise_free

This patch optimizes lazyfreeing with PTE-mapped mTHP[1]
(inspired by David Hildenbrand[2]). It avoids unnecessary folio
splitting when the large folio lies entirely within the given
range.

On an Intel i5 CPU, lazyfreeing a 1GiB VMA backed entirely by
PTE-mapped folios of one size (listed per row) results in the
following runtimes for madvise(MADV_FREE), in seconds (shorter is
better):

Folio Size |   Old    |   New    | Change
------------------------------------------
      4KiB | 0.590251 | 0.590259 |    0%
     16KiB | 2.990447 | 0.185655 |  -94%
     32KiB | 2.547831 | 0.104870 |  -95%
     64KiB | 2.457796 | 0.052812 |  -97%
    128KiB | 2.281034 | 0.032777 |  -99%
    256KiB | 2.230387 | 0.017496 |  -99%
    512KiB | 2.189106 | 0.010781 |  -99%
   1024KiB | 2.183949 | 0.007753 |  -99%
   2048KiB | 0.002799 | 0.002804 |    0%
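
A minimal user-space sketch of how a timing like the ones above might be
taken. It is not the test program used for the table. The helper name
time_madv_free() is hypothetical, and the sketch does not control the folio
size backing the mapping, which would have to be configured separately on
the test system.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for MADV_FREE and MAP_ANONYMOUS on glibc */
#endif
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

/*
 * Hypothetical helper: map len bytes of private anonymous memory, fault
 * every page in, and return the wall-clock seconds spent inside
 * madvise(MADV_FREE). Returns -1.0 if any step fails.
 */
static double time_madv_free(size_t len)
{
	struct timespec t0, t1;
	int ret;
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return -1.0;

	/* Write to the whole range so that it is backed by real pages. */
	memset(buf, 1, len);

	clock_gettime(CLOCK_MONOTONIC, &t0);
	ret = madvise(buf, len, MADV_FREE);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	munmap(buf, len);
	if (ret != 0)
		return -1.0;

	return (double)(t1.tv_sec - t0.tv_sec) +
	       (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

Calling time_madv_free(1UL << 30) would time a 1GiB mapping, as in the
table. A single call is noisy, so repeated runs would be needed for
comparable numbers.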

[1] https://lkml.kernel.org/r/20231207161211.2374093-5-ryan.roberts@arm.com
[2] https://lore.kernel.org/linux-mm/20240214204435.167852-1-david@redhat.com/

Signed-off-by: Lance Yang <ioworker0@...il.com>
---
v1 -> v2:
 * Update the performance numbers
 * Update the changelog, suggested by Ryan Roberts
 * Check the COW folio, suggested by Yin Fengwei
 * Check if we are mapping all subpages, suggested by Barry Song,
 David Hildenbrand, Ryan Roberts
 * https://lore.kernel.org/linux-mm/20240225123215.86503-1-ioworker0@gmail.com/
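
The "entirely within the given range" test in the diff below combines an
alignment check with a bounds check. A user-space model of that arithmetic
is sketched here. MODEL_PAGE_SIZE, MODEL_ALIGN_DOWN and
folio_fully_in_range() are illustrative names, not kernel symbols. The
sketch assumes 4KiB pages and a power-of-two folio size that is naturally
aligned in the virtual address space.

```c
#include <assert.h>
#include <stdbool.h>

/* User-space stand-ins for the kernel's PAGE_SIZE and ALIGN_DOWN(). */
#define MODEL_PAGE_SIZE		4096UL
#define MODEL_ALIGN_DOWN(x, a)	((x) & ~((a) - 1))

/*
 * Return true when a folio of nr_pages pages, whose PTE at addr is the
 * one currently being visited, starts exactly at addr and ends at or
 * before end. This mirrors the negation of the patch's condition
 * "next_addr > end || next_addr - addr != align".
 */
static bool folio_fully_in_range(unsigned long addr, unsigned long end,
				 unsigned long nr_pages)
{
	unsigned long align = nr_pages * MODEL_PAGE_SIZE;
	unsigned long next_addr;

	/* The mask-based round-down only works for powers of two. */
	assert(nr_pages != 0 && (nr_pages & (nr_pages - 1)) == 0);

	next_addr = MODEL_ALIGN_DOWN(addr + align, align);

	return next_addr <= end && next_addr - addr == align;
}
```

For a 16KiB folio (4 pages), addr = 0x10000 with end = 0x14000 passes.
addr = 0x11000 fails because the walk starts in the middle of the folio,
and end = 0x13000 fails because the folio extends past the range.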

 mm/madvise.c | 85 +++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 74 insertions(+), 11 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 44a498c94158..1437ac6eb25e 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -616,6 +616,20 @@ static long madvise_pageout(struct vm_area_struct *vma,
 	return 0;
 }
 
+static inline bool can_mark_large_folio_lazyfree(unsigned long addr,
+						 struct folio *folio, pte_t *start_pte)
+{
+	int nr_pages = folio_nr_pages(folio);
+	fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
+
+	for (int i = 0; i < nr_pages; i++)
+		if (page_mapcount(folio_page(folio, i)) != 1)
+			return false;
+
+	return nr_pages == folio_pte_batch(folio, addr, start_pte,
+					 ptep_get(start_pte), nr_pages, flags, NULL);
+}
+
 static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 				unsigned long end, struct mm_walk *walk)
 
@@ -676,11 +690,45 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 		 */
 		if (folio_test_large(folio)) {
 			int err;
+			unsigned long next_addr, align;
 
-			if (folio_estimated_sharers(folio) != 1)
-				break;
-			if (!folio_trylock(folio))
-				break;
+			align = folio_nr_pages(folio) * PAGE_SIZE;
+			next_addr = ALIGN_DOWN(addr + align, align);
+
+			if (folio_estimated_sharers(folio) != 1 ||
+			    !folio_trylock(folio))
+				goto skip_large_folio;
+
+			/*
+			 * If only some of the subpages fall within the
+			 * range, or the entire large folio cannot be
+			 * marked lazyfree, then just split it.
+			 */
+			if (next_addr > end || next_addr - addr != align ||
+			    !can_mark_large_folio_lazyfree(addr, folio, pte))
+				goto split_large_folio;
+
+			/*
+			 * Avoid unnecessary folio splitting if the large
+			 * folio is entirely within the given range.
+			 */
+			folio_clear_dirty(folio);
+			folio_unlock(folio);
+			for (; addr != next_addr; pte++, addr += PAGE_SIZE) {
+				ptent = ptep_get(pte);
+				if (pte_young(ptent) || pte_dirty(ptent)) {
+					ptent = ptep_get_and_clear_full(
+						mm, addr, pte, tlb->fullmm);
+					ptent = pte_mkold(ptent);
+					ptent = pte_mkclean(ptent);
+					set_pte_at(mm, addr, pte, ptent);
+					tlb_remove_tlb_entry(tlb, pte, addr);
+				}
+			}
+			folio_mark_lazyfree(folio);
+			goto next_folio;
+
+split_large_folio:
 			folio_get(folio);
 			arch_leave_lazy_mmu_mode();
 			pte_unmap_unlock(start_pte, ptl);
@@ -688,13 +736,28 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 			err = split_folio(folio);
 			folio_unlock(folio);
 			folio_put(folio);
-			if (err)
-				break;
-			start_pte = pte =
-				pte_offset_map_lock(mm, pmd, addr, &ptl);
-			if (!start_pte)
-				break;
-			arch_enter_lazy_mmu_mode();
+
+			/*
+			 * If the large folio is locked or cannot be split,
+			 * we just skip it.
+			 */
+			if (err) {
+skip_large_folio:
+				if (next_addr >= end)
+					break;
+				pte += (next_addr - addr) / PAGE_SIZE;
+				addr = next_addr;
+			}
+
+			if (!start_pte) {
+				start_pte = pte = pte_offset_map_lock(
+					mm, pmd, addr, &ptl);
+				if (!start_pte)
+					break;
+				arch_enter_lazy_mmu_mode();
+			}
+
+next_folio:
 			pte--;
 			addr -= PAGE_SIZE;
 			continue;
-- 
2.33.1