linux-kernel - [PATCH 1/1] mm/madvise: enhance lazyfreeing with mTHP in madvise

Message-Id: <20240225123215.86503-1-ioworker0@gmail.com>
Date: Sun, 25 Feb 2024 20:32:15 +0800
From: Lance Yang <ioworker0@...il.com>
To: akpm@...ux-foundation.org
Cc: zokeefe@...gle.com,
	shy828301@...il.com,
	david@...hat.com,
	mhocko@...e.com,
	ryan.roberts@....com,
	wangkefeng.wang@...wei.com,
	songmuchun@...edance.com,
	peterx@...hat.com,
	minchan@...nel.org,
	linux-mm@...ck.org,
	linux-kernel@...r.kernel.org,
	Lance Yang <ioworker0@...il.com>
Subject: [PATCH 1/1] mm/madvise: enhance lazyfreeing with mTHP in madvise_free

This patch improves madvise_free_pte_range() to correctly
handle large folios that are smaller than PMD size
(for example, 16KiB to 1024KiB[1]). It helps prepare for
anonymous multi-size THP (mTHP) support.

Additionally, when consecutive PTEs map consecutive pages
of the same large folio (mTHP) and the folio is already
locked when madvise(MADV_FREE) is called, or cannot be
split, all subsequent PTEs within the same PMD are currently
skipped, although they should have been MADV_FREEed. With
this patch only that folio is skipped.

Moreover, this patch optimizes lazyfreeing of PTE-mapped
mTHP (inspired by David Hildenbrand[2]): a large folio that
lies entirely within the given range is no longer split
before being marked lazyfree.

On an Intel i5 CPU, lazyfreeing a 1GiB VMA backed by
PTE-mapped folios of the same size results in the following
runtimes for madvise(MADV_FREE), in seconds (lower is better):

Folio Size  |    Old     |    New     |  Change
----------------------------------------------
      4KiB  |  0.590251  |  0.590264  |     0%
     16KiB  |  2.990447  |  0.182167  |   -94%
     32KiB  |  2.547831  |  0.101622  |   -96%
     64KiB  |  2.457796  |  0.049726  |   -98%
    128KiB  |  2.281034  |  0.030109  |   -99%
    256KiB  |  2.230387  |  0.015838  |   -99%
    512KiB  |  2.189106  |  0.009149  |   -99%
   1024KiB  |  2.183949  |  0.006620  |   -99%
   2048KiB  |  0.002799  |  0.002795  |     0%

[1] https://lkml.kernel.org/r/20231207161211.2374093-5-ryan.roberts@arm.com
[2] https://lore.kernel.org/linux-mm/20240214204435.167852-1-david@redhat.com/

Signed-off-by: Lance Yang <ioworker0@...il.com>
---
 mm/madvise.c | 69 +++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 58 insertions(+), 11 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index cfa5e7288261..bcbf56595a2e 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -676,11 +676,43 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 		 */
 		if (folio_test_large(folio)) {
 			int err;
+			unsigned long next_addr, align;
 
-			if (folio_estimated_sharers(folio) != 1)
-				break;
-			if (!folio_trylock(folio))
-				break;
+			align = folio_nr_pages(folio) * PAGE_SIZE;
+			next_addr = ALIGN_DOWN(addr + align, align);
+
+			if (folio_estimated_sharers(folio) != 1 ||
+			    !folio_trylock(folio))
+				goto skip_large_folio;
+
+			/*
+			 * If we mark only the subpages as lazyfree,
+			 * split the large folio.
+			 */
+			if (next_addr > end || next_addr - addr != align)
+				goto split_large_folio;
+
+			/*
+			 * Avoid unnecessary folio splitting if the large
+			 * folio is entirely within the given range.
+			 */
+			folio_test_clear_dirty(folio);
+			folio_unlock(folio);
+			for (; addr != next_addr; pte++, addr += PAGE_SIZE) {
+				ptent = ptep_get(pte);
+				if (pte_young(ptent) || pte_dirty(ptent)) {
+					ptent = ptep_get_and_clear_full(
+						mm, addr, pte, tlb->fullmm);
+					ptent = pte_mkold(ptent);
+					ptent = pte_mkclean(ptent);
+					set_pte_at(mm, addr, pte, ptent);
+					tlb_remove_tlb_entry(tlb, pte, addr);
+				}
+			}
+			folio_mark_lazyfree(folio);
+			goto next_folio;
+
+split_large_folio:
 			folio_get(folio);
 			arch_leave_lazy_mmu_mode();
 			pte_unmap_unlock(start_pte, ptl);
@@ -688,13 +720,28 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 			err = split_folio(folio);
 			folio_unlock(folio);
 			folio_put(folio);
-			if (err)
-				break;
-			start_pte = pte =
-				pte_offset_map_lock(mm, pmd, addr, &ptl);
-			if (!start_pte)
-				break;
-			arch_enter_lazy_mmu_mode();
+
+			/*
+			 * If the large folio is locked before madvise(MADV_FREE)
+			 * or cannot be split, we just skip it.
+			 */
+			if (err) {
+skip_large_folio:
+				if (next_addr >= end)
+					break;
+				pte += (next_addr - addr) / PAGE_SIZE;
+				addr = next_addr;
+			}
+
+			if (!start_pte) {
+				start_pte = pte = pte_offset_map_lock(
+					mm, pmd, addr, &ptl);
+				if (!start_pte)
+					break;
+				arch_enter_lazy_mmu_mode();
+			}
+
+next_folio:
 			pte--;
 			addr -= PAGE_SIZE;
 			continue;
-- 
2.33.1