Message-ID: <77paexkefc7qkfjgv6reuf7jxlysgkinuswsck5tthqkpcjkpr@aelvplwvafnt>
Date: Thu, 18 Dec 2025 11:27:24 +0800
From: Vernon Yang <vernon2gm@...il.com>
To: Wei Yang <richard.weiyang@...il.com>
Cc: akpm@...ux-foundation.org, david@...nel.org, 
	lorenzo.stoakes@...cle.com, ziy@...dia.com, baohua@...nel.org, lance.yang@...ux.dev, 
	linux-mm@...ck.org, linux-kernel@...r.kernel.org, 
	Vernon Yang <yanglincheng@...inos.cn>
Subject: Re: [PATCH 2/4] mm: khugepaged: remove mm when all memory has been
 collapsed

On Wed, Dec 17, 2025 at 03:31:55AM +0000, Wei Yang wrote:
> On Mon, Dec 15, 2025 at 05:04:17PM +0800, Vernon Yang wrote:
> >The following data was traced with bpftrace on a desktop system. After
> >the system had been left idle for 10 minutes after booting, a large
> >number of SCAN_PMD_MAPPED or SCAN_PMD_NONE results were observed during
> >a full scan by khugepaged.
> >
> >@scan_pmd_status[1]: 1           ## SCAN_SUCCEED
> >@scan_pmd_status[4]: 158         ## SCAN_PMD_MAPPED
> >@scan_pmd_status[3]: 174         ## SCAN_PMD_NONE
> >total progress size: 701 MB
> >Total time         : 440 seconds ## include khugepaged_scan_sleep_millisecs
> >
> >The khugepaged_scan list holds every task whose memory may be collapsed
> >into hugepages, and as long as the task is not destroyed, khugepaged
> >never removes it from the khugepaged_scan list. This leads to a
> >situation where a task has already had all of its memory regions
> >collapsed into hugepages, yet khugepaged keeps scanning it, wasting CPU
> >time to no effect. Because of khugepaged_scan_sleep_millisecs (default
> >10s), scanning a large number of such pointless tasks also delays the
> >scan of tasks that still have memory worth collapsing.
> >
> >After applying this patch, when all memory is either SCAN_PMD_MAPPED or
> >SCAN_PMD_NONE, the mm is automatically removed from khugepaged's scan
> >list. If the mm later takes a page fault or MADV_HUGEPAGE is applied
> >again, it is added back to khugepaged.
>
> Two things come to my mind:
>
>   * what happens if we split the huge page under memory pressure?

static unsigned int shrink_folio_list(struct list_head *folio_list,
		struct pglist_data *pgdat, struct scan_control *sc,
		struct reclaim_stat *stat, bool ignore_references,
		struct mem_cgroup *memcg)
{
	...

	folio = lru_to_folio(folio_list);

	...

	references = folio_check_references(folio, sc);
	switch (references) {
	case FOLIOREF_ACTIVATE:
		goto activate_locked;
	case FOLIOREF_KEEP:
		stat->nr_ref_keep += nr_pages;
		goto keep_locked;
	case FOLIOREF_RECLAIM:
	case FOLIOREF_RECLAIM_CLEAN:
		; /* try to reclaim the folio below */
	}

	...

	split_folio_to_list(folio, folio_list);
}

During memory reclaim above, only inactive folios are split. This also
implies that the folio is cold, i.e. it has not been used recently, so
we do not expect to put the mm back onto the khugepaged scan list just
to keep scanning/collapsing it. khugepaged should prioritize scanning
and collapsing hot folios as much as possible to avoid wasting CPU.
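
If the region does become hot again later, the existing fault path puts
the mm back onto the scan list anyway. A simplified sketch based on the
current tree (exact helper names and signatures may differ between
kernel versions, and the eligibility checks are elided here); as far as
I can see the MADV_HUGEPAGE path (hugepage_madvise()) ends up in the
same helper:

/* mm/huge_memory.c: anonymous PMD fault re-enters khugepaged */
vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
{
	...
	khugepaged_enter_vma(vmf->vma, vmf->vma->vm_flags);
	...
}

/* mm/khugepaged.c: re-add the mm if it is not tracked yet */
void khugepaged_enter_vma(struct vm_area_struct *vma, vm_flags_t vm_flags)
{
	/* the real code also checks that the vma allows PMD-sized THP */
	if (!mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm))
		__khugepaged_enter(vma->vm_mm);
}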

>   * would this interfere with mTHP collapse?

It has no impact on mTHP collapse: the mm is removed automatically only
when all memory is either SCAN_PMD_MAPPED or SCAN_PMD_NONE; in all
other cases it is not removed.

Please let me know if I missed something, thanks!

>
> >
> >Signed-off-by: Vernon Yang <yanglincheng@...inos.cn>
> >---
> > mm/khugepaged.c | 35 +++++++++++++++++++++++++----------
> > 1 file changed, 25 insertions(+), 10 deletions(-)
> >
> >diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> >index 0598a19a98cc..1ec1af5be3c8 100644
> >--- a/mm/khugepaged.c
> >+++ b/mm/khugepaged.c
> >@@ -115,6 +115,7 @@ struct khugepaged_scan {
> > 	struct list_head mm_head;
> > 	struct mm_slot *mm_slot;
> > 	unsigned long address;
> >+	bool maybe_collapse;
> > };
> >
> > static struct khugepaged_scan khugepaged_scan = {
> >@@ -1420,22 +1421,19 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
> > 	return result;
> > }
> >
> >-static void collect_mm_slot(struct mm_slot *slot)
> >+static void collect_mm_slot(struct mm_slot *slot, bool maybe_collapse)
> > {
> > 	struct mm_struct *mm = slot->mm;
> >
> > 	lockdep_assert_held(&khugepaged_mm_lock);
> >
> >-	if (hpage_collapse_test_exit(mm)) {
> >+	if (hpage_collapse_test_exit(mm) || !maybe_collapse) {
> > 		/* free mm_slot */
> > 		hash_del(&slot->hash);
> > 		list_del(&slot->mm_node);
> >
> >-		/*
> >-		 * Not strictly needed because the mm exited already.
> >-		 *
> >-		 * mm_flags_clear(MMF_VM_HUGEPAGE, mm);
> >-		 */
> >+		if (!maybe_collapse)
> >+			mm_flags_clear(MMF_VM_HUGEPAGE, mm);
> >
> > 		/* khugepaged_mm_lock actually not necessary for the below */
> > 		mm_slot_free(mm_slot_cache, slot);
> >@@ -2397,6 +2395,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> > 				     struct mm_slot, mm_node);
> > 		khugepaged_scan.address = 0;
> > 		khugepaged_scan.mm_slot = slot;
> >+		khugepaged_scan.maybe_collapse = false;
> > 	}
> > 	spin_unlock(&khugepaged_mm_lock);
> >
> >@@ -2470,8 +2469,18 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> > 					khugepaged_scan.address, &mmap_locked, cc);
> > 			}
> >
> >-			if (*result == SCAN_SUCCEED)
> >+			switch (*result) {
> >+			case SCAN_PMD_NULL:
> >+			case SCAN_PMD_NONE:
> >+			case SCAN_PMD_MAPPED:
> >+			case SCAN_PTE_MAPPED_HUGEPAGE:
> >+				break;
> >+			case SCAN_SUCCEED:
> > 				++khugepaged_pages_collapsed;
> >+				fallthrough;
>
> If collapse successfully, we don't need to set maybe_collapse to true?

Above "fallthrough" explicitly tells the compiler that when the collapse is
successful, run below "khugepaged_scan.maybe_collapse = true" :)
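
If it helps, here is a minimal standalone C program showing the same
switch pattern (made-up result values, not the kernel enum):

#include <stdbool.h>
#include <stdio.h>

int main(void)
{
	int result = 1;			/* pretend: SCAN_SUCCEED */
	int collapsed = 0;
	bool maybe_collapse = false;

	switch (result) {
	case 3:				/* pretend: SCAN_PMD_NONE */
	case 4:				/* pretend: SCAN_PMD_MAPPED */
		break;
	case 1:				/* pretend: SCAN_SUCCEED */
		++collapsed;
		/* falls through, like the kernel's fallthrough macro */
	default:
		maybe_collapse = true;
	}

	/* prints: collapsed=1 maybe_collapse=1 */
	printf("collapsed=%d maybe_collapse=%d\n", collapsed, maybe_collapse);
	return 0;
}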

> >+			default:
> >+				khugepaged_scan.maybe_collapse = true;
> >+			}
> >
> > 			/* move to next address */
> > 			khugepaged_scan.address += HPAGE_PMD_SIZE;
> >@@ -2500,6 +2509,11 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> > 	 * if we scanned all vmas of this mm.
> > 	 */
> > 	if (hpage_collapse_test_exit(mm) || !vma) {
> >+		bool maybe_collapse = khugepaged_scan.maybe_collapse;
> >+
> >+		if (mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm))
> >+			maybe_collapse = true;
> >+
> > 		/*
> > 		 * Make sure that if mm_users is reaching zero while
> > 		 * khugepaged runs here, khugepaged_exit will find
> >@@ -2508,12 +2522,13 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> > 		if (!list_is_last(&slot->mm_node, &khugepaged_scan.mm_head)) {
> > 			khugepaged_scan.mm_slot = list_next_entry(slot, mm_node);
> > 			khugepaged_scan.address = 0;
> >+			khugepaged_scan.maybe_collapse = false;
> > 		} else {
> > 			khugepaged_scan.mm_slot = NULL;
> > 			khugepaged_full_scans++;
> > 		}
> >
> >-		collect_mm_slot(slot);
> >+		collect_mm_slot(slot, maybe_collapse);
> > 	}
> >
> > 	trace_mm_khugepaged_scan(mm, progress, khugepaged_scan.mm_slot == NULL);
> >@@ -2616,7 +2631,7 @@ static int khugepaged(void *none)
> > 	slot = khugepaged_scan.mm_slot;
> > 	khugepaged_scan.mm_slot = NULL;
> > 	if (slot)
> >-		collect_mm_slot(slot);
> >+		collect_mm_slot(slot, true);
> > 	spin_unlock(&khugepaged_mm_lock);
> > 	return 0;
> > }
> >--
> >2.51.0
> >
>
> --
> Wei Yang
> Help you, Help me

--
Thanks,
Vernon
