lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <80341f56-6f8f-4e41-8f80-f1863256c662@lucifer.local>
Date: Tue, 14 Oct 2025 13:36:46 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Lance Yang <lance.yang@...ux.dev>
Cc: akpm@...ux-foundation.org, david@...hat.com, ziy@...dia.com,
        baolin.wang@...ux.alibaba.com, Liam.Howlett@...cle.com,
        npache@...hat.com, ryan.roberts@....com, dev.jain@....com,
        baohua@...nel.org, ioworker0@...il.com, richard.weiyang@...il.com,
        linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [PATCH mm-new v3 3/3] mm/khugepaged: merge PTE scanning logic
 into a new helper

On Wed, Oct 08, 2025 at 12:37:48PM +0800, Lance Yang wrote:
> From: Lance Yang <lance.yang@...ux.dev>
>
> As David suggested, the PTE scanning logic in hpage_collapse_scan_pmd()
> and __collapse_huge_page_isolate() was almost duplicated.
>
> This patch cleans things up by moving all the common PTE checking logic
> into a new shared helper, thp_collapse_check_pte(). While at it, we use
> vm_normal_folio() instead of vm_normal_page().
>
> Suggested-by: David Hildenbrand <david@...hat.com>
> Suggested-by: Dev Jain <dev.jain@....com>
> Signed-off-by: Lance Yang <lance.yang@...ux.dev>

In general I like the idea, the implementation needs work though.

Will check in more detail on respin

> ---
>  mm/khugepaged.c | 243 ++++++++++++++++++++++++++----------------------
>  1 file changed, 130 insertions(+), 113 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index b5c0295c3414..7116caae1fa4 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -61,6 +61,12 @@ enum scan_result {
>  	SCAN_PAGE_FILLED,
>  };
>
> +enum pte_check_result {
> +	PTE_CHECK_SUCCEED,
> +	PTE_CHECK_CONTINUE,

Don't love this logic - this feels like we're essentially abstracting
control flow, I mean what does 'continue' mean here? Other than you're in a
loop and please continue, which is relying a little too much on what the
caller does.

if we retain this logic something like PTE_CHECK_SKIP would make more sense.


> +	PTE_CHECK_FAIL,
> +};
> +
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/huge_memory.h>
>
> @@ -533,62 +539,139 @@ static void release_pte_pages(pte_t *pte, pte_t *_pte,
>  	}
>  }
>
> +/*
> + * thp_collapse_check_pte - Check if a PTE is suitable for THP collapse
> + * @pte:           The PTE to check
> + * @vma:           The VMA the PTE belongs to
> + * @addr:          The virtual address corresponding to this PTE
> + * @foliop:        On success, used to return a pointer to the folio
> + *                 Must be non-NULL
> + * @none_or_zero:  Counter for none/zero PTEs. Must be non-NULL
> + * @unmapped:      Counter for swap PTEs. Can be NULL if not scanning swaps
> + * @shared:        Counter for shared pages. Must be non-NULL
> + * @scan_result:   Used to return the failure reason (SCAN_*) on a
> + *                 PTE_CHECK_FAIL return. Must be non-NULL
> + * @cc:            Collapse control settings

Do we really need this many parameters? THis is hard to follow.

Of course it being me, I'd immediately prefer a helper struct :)

> + *
> + * Returns:
> + *   PTE_CHECK_SUCCEED  - PTE is suitable, proceed with further checks
> + *   PTE_CHECK_CONTINUE - Skip this PTE and continue scanning
> + *   PTE_CHECK_FAIL     - Abort collapse scan
> + */
> +static inline int thp_collapse_check_pte(pte_t pte, struct vm_area_struct *vma,

There's no need for inline in an internal static function in a file.

> +		unsigned long addr, struct folio **foliop, int *none_or_zero,
> +		int *unmapped, int *shared, int *scan_result,
> +		struct collapse_control *cc)
> +{
> +	struct folio *folio = NULL;
> +
> +	if (pte_none(pte) || is_zero_pfn(pte_pfn(pte))) {
> +		(*none_or_zero)++;
> +		if (!userfaultfd_armed(vma) &&
> +		    (!cc->is_khugepaged ||
> +		     *none_or_zero <= khugepaged_max_ptes_none)) {
> +			return PTE_CHECK_CONTINUE;
> +		} else {
> +			*scan_result = SCAN_EXCEED_NONE_PTE;
> +			count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> +			return PTE_CHECK_FAIL;
> +		}
> +	} else if (!pte_present(pte)) {

You're now making the if-else issues with previous patches worse by
returning which gets even checkpatch to warn you.

	if (pte_none(pte) || is_zero_pfn(pte_pfn(pte))) {

(of course note that I am not convinced you can drop the pte_present()
check here)

		(*none_or_zero)++;
		if (!userfaultfd_armed(vma) &&
		    (!cc->is_khugepaged ||
		     *none_or_zero <= khugepaged_max_ptes_none))
			return PTE_CHECK_CONTINUE;

		*scan_result = SCAN_EXCEED_NONE_PTE;
		count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
		return PTE_CHECK_FAIL;
	}

	if (!pte_present(pte)) {
		...
	}

But even that really needs seriously more refactoring - that condition
above is horrible for instance so, e.g.:

	if (pte_none(pte) || is_zero_pfn(pte_pfn(pte))) {
		(*none_or_zero)++;

		/* Cases where this is acceptable. */
		if (!userfaultfd_armed(vma))
			return PTE_CHECK_SKIP;
		if (!cc->is_khugepaged)
			return PTE_CHECK_SKIP;
		if (*none_or_zero <= khugepaged_max_ptes_none)
			return PTE_CHECK_SKIP;

		/* Otherwise, we must abort the scan. */
		*scan_result = SCAN_EXCEED_NONE_PTE;
		count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
		return PTE_CHECK_FAIL;
	}

	if (!pte_present(pte)) {
		...
	}

Improves things a lot.

I do however _hate_ this (*blah)++; thing. A helper struct would help :)



> +		if (!unmapped) {
> +			*scan_result = SCAN_PTE_NON_PRESENT;
> +			return PTE_CHECK_FAIL;

Can't we have a helper that sets the result, e.g.:

	return pte_check_fail(scan_result, SCAN_PTE_NON_PRESENT);

static enum pte_check_result pte_check_fail(int *scan_result,
		enum pte_check_result result)
{
	*scan_result = result;
	return PTE_CHECK_FAIL;
}

And same for success/skip

Then all of these horrible if (blah) { *foo = bar; return baz; } can be
made into

	if (blah)
		return pte_check_xxx(..., SCAN_PTE_...);

> +		}
> +
> +		if (non_swap_entry(pte_to_swp_entry(pte))) {
> +			*scan_result = SCAN_PTE_NON_PRESENT;
> +			return PTE_CHECK_FAIL;
> +		}
> +
> +		(*unmapped)++;
> +		if (!cc->is_khugepaged ||
> +		    *unmapped <= khugepaged_max_ptes_swap) {
> +			/*
> +			 * Always be strict with uffd-wp enabled swap
> +			 * entries. Please see comment below for
> +			 * pte_uffd_wp().
> +			 */
> +			if (pte_swp_uffd_wp(pte)) {
> +				*scan_result = SCAN_PTE_UFFD_WP;
> +				return PTE_CHECK_FAIL;
> +			}
> +			return PTE_CHECK_CONTINUE;
> +		} else {
> +			*scan_result = SCAN_EXCEED_SWAP_PTE;
> +			count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
> +			return PTE_CHECK_FAIL;
> +		}
> +	} else if (pte_uffd_wp(pte)) {
> +		/*
> +		 * Don't collapse the page if any of the small PTEs are
> +		 * armed with uffd write protection. Here we can also mark
> +		 * the new huge pmd as write protected if any of the small
> +		 * ones is marked but that could bring unknown userfault
> +		 * messages that falls outside of the registered range.
> +		 * So, just be simple.
> +		 */
> +		*scan_result = SCAN_PTE_UFFD_WP;
> +		return PTE_CHECK_FAIL;
> +	}
> +

Again as all of the comments for previous series still apply here so
obviously I don't like this as-is :)

And as Andrew has pointed out, now checkpatch is finding the 'pointless
else' issue (as mentioned above also).

> +	folio = vm_normal_folio(vma, addr, pte);
> +	if (unlikely(!folio) || unlikely(folio_is_zone_device(folio))) {
> +		*scan_result = SCAN_PAGE_NULL;
> +		return PTE_CHECK_FAIL;
> +	}
> +
> +	if (!folio_test_anon(folio)) {
> +		VM_WARN_ON_FOLIO(true, folio);
> +		*scan_result = SCAN_PAGE_ANON;
> +		return PTE_CHECK_FAIL;
> +	}
> +
> +	/*
> +	 * We treat a single page as shared if any part of the THP
> +	 * is shared.
> +	 */
> +	if (folio_maybe_mapped_shared(folio)) {
> +		(*shared)++;
> +		if (cc->is_khugepaged && *shared > khugepaged_max_ptes_shared) {
> +			*scan_result = SCAN_EXCEED_SHARED_PTE;
> +			count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
> +			return PTE_CHECK_FAIL;
> +		}
> +	}
> +
> +	*foliop = folio;
> +
> +	return PTE_CHECK_SUCCEED;
> +}
> +
>  static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  					unsigned long start_addr,
>  					pte_t *pte,
>  					struct collapse_control *cc,
>  					struct list_head *compound_pagelist)
>  {
> -	struct page *page = NULL;
>  	struct folio *folio = NULL;
>  	unsigned long addr = start_addr;
>  	pte_t *_pte;
>  	int none_or_zero = 0, shared = 0, result = SCAN_FAIL, referenced = 0;
> +	int pte_check_res;
>
>  	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
>  	     _pte++, addr += PAGE_SIZE) {
>  		pte_t pteval = ptep_get(_pte);
> -		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> -			++none_or_zero;
> -			if (!userfaultfd_armed(vma) &&
> -			    (!cc->is_khugepaged ||
> -			     none_or_zero <= khugepaged_max_ptes_none)) {
> -				continue;
> -			} else {
> -				result = SCAN_EXCEED_NONE_PTE;
> -				count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> -				goto out;
> -			}
> -		} else if (!pte_present(pteval)) {
> -			result = SCAN_PTE_NON_PRESENT;
> -			goto out;
> -		} else if (pte_uffd_wp(pteval)) {
> -			result = SCAN_PTE_UFFD_WP;
> -			goto out;
> -		}
> -		page = vm_normal_page(vma, addr, pteval);
> -		if (unlikely(!page) || unlikely(is_zone_device_page(page))) {
> -			result = SCAN_PAGE_NULL;
> -			goto out;
> -		}
>
> -		folio = page_folio(page);
> -		if (!folio_test_anon(folio)) {
> -			VM_WARN_ON_FOLIO(true, folio);
> -			result = SCAN_PAGE_ANON;
> -			goto out;
> -		}
> +		pte_check_res = thp_collapse_check_pte(pteval, vma, addr,
> +					&folio, &none_or_zero, NULL, &shared,
> +					&result, cc);
>
> -		/* See hpage_collapse_scan_pmd(). */
> -		if (folio_maybe_mapped_shared(folio)) {
> -			++shared;
> -			if (cc->is_khugepaged &&
> -			    shared > khugepaged_max_ptes_shared) {
> -				result = SCAN_EXCEED_SHARED_PTE;
> -				count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
> -				goto out;
> -			}
> -		}
> +		if (pte_check_res == PTE_CHECK_CONTINUE)
> +			continue;
> +		else if (pte_check_res == PTE_CHECK_FAIL)
> +			goto out;
>
>  		if (folio_test_large(folio)) {
>  			struct folio *f;
> @@ -1264,11 +1347,11 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>  	pte_t *pte, *_pte;
>  	int result = SCAN_FAIL, referenced = 0;
>  	int none_or_zero = 0, shared = 0;
> -	struct page *page = NULL;
>  	struct folio *folio = NULL;
>  	unsigned long addr;
>  	spinlock_t *ptl;
>  	int node = NUMA_NO_NODE, unmapped = 0;
> +	int pte_check_res;
>
>  	VM_BUG_ON(start_addr & ~HPAGE_PMD_MASK);
>
> @@ -1287,81 +1370,15 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>  	for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
>  	     _pte++, addr += PAGE_SIZE) {
>  		pte_t pteval = ptep_get(_pte);
> -		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> -			++none_or_zero;
> -			if (!userfaultfd_armed(vma) &&
> -			    (!cc->is_khugepaged ||
> -			     none_or_zero <= khugepaged_max_ptes_none)) {
> -				continue;
> -			} else {
> -				result = SCAN_EXCEED_NONE_PTE;
> -				count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> -				goto out_unmap;
> -			}
> -		} else if (!pte_present(pteval)) {
> -			if (non_swap_entry(pte_to_swp_entry(pteval))) {
> -				result = SCAN_PTE_NON_PRESENT;
> -				goto out_unmap;
> -			}
>
> -			++unmapped;
> -			if (!cc->is_khugepaged ||
> -			    unmapped <= khugepaged_max_ptes_swap) {
> -				/*
> -				 * Always be strict with uffd-wp
> -				 * enabled swap entries.  Please see
> -				 * comment below for pte_uffd_wp().
> -				 */
> -				if (pte_swp_uffd_wp(pteval)) {
> -					result = SCAN_PTE_UFFD_WP;
> -					goto out_unmap;
> -				}
> -				continue;
> -			} else {
> -				result = SCAN_EXCEED_SWAP_PTE;
> -				count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
> -				goto out_unmap;
> -			}
> -		} else if (pte_uffd_wp(pteval)) {
> -			/*
> -			 * Don't collapse the page if any of the small
> -			 * PTEs are armed with uffd write protection.
> -			 * Here we can also mark the new huge pmd as
> -			 * write protected if any of the small ones is
> -			 * marked but that could bring unknown
> -			 * userfault messages that falls outside of
> -			 * the registered range.  So, just be simple.
> -			 */
> -			result = SCAN_PTE_UFFD_WP;
> -			goto out_unmap;
> -		}
> +		pte_check_res = thp_collapse_check_pte(pteval, vma, addr,
> +					&folio, &none_or_zero, &unmapped,
> +					&shared, &result, cc);
>
> -		page = vm_normal_page(vma, addr, pteval);
> -		if (unlikely(!page) || unlikely(is_zone_device_page(page))) {
> -			result = SCAN_PAGE_NULL;
> -			goto out_unmap;
> -		}
> -		folio = page_folio(page);
> -
> -		if (!folio_test_anon(folio)) {
> -			VM_WARN_ON_FOLIO(true, folio);
> -			result = SCAN_PAGE_ANON;
> +		if (pte_check_res == PTE_CHECK_CONTINUE)
> +			continue;
> +		else if (pte_check_res == PTE_CHECK_FAIL)
>  			goto out_unmap;
> -		}
> -
> -		/*
> -		 * We treat a single page as shared if any part of the THP
> -		 * is shared.
> -		 */
> -		if (folio_maybe_mapped_shared(folio)) {
> -			++shared;
> -			if (cc->is_khugepaged &&
> -			    shared > khugepaged_max_ptes_shared) {
> -				result = SCAN_EXCEED_SHARED_PTE;
> -				count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
> -				goto out_unmap;
> -			}
> -		}
>
>  		/*
>  		 * Record which node the original page is from and save this
> --
> 2.49.0
>

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ