Message-ID: <d6413d17-c530-4553-9eca-dec8dce37e7e@redhat.com>
Date: Wed, 16 Jul 2025 00:08:57 +0200
From: David Hildenbrand <david@...hat.com>
To: Ankur Arora <ankur.a.arora@...cle.com>, linux-kernel@...r.kernel.org,
 linux-mm@...ck.org, x86@...nel.org
Cc: akpm@...ux-foundation.org, bp@...en8.de, dave.hansen@...ux.intel.com,
 hpa@...or.com, mingo@...hat.com, mjguzik@...il.com, luto@...nel.org,
 peterz@...radead.org, acme@...nel.org, namhyung@...nel.org,
 tglx@...utronix.de, willy@...radead.org, raghavendra.kt@....com,
 boris.ostrovsky@...cle.com, konrad.wilk@...cle.com
Subject: Re: [PATCH v5 13/14] mm: memory: support clearing page-extents

On 10.07.25 02:59, Ankur Arora wrote:
> folio_zero_user() is constrained to clear in a page-at-a-time
> fashion because it supports CONFIG_HIGHMEM which means that kernel
> mappings for pages in a folio are not guaranteed to be contiguous.
> 
> We don't have this problem when running under configurations with
> CONFIG_CLEAR_PAGE_EXTENT (implies !CONFIG_HIGHMEM), so zero in
> longer page-extents.
> This is expected to be faster because the processor can now optimize
> the clearing based on the knowledge of the extent.
> 
> However, clearing in larger chunks can have two other problems:
> 
>   - cache locality when clearing small folios (< MAX_ORDER_NR_PAGES)
>     (larger folios don't have any expectation of cache locality).
> 
>   - preemption latency when clearing large folios.
> 
> Handle the first by splitting the clearing in three parts: the
> faulting page and its immediate locality, its left and right
> regions; the local neighbourhood is cleared last.
> 
> The second problem is relevant only when running under cooperative
> preemption models. Limit the worst case preemption latency by clearing
> in architecture specified ARCH_CLEAR_PAGE_EXTENT units.
> 
> Signed-off-by: Ankur Arora <ankur.a.arora@...cle.com>
> ---
>   mm/memory.c | 86 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
>   1 file changed, 85 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index b0cda5aab398..c52806270375 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -7034,6 +7034,7 @@ static inline int process_huge_page(
>   	return 0;
>   }
>   
> +#ifndef CONFIG_CLEAR_PAGE_EXTENT
>   static void clear_gigantic_page(struct folio *folio, unsigned long addr_hint,
>   				unsigned int nr_pages)
>   {
> @@ -7058,7 +7059,10 @@ static int clear_subpage(unsigned long addr, int idx, void *arg)
>   /**
>    * folio_zero_user - Zero a folio which will be mapped to userspace.
>    * @folio: The folio to zero.
> - * @addr_hint: The address will be accessed or the base address if uncelar.
> + * @addr_hint: The address accessed by the user or the base address.
> + *
> + * folio_zero_user() uses clear_gigantic_page() or process_huge_page() to
> + * do page-at-a-time zeroing because it needs to handle CONFIG_HIGHMEM.
>    */
>   void folio_zero_user(struct folio *folio, unsigned long addr_hint)
>   {
> @@ -7070,6 +7074,86 @@ void folio_zero_user(struct folio *folio, unsigned long addr_hint)
>   		process_huge_page(addr_hint, nr_pages, clear_subpage, folio);
>   }
>   
> +#else /* CONFIG_CLEAR_PAGE_EXTENT */
> +
> +static void clear_pages_resched(void *addr, int npages)
> +{
> +	int i, remaining;
> +
> +	if (preempt_model_preemptible()) {
> +		clear_pages(addr, npages);
> +		goto out;
> +	}
> +
> +	for (i = 0; i < npages/ARCH_CLEAR_PAGE_EXTENT; i++) {
> +		clear_pages(addr + i * ARCH_CLEAR_PAGE_EXTENT * PAGE_SIZE,
> +			    ARCH_CLEAR_PAGE_EXTENT);
> +		cond_resched();
> +	}
> +
> +	remaining = npages % ARCH_CLEAR_PAGE_EXTENT;
> +
> +	if (remaining)
> +		clear_pages(addr + i * ARCH_CLEAR_PAGE_EXTENT * PAGE_SIZE,
> +			    remaining);
> +out:
> +	cond_resched();
> +}
> +
> +/*
> + * folio_zero_user - Zero a folio which will be mapped to userspace.
> + * @folio: The folio to zero.
> + * @addr_hint: The address accessed by the user or the base address.
> + *
> + * Uses architectural support for clear_pages() to zero page extents
> + * instead of clearing page-at-a-time.
> + *
> + * Clearing of small folios (< MAX_ORDER_NR_PAGES) is split in three parts:
> + * pages in the immediate locality of the faulting page, and its left, right
> + * regions; the local neighbourhood cleared last in order to keep cache
> + * lines of the target region hot.
> + *
> + * For larger folios we assume that there is no expectation of cache locality
> + * and just do a straight zero.
> + */
> +void folio_zero_user(struct folio *folio, unsigned long addr_hint)
> +{
> +	unsigned long base_addr = ALIGN_DOWN(addr_hint, folio_size(folio));
> +	const long fault_idx = (addr_hint - base_addr) / PAGE_SIZE;
> +	const struct range pg = DEFINE_RANGE(0, folio_nr_pages(folio) - 1);
> +	const int width = 2; /* number of pages cleared last on either side */
> +	struct range r[3];
> +	int i;
> +
> +	if (folio_nr_pages(folio) > MAX_ORDER_NR_PAGES) {
> +		clear_pages_resched(page_address(folio_page(folio, 0)), folio_nr_pages(folio));
> +		return;
> +	}
> +
> +	/*
> +	 * Faulting page and its immediate neighbourhood. Cleared at the end to
> +	 * ensure it sticks around in the cache.
> +	 */
> +	r[2] = DEFINE_RANGE(clamp_t(s64, fault_idx - width, pg.start, pg.end),
> +			    clamp_t(s64, fault_idx + width, pg.start, pg.end));
> +
> +	/* Region to the left of the fault */
> +	r[1] = DEFINE_RANGE(pg.start,
> +			    clamp_t(s64, r[2].start-1, pg.start-1, r[2].start));
> +
> +	/* Region to the right of the fault: always valid for the common fault_idx=0 case. */
> +	r[0] = DEFINE_RANGE(clamp_t(s64, r[2].end+1, r[2].end, pg.end+1),
> +			    pg.end);
> +
> +	for (i = 0; i <= 2; i++) {
> +		int npages = range_len(&r[i]);
> +
> +		if (npages > 0)
> +			clear_pages_resched(page_address(folio_page(folio, r[i].start)), npages);
> +	}
> +}
> +#endif /* CONFIG_CLEAR_PAGE_EXTENT */
> +
>   static int copy_user_gigantic_page(struct folio *dst, struct folio *src,
>   				   unsigned long addr_hint,
>   				   struct vm_area_struct *vma,

So, folio_zero_user() is only compiled with THP | HUGETLB already.

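That is (IIRC) the existing

	#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
	...
	#endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS */

block in mm/memory.c, so there is already a build-time gate around all of this.
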
What we should probably do is scrap the whole new kconfig option and
do something like this in here:

diff --git a/mm/memory.c b/mm/memory.c
index 3dd6c57e6511e..64b6bd3e7657a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -7009,19 +7009,53 @@ static inline int process_huge_page(
  	return 0;
  }
  
-static void clear_gigantic_page(struct folio *folio, unsigned long addr_hint,
-				unsigned int nr_pages)
+#ifdef CONFIG_ARCH_HAS_CLEAR_PAGES
+static void clear_user_highpages_resched(struct page *page,
+		unsigned int nr_pages, unsigned long addr)
+{
+	void *kaddr = page_address(page);
+	int i, remaining;
+
+	/*
+	 * CONFIG_ARCH_HAS_CLEAR_PAGES is not expected to be set on systems
+	 * with HIGHMEM, so we can safely use clear_pages().
+	 */
+	BUILD_BUG_ON(IS_ENABLED(CONFIG_HIGHMEM));
+
+	if (preempt_model_preemptible()) {
+		clear_pages(kaddr, nr_pages);
+		goto out;
+	}
+
+	for (i = 0; i < nr_pages / ARCH_CLEAR_PAGE_EXTENT; i++) {
+		clear_pages(kaddr + i * ARCH_CLEAR_PAGE_EXTENT * PAGE_SIZE,
+			    ARCH_CLEAR_PAGE_EXTENT);
+		cond_resched();
+	}
+
+	remaining = nr_pages % ARCH_CLEAR_PAGE_EXTENT;
+
+	if (remaining)
+		clear_pages(kaddr + i * ARCH_CLEAR_PAGE_EXTENT * PAGE_SIZE,
+			    remaining);
+out:
+	cond_resched();
+}
+#else
+static void clear_user_highpages_resched(struct page *page,
+		unsigned int nr_pages, unsigned long addr)
  {
-	unsigned long addr = ALIGN_DOWN(addr_hint, folio_size(folio));
  	int i;
  
  	might_sleep();
  	for (i = 0; i < nr_pages; i++) {
  		cond_resched();
-		clear_user_highpage(folio_page(folio, i), addr + i * PAGE_SIZE);
+		clear_user_highpage(nth_page(page, i), addr + i * PAGE_SIZE);
  	}
  }
  
+#endif /* CONFIG_ARCH_HAS_CLEAR_PAGES */
+
  static int clear_subpage(unsigned long addr, int idx, void *arg)
  {
  	struct folio *folio = arg;
@@ -7030,19 +7064,76 @@ static int clear_subpage(unsigned long addr, int idx, void *arg)
  	return 0;
  }
  
-/**
+static void folio_zero_user_huge(struct folio *folio, unsigned long addr_hint)
+{
+	const unsigned int nr_pages = folio_nr_pages(folio);
+	const unsigned long addr = ALIGN_DOWN(addr_hint, nr_pages * PAGE_SIZE);
+	const long fault_idx = (addr_hint - addr) / PAGE_SIZE;
+	const struct range pg = DEFINE_RANGE(0, nr_pages - 1);
+	const int width = 2; /* number of pages cleared last on either side */
+	struct range r[3];
+	int i;
+
+	/*
+	 * Without an optimized clear_user_highpages_resched(), we'll perform
+	 * some extra magic dance around the faulting address.
+	 */
+	if (!IS_ENABLED(CONFIG_ARCH_HAS_CLEAR_PAGES)) {
+		process_huge_page(addr_hint, nr_pages, clear_subpage, folio);
+		return;
+	}
+
+	/*
+	 * Faulting page and its immediate neighbourhood. Cleared at the end to
+	 * ensure it sticks around in the cache.
+	 */
+	r[2] = DEFINE_RANGE(clamp_t(s64, fault_idx - width, pg.start, pg.end),
+			    clamp_t(s64, fault_idx + width, pg.start, pg.end));
+
+	/* Region to the left of the fault */
+	r[1] = DEFINE_RANGE(pg.start,
+			    clamp_t(s64, r[2].start-1, pg.start-1, r[2].start));
+
+	/* Region to the right of the fault: always valid for the common fault_idx=0 case. */
+	r[0] = DEFINE_RANGE(clamp_t(s64, r[2].end+1, r[2].end, pg.end+1),
+			    pg.end);
+
+	for (i = 0; i <= 2; i++) {
+		unsigned int cur_nr_pages = range_len(&r[i]);
+		struct page *cur_page = folio_page(folio, r[i].start);
+		unsigned long cur_addr = addr + folio_page_idx(folio, cur_page) * PAGE_SIZE;
+
+		if (cur_nr_pages > 0)
+			clear_user_highpages_resched(cur_page, cur_nr_pages, cur_addr);
+	}
+}
+
+/*
   * folio_zero_user - Zero a folio which will be mapped to userspace.
   * @folio: The folio to zero.
- * @addr_hint: The address will be accessed or the base address if uncelar.
+ * @addr_hint: The address accessed by the user or the base address.
+ *
+ * Uses architectural support for clear_pages() to zero page extents
+ * instead of clearing page-at-a-time.
+ *
+ * Clearing of small folios (< MAX_ORDER_NR_PAGES) is split in three parts:
+ * pages in the immediate locality of the faulting page, and its left, right
+ * regions; the local neighbourhood cleared last in order to keep cache
+ * lines of the target region hot.
+ *
+ * For larger folios we assume that there is no expectation of cache locality
+ * and just do a straight zero.
   */
  void folio_zero_user(struct folio *folio, unsigned long addr_hint)
  {
-	unsigned int nr_pages = folio_nr_pages(folio);
+	const unsigned int nr_pages = folio_nr_pages(folio);
+	const unsigned long addr = ALIGN_DOWN(addr_hint, nr_pages * PAGE_SIZE);
  
-	if (unlikely(nr_pages > MAX_ORDER_NR_PAGES))
-		clear_gigantic_page(folio, addr_hint, nr_pages);
-	else
-		process_huge_page(addr_hint, nr_pages, clear_subpage, folio);
+	if (unlikely(nr_pages >= MAX_ORDER_NR_PAGES)) {
+		clear_user_highpages_resched(folio_page(folio, 0), nr_pages, addr);
+		return;
+	}
+	folio_zero_user_huge(folio, addr_hint);
  }
  
  static int copy_user_gigantic_page(struct folio *dst, struct folio *src,
-- 
2.50.1



Note that this is probably completely broken in various ways; it's just to give
you an idea.

*maybe* we could change clear_user_highpages_resched to something like
folio_zero_user_range(), consuming a folio + idx instead of a page. That might
or might not be better here.
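
Roughly something like this (untested sketch, the name and the index
arithmetic are made up, just to illustrate the interface I mean):

static void folio_zero_user_range(struct folio *folio, unsigned long idx,
		unsigned int nr_pages, unsigned long base_addr)
{
	/* Derive the page and the user address from folio + idx. */
	struct page *page = folio_page(folio, idx);
	unsigned long addr = base_addr + idx * PAGE_SIZE;

	clear_user_highpages_resched(page, nr_pages, addr);
}

The callers in folio_zero_user()/folio_zero_user_huge() would then just pass
r[i].start (or 0) instead of doing the folio_page()/folio_page_idx() dance
themselves.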

-- 
Cheers,

David / dhildenb

