Message-ID: <6nf3cxwhij7jtfi2u6nmt4igezf754gmue5dfskn4jkfkxmjzr@7btdipzmzjuo>
Date: Fri, 30 May 2025 12:23:19 -0400
From: "Liam R. Howlett" <Liam.Howlett@...cle.com>
To: Ryan Roberts <ryan.roberts@....com>
Cc: Catalin Marinas <catalin.marinas@....com>, Will Deacon <will@...nel.org>,
        Madhavan Srinivasan <maddy@...ux.ibm.com>,
        Michael Ellerman <mpe@...erman.id.au>,
        Nicholas Piggin <npiggin@...il.com>,
        Christophe Leroy <christophe.leroy@...roup.eu>,
        "David S. Miller" <davem@...emloft.net>,
        Andreas Larsson <andreas@...sler.com>, Juergen Gross <jgross@...e.com>,
        Ajay Kaher <ajay.kaher@...adcom.com>,
        Alexey Makhalov <alexey.makhalov@...adcom.com>,
        Thomas Gleixner <tglx@...utronix.de>, Ingo Molnar <mingo@...hat.com>,
        Borislav Petkov <bp@...en8.de>,
        Dave Hansen <dave.hansen@...ux.intel.com>,
        "H. Peter Anvin" <hpa@...or.com>,
        Boris Ostrovsky <boris.ostrovsky@...cle.com>,
        "Aneesh Kumar K.V" <aneesh.kumar@...nel.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Peter Zijlstra <peterz@...radead.org>, Arnd Bergmann <arnd@...db.de>,
        David Hildenbrand <david@...hat.com>,
        Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
        Vlastimil Babka <vbabka@...e.cz>, Mike Rapoport <rppt@...nel.org>,
        Suren Baghdasaryan <surenb@...gle.com>, Michal Hocko <mhocko@...e.com>,
        Alexei Starovoitov <ast@...nel.org>,
        Andrey Ryabinin <ryabinin.a.a@...il.com>,
        linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
        linuxppc-dev@...ts.ozlabs.org, sparclinux@...r.kernel.org,
        virtualization@...ts.linux.dev, xen-devel@...ts.xenproject.org,
        linux-mm@...ck.org
Subject: Re: [RFC PATCH v1 3/6] mm: Avoid calling page allocator from
 apply_to_page_range()

* Ryan Roberts <ryan.roberts@....com> [250530 10:05]:
> Lazy mmu mode applies to the current task and permits pte modifications
> to be deferred and updated at a later time in a batch to improve
> performance. apply_to_page_range() calls its callback in lazy mmu mode
> and some of those callbacks call into the page allocator to either
> allocate or free pages.
> 
> This is problematic with CONFIG_DEBUG_PAGEALLOC because
> debug_pagealloc_[un]map_pages() calls the arch implementation of
> __kernel_map_pages() which must modify the ptes for the linear map.
> 
> There are two possibilities at this point:
> 
>  - If the arch implementation modifies the ptes directly without first
>    entering lazy mmu mode, the pte modifications may get deferred until
>    the existing lazy mmu mode is exited. This could result in taking
>    spurious faults for example.
> 
>  - If the arch implementation enters a nested lazy mmu mode before
>    modification of the ptes (many arches use apply_to_page_range()),
>    then the linear map updates will definitely be applied upon leaving
>    the inner lazy mmu mode. But because lazy mmu mode does not support
>    nesting, the remainder of the outer user is no longer in lazy mmu
>    mode and the optimization opportunity is lost.
> 
> So let's just ensure that the page allocator is never called from within
> lazy mmu mode. New "_nolazy" variants of apply_to_page_range() and
> apply_to_existing_page_range() are introduced which don't enter lazy mmu
> mode. Then users which need to call into the page allocator within their
> callback are updated to use the _nolazy variants.
> 
> Signed-off-by: Ryan Roberts <ryan.roberts@....com>
> ---
>  include/linux/mm.h |  6 ++++++
>  kernel/bpf/arena.c |  6 +++---
>  mm/kasan/shadow.c  |  2 +-
>  mm/memory.c        | 54 +++++++++++++++++++++++++++++++++++-----------
>  4 files changed, 51 insertions(+), 17 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index e51dba8398f7..11cae6ce04ff 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3743,9 +3743,15 @@ static inline bool gup_can_follow_protnone(struct vm_area_struct *vma,
>  typedef int (*pte_fn_t)(pte_t *pte, unsigned long addr, void *data);
>  extern int apply_to_page_range(struct mm_struct *mm, unsigned long address,
>  			       unsigned long size, pte_fn_t fn, void *data);
> +extern int apply_to_page_range_nolazy(struct mm_struct *mm,
> +				      unsigned long address, unsigned long size,
> +				      pte_fn_t fn, void *data);

We are removing externs as things are edited, so probably drop them
here.
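
i.e. just the plain declarations, something like:

int apply_to_page_range_nolazy(struct mm_struct *mm, unsigned long address,
			       unsigned long size, pte_fn_t fn, void *data);
int apply_to_existing_page_range_nolazy(struct mm_struct *mm,
					unsigned long address, unsigned long size,
					pte_fn_t fn, void *data);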

>  extern int apply_to_existing_page_range(struct mm_struct *mm,
>  				   unsigned long address, unsigned long size,
>  				   pte_fn_t fn, void *data);
> +extern int apply_to_existing_page_range_nolazy(struct mm_struct *mm,
> +				   unsigned long address, unsigned long size,
> +				   pte_fn_t fn, void *data);
>  
>  #ifdef CONFIG_PAGE_POISONING
>  extern void __kernel_poison_pages(struct page *page, int numpages);
> diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c
> index 0d56cea71602..ca833cfeefb7 100644
> --- a/kernel/bpf/arena.c
> +++ b/kernel/bpf/arena.c
> @@ -187,10 +187,10 @@ static void arena_map_free(struct bpf_map *map)
>  	/*
>  	 * free_vm_area() calls remove_vm_area() that calls free_unmap_vmap_area().
>  	 * It unmaps everything from vmalloc area and clears pgtables.
> -	 * Call apply_to_existing_page_range() first to find populated ptes and
> -	 * free those pages.
> +	 * Call apply_to_existing_page_range_nolazy() first to find populated
> +	 * ptes and free those pages.
>  	 */
> -	apply_to_existing_page_range(&init_mm, bpf_arena_get_kern_vm_start(arena),
> +	apply_to_existing_page_range_nolazy(&init_mm, bpf_arena_get_kern_vm_start(arena),
>  				     KERN_VM_SZ - GUARD_SZ, existing_page_cb, NULL);
>  	free_vm_area(arena->kern_vm);
>  	range_tree_destroy(&arena->rt);
> diff --git a/mm/kasan/shadow.c b/mm/kasan/shadow.c
> index d2c70cd2afb1..2325c5166c3a 100644
> --- a/mm/kasan/shadow.c
> +++ b/mm/kasan/shadow.c
> @@ -590,7 +590,7 @@ void kasan_release_vmalloc(unsigned long start, unsigned long end,
>  
>  
>  		if (flags & KASAN_VMALLOC_PAGE_RANGE)
> -			apply_to_existing_page_range(&init_mm,
> +			apply_to_existing_page_range_nolazy(&init_mm,
>  					     (unsigned long)shadow_start,
>  					     size, kasan_depopulate_vmalloc_pte,
>  					     NULL);
> diff --git a/mm/memory.c b/mm/memory.c
> index 49199410805c..24436074ce48 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2913,7 +2913,7 @@ EXPORT_SYMBOL(vm_iomap_memory);
>  static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
>  				     unsigned long addr, unsigned long end,
>  				     pte_fn_t fn, void *data, bool create,
> -				     pgtbl_mod_mask *mask)
> +				     pgtbl_mod_mask *mask, bool lazy_mmu)
>  {
>  	pte_t *pte, *mapped_pte;
>  	int err = 0;
> @@ -2933,7 +2933,8 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
>  			return -EINVAL;
>  	}
>  
> -	arch_enter_lazy_mmu_mode();
> +	if (lazy_mmu)
> +		arch_enter_lazy_mmu_mode();
>  
>  	if (fn) {
>  		do {
> @@ -2946,7 +2947,8 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
>  	}
>  	*mask |= PGTBL_PTE_MODIFIED;
>  
> -	arch_leave_lazy_mmu_mode();
> +	if (lazy_mmu)
> +		arch_leave_lazy_mmu_mode();
>  
>  	if (mm != &init_mm)
>  		pte_unmap_unlock(mapped_pte, ptl);
> @@ -2956,7 +2958,7 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
>  static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
>  				     unsigned long addr, unsigned long end,
>  				     pte_fn_t fn, void *data, bool create,
> -				     pgtbl_mod_mask *mask)
> +				     pgtbl_mod_mask *mask, bool lazy_mmu)

I am having a hard time understanding why the other lazy mmu
implementations were more self-contained, but arm has to add arguments
to generic code?

>  {
>  	pmd_t *pmd;
>  	unsigned long next;
> @@ -2983,7 +2985,7 @@ static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
>  			pmd_clear_bad(pmd);
>  		}
>  		err = apply_to_pte_range(mm, pmd, addr, next,
> -					 fn, data, create, mask);
> +					 fn, data, create, mask, lazy_mmu);
>  		if (err)
>  			break;
>  	} while (pmd++, addr = next, addr != end);
> @@ -2994,7 +2996,7 @@ static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
>  static int apply_to_pud_range(struct mm_struct *mm, p4d_t *p4d,
>  				     unsigned long addr, unsigned long end,
>  				     pte_fn_t fn, void *data, bool create,
> -				     pgtbl_mod_mask *mask)
> +				     pgtbl_mod_mask *mask, bool lazy_mmu)
>  {
>  	pud_t *pud;
>  	unsigned long next;
> @@ -3019,7 +3021,7 @@ static int apply_to_pud_range(struct mm_struct *mm, p4d_t *p4d,
>  			pud_clear_bad(pud);
>  		}
>  		err = apply_to_pmd_range(mm, pud, addr, next,
> -					 fn, data, create, mask);
> +					 fn, data, create, mask, lazy_mmu);
>  		if (err)
>  			break;
>  	} while (pud++, addr = next, addr != end);
> @@ -3030,7 +3032,7 @@ static int apply_to_pud_range(struct mm_struct *mm, p4d_t *p4d,
>  static int apply_to_p4d_range(struct mm_struct *mm, pgd_t *pgd,
>  				     unsigned long addr, unsigned long end,
>  				     pte_fn_t fn, void *data, bool create,
> -				     pgtbl_mod_mask *mask)
> +				     pgtbl_mod_mask *mask, bool lazy_mmu)
>  {
>  	p4d_t *p4d;
>  	unsigned long next;
> @@ -3055,7 +3057,7 @@ static int apply_to_p4d_range(struct mm_struct *mm, pgd_t *pgd,
>  			p4d_clear_bad(p4d);
>  		}
>  		err = apply_to_pud_range(mm, p4d, addr, next,
> -					 fn, data, create, mask);
> +					 fn, data, create, mask, lazy_mmu);
>  		if (err)
>  			break;
>  	} while (p4d++, addr = next, addr != end);
> @@ -3065,7 +3067,7 @@ static int apply_to_p4d_range(struct mm_struct *mm, pgd_t *pgd,
>  
>  static int __apply_to_page_range(struct mm_struct *mm, unsigned long addr,
>  				 unsigned long size, pte_fn_t fn,
> -				 void *data, bool create)
> +				 void *data, bool create, bool lazy_mmu)
>  {
>  	pgd_t *pgd;
>  	unsigned long start = addr, next;
> @@ -3091,7 +3093,7 @@ static int __apply_to_page_range(struct mm_struct *mm, unsigned long addr,
>  			pgd_clear_bad(pgd);
>  		}
>  		err = apply_to_p4d_range(mm, pgd, addr, next,
> -					 fn, data, create, &mask);
> +					 fn, data, create, &mask, lazy_mmu);

This is annoying.  We now have a bool, bool passed through with mask
inserted in the middle.

>  		if (err)
>  			break;
>  	} while (pgd++, addr = next, addr != end);
> @@ -3105,11 +3107,14 @@ static int __apply_to_page_range(struct mm_struct *mm, unsigned long addr,
>  /*
>   * Scan a region of virtual memory, filling in page tables as necessary
>   * and calling a provided function on each leaf page table.
> + *
> + * fn() is called in lazy mmu mode. As a result, the callback must be careful
> + * not to perform memory allocation.
>   */
>  int apply_to_page_range(struct mm_struct *mm, unsigned long addr,
>  			unsigned long size, pte_fn_t fn, void *data)
>  {
> -	return __apply_to_page_range(mm, addr, size, fn, data, true);
> +	return __apply_to_page_range(mm, addr, size, fn, data, true, true);

Please add something here to tell me what true, true means.
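
Even just annotating the call would do, e.g. (untested):

	return __apply_to_page_range(mm, addr, size, fn, data,
				     /* create = */ true, /* lazy_mmu = */ true);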

>  }
>  EXPORT_SYMBOL_GPL(apply_to_page_range);
>  
> @@ -3117,13 +3122,36 @@ EXPORT_SYMBOL_GPL(apply_to_page_range);
>   * Scan a region of virtual memory, calling a provided function on
>   * each leaf page table where it exists.
>   *
> + * fn() is called in lazy mmu mode. As a result, the callback must be careful
> + * not to perform memory allocation.
> + *
>   * Unlike apply_to_page_range, this does _not_ fill in page tables
>   * where they are absent.
>   */
>  int apply_to_existing_page_range(struct mm_struct *mm, unsigned long addr,
>  				 unsigned long size, pte_fn_t fn, void *data)
>  {
> -	return __apply_to_page_range(mm, addr, size, fn, data, false);
> +	return __apply_to_page_range(mm, addr, size, fn, data, false, true);

every..

> +}
> +
> +/*
> + * As per apply_to_page_range() but fn() is not called in lazy mmu mode.
> + */
> +int apply_to_page_range_nolazy(struct mm_struct *mm, unsigned long addr,
> +			       unsigned long size, pte_fn_t fn, void *data)
> +{
> +	return __apply_to_page_range(mm, addr, size, fn, data, true, false);

one...

> +}
> +
> +/*
> + * As per apply_to_existing_page_range() but fn() is not called in lazy mmu
> + * mode.
> + */
> +int apply_to_existing_page_range_nolazy(struct mm_struct *mm,
> +					unsigned long addr, unsigned long size,
> +					pte_fn_t fn, void *data)
> +{
> +	return __apply_to_page_range(mm, addr, size, fn, data, false, false);

adds confusion :)


These wrappers are terrible for readability and annoying for argument
lists too.

Could we do something like the pgtbl_mod_mask or zap_details and pass
through a struct or one unsigned int for create and lazy_mmu?

At least we'd have better self-documenting code in the wrappers.. and if
we ever need a third boolean, we could avoid multiplying the wrappers
again.
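
As a rough, untested sketch of the flags variant (the macro names are
made up):

#define APPLY_PGRANGE_CREATE	BIT(0)
#define APPLY_PGRANGE_LAZY_MMU	BIT(1)

static int __apply_to_page_range(struct mm_struct *mm, unsigned long addr,
				 unsigned long size, pte_fn_t fn, void *data,
				 unsigned int flags);

int apply_to_page_range(struct mm_struct *mm, unsigned long addr,
			unsigned long size, pte_fn_t fn, void *data)
{
	return __apply_to_page_range(mm, addr, size, fn, data,
				     APPLY_PGRANGE_CREATE |
				     APPLY_PGRANGE_LAZY_MMU);
}

Then each wrapper documents itself at the call site, and the internal
helpers only ever pass one argument through.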

WDYT?

Cheers,
Liam
