Message-ID: <999feffa-5d1d-42e3-bd3a-d949f2a9de9d@arm.com>
Date: Fri, 7 Nov 2025 14:59:53 +0000
From: Ryan Roberts <ryan.roberts@....com>
To: Kevin Brodsky <kevin.brodsky@....com>, linux-mm@...ck.org
Cc: linux-kernel@...r.kernel.org, Alexander Gordeev <agordeev@...ux.ibm.com>,
Andreas Larsson <andreas@...sler.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Boris Ostrovsky <boris.ostrovsky@...cle.com>, Borislav Petkov
<bp@...en8.de>, Catalin Marinas <catalin.marinas@....com>,
Christophe Leroy <christophe.leroy@...roup.eu>,
Dave Hansen <dave.hansen@...ux.intel.com>,
David Hildenbrand <david@...hat.com>, "David S. Miller"
<davem@...emloft.net>, David Woodhouse <dwmw2@...radead.org>,
"H. Peter Anvin" <hpa@...or.com>, Ingo Molnar <mingo@...hat.com>,
Jann Horn <jannh@...gle.com>, Juergen Gross <jgross@...e.com>,
"Liam R. Howlett" <Liam.Howlett@...cle.com>,
Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
Madhavan Srinivasan <maddy@...ux.ibm.com>,
Michael Ellerman <mpe@...erman.id.au>, Michal Hocko <mhocko@...e.com>,
Mike Rapoport <rppt@...nel.org>, Nicholas Piggin <npiggin@...il.com>,
Peter Zijlstra <peterz@...radead.org>, Suren Baghdasaryan
<surenb@...gle.com>, Thomas Gleixner <tglx@...utronix.de>,
Vlastimil Babka <vbabka@...e.cz>, Will Deacon <will@...nel.org>,
Yeoreum Yun <yeoreum.yun@....com>, linux-arm-kernel@...ts.infradead.org,
linuxppc-dev@...ts.ozlabs.org, sparclinux@...r.kernel.org,
xen-devel@...ts.xenproject.org, x86@...nel.org
Subject: Re: [PATCH v4 07/12] mm: enable lazy_mmu sections to nest
On 29/10/2025 10:09, Kevin Brodsky wrote:
> Despite recent efforts to prevent lazy_mmu sections from nesting, it
> remains difficult to ensure that it never occurs - and in fact it
> does occur on arm64 in certain situations (CONFIG_DEBUG_PAGEALLOC).
> Commit 1ef3095b1405 ("arm64/mm: Permit lazy_mmu_mode to be nested")
> made nesting tolerable on arm64, but without truly supporting it:
> the inner call to leave() disables the batching optimisation before
> the outer section ends.
>
> This patch actually enables lazy_mmu sections to nest by tracking
> the nesting level in task_struct, in a similar fashion to e.g.
> pagefault_{enable,disable}(). This is fully handled by the generic
> lazy_mmu helpers that were recently introduced.
>
> lazy_mmu sections were not initially intended to nest, so we need to
> clarify the semantics w.r.t. the arch_*_lazy_mmu_mode() callbacks.
> This patch takes the following approach:
>
> * The outermost calls to lazy_mmu_mode_{enable,disable}() trigger
> calls to arch_{enter,leave}_lazy_mmu_mode() - this is unchanged.
>
> * Nested calls to lazy_mmu_mode_{enable,disable}() are not forwarded
> to the arch via arch_{enter,leave} - lazy MMU remains enabled so
> the assumption is that these callbacks are not relevant. However,
> existing code may rely on a call to disable() to flush any batched
> state, regardless of nesting. arch_flush_lazy_mmu_mode() is
> therefore called in that situation.
>
> A separate interface was recently introduced to temporarily pause
> the lazy MMU mode: lazy_mmu_mode_{pause,resume}(). pause() fully
> exits the mode *regardless of the nesting level*, and resume()
> restores the mode at the same nesting level.
>
> Whether the mode is actually enabled or not at any point is tracked
> by a separate "active" field in task_struct; this makes it possible
> to check invariants in the generic API, and to expose a new
> in_lazy_mmu_mode() helper to replace the various ways arch's
> currently track whether the mode is enabled (this will be done in
> later patches).
>
> In summary (nesting/active represent the values *after* the call):
>
> lazy_mmu_mode_enable()   -> arch_enter()  nesting=1 active=1
> lazy_mmu_mode_enable()   -> ø             nesting=2 active=1
> lazy_mmu_mode_pause()    -> arch_leave()  nesting=2 active=0
> lazy_mmu_mode_resume()   -> arch_enter()  nesting=2 active=1
> lazy_mmu_mode_disable()  -> arch_flush()  nesting=1 active=1
> lazy_mmu_mode_disable()  -> arch_leave()  nesting=0 active=0
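Just to restate that table as caller code, for my own understanding - purely
an illustrative sketch, not taken from the patch:

	lazy_mmu_mode_enable();		/* arch_enter():  nesting=1 active=1 */
	/* ... batched PTE updates ... */
	lazy_mmu_mode_enable();		/* no arch call:  nesting=2 active=1 */
	lazy_mmu_mode_pause();		/* arch_leave():  nesting=2 active=0 */
	/* ... updates here take immediate effect ... */
	lazy_mmu_mode_resume();		/* arch_enter():  nesting=2 active=1 */
	lazy_mmu_mode_disable();	/* arch_flush():  nesting=1 active=1 */
	lazy_mmu_mode_disable();	/* arch_leave():  nesting=0 active=0 */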
>
> Note: in_lazy_mmu_mode() is added to <linux/sched.h> to allow arch
> headers included by <linux/pgtable.h> to use it.
>
> Signed-off-by: Kevin Brodsky <kevin.brodsky@....com>
> ---
> arch/arm64/include/asm/pgtable.h | 12 ------
> include/linux/mm_types_task.h | 5 +++
> include/linux/pgtable.h | 67 ++++++++++++++++++++++++++++++--
> include/linux/sched.h | 16 ++++++++
> 4 files changed, 84 insertions(+), 16 deletions(-)
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 54f8d6bb6f22..535435248923 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -82,18 +82,6 @@ static inline void queue_pte_barriers(void)
>
> static inline void arch_enter_lazy_mmu_mode(void)
> {
> - /*
> - * lazy_mmu_mode is not supposed to permit nesting. But in practice this
> - * does happen with CONFIG_DEBUG_PAGEALLOC, where a page allocation
> - * inside a lazy_mmu_mode section (such as zap_pte_range()) will change
> - * permissions on the linear map with apply_to_page_range(), which
> - * re-enters lazy_mmu_mode. So we tolerate nesting in our
> - * implementation. The first call to arch_leave_lazy_mmu_mode() will
> - * flush and clear the flag such that the remainder of the work in the
> - * outer nest behaves as if outside of lazy mmu mode. This is safe and
> - * keeps tracking simple.
> - */
> -
> if (in_interrupt())
> return;
>
> diff --git a/include/linux/mm_types_task.h b/include/linux/mm_types_task.h
> index a82aa80c0ba4..632d404f8191 100644
> --- a/include/linux/mm_types_task.h
> +++ b/include/linux/mm_types_task.h
> @@ -88,4 +88,9 @@ struct tlbflush_unmap_batch {
> #endif
> };
>
> +struct lazy_mmu_state {
> + u8 nesting_level;
> + bool active;
> +};
> +
> #endif /* _LINUX_MM_TYPES_TASK_H */
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index b5fdf32c437f..e6064e00b22d 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -228,27 +228,86 @@ static inline int pmd_dirty(pmd_t pmd)
> * of the lazy mode. So the implementation must assume preemption may be enabled
> * and cpu migration is possible; it must take steps to be robust against this.
> * (In practice, for user PTE updates, the appropriate page table lock(s) are
> - * held, but for kernel PTE updates, no lock is held). Nesting is not permitted
> - * and the mode cannot be used in interrupt context.
> + * held, but for kernel PTE updates, no lock is held). The mode cannot be used
> + * in interrupt context.
"The mode cannot be used in interrupt context"; except it is for arm64. KFENCE
and/or DEBUG_PAGEALLOC will request the arch to change linear map permissions,
which will enter lazy mmu (now using the new generic API). This can happen in
softirq context.
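Roughly the sort of path I have in mind (hand-wavy sketch):

	<task, with or without a lazy_mmu section open>
	  <softirq>
	    KFENCE / DEBUG_PAGEALLOC changes linear map permissions
	      apply_to_page_range()
	        lazy_mmu_mode_enable()	<- lazy mmu entered in softirq context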
> + *
> + * The lazy MMU mode is enabled for a given block of code using:
> + *
> + * lazy_mmu_mode_enable();
> + * <code>
> + * lazy_mmu_mode_disable();
> + *
> + * Nesting is permitted: <code> may itself use an enable()/disable() pair.
> + * A nested call to enable() has no functional effect; however disable() causes
> + * any batched architectural state to be flushed regardless of nesting. After a
> + * call to disable(), the caller can therefore rely on all previous page table
> + * modifications to have taken effect, but the lazy MMU mode may still be
> + * enabled.
> + *
> + * In certain cases, it may be desirable to temporarily pause the lazy MMU mode.
> + * This can be done using:
> + *
> + * lazy_mmu_mode_pause();
> + * <code>
> + * lazy_mmu_mode_resume();
> + *
> + * This sequence must only be used if the lazy MMU mode is already enabled.
> + * pause() ensures that the mode is exited regardless of the nesting level;
> + * resume() re-enters the mode at the same nesting level. <code> must not modify
> + * the lazy MMU state (i.e. it must not call any of the lazy_mmu_mode_*
> + * helpers).
> + *
> + * in_lazy_mmu_mode() can be used to check whether the lazy MMU mode is
> + * currently enabled.
> */
Nice documentation!
> #ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
> static inline void lazy_mmu_mode_enable(void)
> {
> - arch_enter_lazy_mmu_mode();
> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
> +
> + VM_WARN_ON_ONCE(state->nesting_level == U8_MAX);
> + /* enable() must not be called while paused */
> + VM_WARN_ON(state->nesting_level > 0 && !state->active);
> +
> + if (state->nesting_level++ == 0) {
Hmm... for the arm64 case of calling this in an interrupt, is it safe?
If a task is calling this function and gets interrupted here, nesting_level==1
but active==false. The interrupt then calls this function and increments from 1
to 2 but arch_enter_lazy_mmu_mode() hasn't been called.
More dangerously (I think), when the interrupt handler calls
lazy_mmu_mode_disable(), it will end up calling arch_flush_lazy_mmu_mode() which
could be an issue because as far as the arch is concerned, it's not in lazy mode.
The current arm64 implementation works because setting and clearing the thread
flags is atomic.
Perhaps you need to disable preemption around the if block?
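i.e. something like this (untested sketch of what I mean):

	preempt_disable();
	if (state->nesting_level++ == 0) {
		state->active = true;
		arch_enter_lazy_mmu_mode();
	}
	preempt_enable();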
> + state->active = true;
> + arch_enter_lazy_mmu_mode();
> + }
> }
>
> static inline void lazy_mmu_mode_disable(void)
> {
> - arch_leave_lazy_mmu_mode();
> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
> +
> + VM_WARN_ON_ONCE(state->nesting_level == 0);
> + VM_WARN_ON(!state->active);
> +
> + if (--state->nesting_level == 0) {
> + state->active = false;
> + arch_leave_lazy_mmu_mode();
> + } else {
> + /* Exiting a nested section */
> + arch_flush_lazy_mmu_mode();
> + }
> }
>
> static inline void lazy_mmu_mode_pause(void)
> {
> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
> +
> + VM_WARN_ON(state->nesting_level == 0 || !state->active);
nit: do you need the first condition? I think when nesting_level==0, we expect
!active anyway?
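If that invariant always holds, I guess this could be reduced to just:

	VM_WARN_ON(!state->active);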
> +
> + state->active = false;
> arch_leave_lazy_mmu_mode();
> }
>
> static inline void lazy_mmu_mode_resume(void)
> {
> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
> +
> + VM_WARN_ON(state->nesting_level == 0 || state->active);
Similar argument?
> +
> + state->active = true;
> arch_enter_lazy_mmu_mode();
> }
> #else
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index cbb7340c5866..11566d973f42 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1441,6 +1441,10 @@ struct task_struct {
>
> struct page_frag task_frag;
>
> +#ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
> + struct lazy_mmu_state lazy_mmu_state;
> +#endif
> +
> #ifdef CONFIG_TASK_DELAY_ACCT
> struct task_delay_info *delays;
> #endif
> @@ -1724,6 +1728,18 @@ static inline char task_state_to_char(struct task_struct *tsk)
> return task_index_to_char(task_state_index(tsk));
> }
>
> +#ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
> +static inline bool in_lazy_mmu_mode(void)
> +{
> + return current->lazy_mmu_state.active;
> +}
> +#else
> +static inline bool in_lazy_mmu_mode(void)
> +{
> + return false;
Just pointing out that this isn't really a correct implementation:

	lazy_mmu_mode_enable()
	ASSERT(in_lazy_mmu_mode())	<< triggers for arches without lazy mmu
	lazy_mmu_mode_disable()

Although it probably doesn't matter in practice?
Thanks,
Ryan
> +}
> +#endif
> +
> extern struct pid *cad_pid;
>
> /*