Message-ID: <cc9dc398-b9c5-4bb8-94ad-7e7f3ddd5b4f@arm.com>
Date: Mon, 10 Nov 2025 11:47:51 +0100
From: Kevin Brodsky <kevin.brodsky@....com>
To: Ryan Roberts <ryan.roberts@....com>, linux-mm@...ck.org
Cc: linux-kernel@...r.kernel.org, Alexander Gordeev <agordeev@...ux.ibm.com>,
Andreas Larsson <andreas@...sler.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Boris Ostrovsky <boris.ostrovsky@...cle.com>, Borislav Petkov
<bp@...en8.de>, Catalin Marinas <catalin.marinas@....com>,
Christophe Leroy <christophe.leroy@...roup.eu>,
Dave Hansen <dave.hansen@...ux.intel.com>,
David Hildenbrand <david@...hat.com>, "David S. Miller"
<davem@...emloft.net>, David Woodhouse <dwmw2@...radead.org>,
"H. Peter Anvin" <hpa@...or.com>, Ingo Molnar <mingo@...hat.com>,
Jann Horn <jannh@...gle.com>, Juergen Gross <jgross@...e.com>,
"Liam R. Howlett" <Liam.Howlett@...cle.com>,
Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
Madhavan Srinivasan <maddy@...ux.ibm.com>,
Michael Ellerman <mpe@...erman.id.au>, Michal Hocko <mhocko@...e.com>,
Mike Rapoport <rppt@...nel.org>, Nicholas Piggin <npiggin@...il.com>,
Peter Zijlstra <peterz@...radead.org>, Suren Baghdasaryan
<surenb@...gle.com>, Thomas Gleixner <tglx@...utronix.de>,
Vlastimil Babka <vbabka@...e.cz>, Will Deacon <will@...nel.org>,
Yeoreum Yun <yeoreum.yun@....com>, linux-arm-kernel@...ts.infradead.org,
linuxppc-dev@...ts.ozlabs.org, sparclinux@...r.kernel.org,
xen-devel@...ts.xenproject.org, x86@...nel.org
Subject: Re: [PATCH v4 07/12] mm: enable lazy_mmu sections to nest
On 07/11/2025 14:59, Ryan Roberts wrote:
> On 29/10/2025 10:09, Kevin Brodsky wrote:
>> [...]
>>
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index b5fdf32c437f..e6064e00b22d 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -228,27 +228,86 @@ static inline int pmd_dirty(pmd_t pmd)
>> * of the lazy mode. So the implementation must assume preemption may be enabled
>> * and cpu migration is possible; it must take steps to be robust against this.
>> * (In practice, for user PTE updates, the appropriate page table lock(s) are
>> - * held, but for kernel PTE updates, no lock is held). Nesting is not permitted
>> - * and the mode cannot be used in interrupt context.
>> + * held, but for kernel PTE updates, no lock is held). The mode cannot be used
>> + * in interrupt context.
> "The mode cannot be used in interrupt context"; except it is for arm64. KFENCE
> and/or DEBUG_PAGEALLOC will request the arch to change linear map permissions,
> which will enter lazy mmu (now using the new generic API). This can happen in
> softirq context.
Are you happy with the wording update in patch 12?
>> + *
>> + * The lazy MMU mode is enabled for a given block of code using:
>> + *
>> + * lazy_mmu_mode_enable();
>> + * <code>
>> + * lazy_mmu_mode_disable();
>> + *
>> + * Nesting is permitted: <code> may itself use an enable()/disable() pair.
>> + * A nested call to enable() has no functional effect; however disable() causes
>> + * any batched architectural state to be flushed regardless of nesting. After a
>> + * call to disable(), the caller can therefore rely on all previous page table
>> + * modifications to have taken effect, but the lazy MMU mode may still be
>> + * enabled.
>> + *
>> + * In certain cases, it may be desirable to temporarily pause the lazy MMU mode.
>> + * This can be done using:
>> + *
>> + * lazy_mmu_mode_pause();
>> + * <code>
>> + * lazy_mmu_mode_resume();
>> + *
>> + * This sequence must only be used if the lazy MMU mode is already enabled.
>> + * pause() ensures that the mode is exited regardless of the nesting level;
>> + * resume() re-enters the mode at the same nesting level. <code> must not modify
>> + * the lazy MMU state (i.e. it must not call any of the lazy_mmu_mode_*
>> + * helpers).
>> + *
>> + * in_lazy_mmu_mode() can be used to check whether the lazy MMU mode is
>> + * currently enabled.
>> */
> Nice documentation!
Thanks!
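For readers skimming the thread, the nesting semantics described above
boil down to this sort of pattern (purely illustrative; set_ptes()
stands in for any batched page table update):

  lazy_mmu_mode_enable();    /* outermost: arch_enter_lazy_mmu_mode() */
  set_ptes(...);
  lazy_mmu_mode_enable();    /* nested: no functional effect */
  set_ptes(...);
  lazy_mmu_mode_disable();   /* nested: arch_flush_lazy_mmu_mode(),
                                earlier updates take effect */
  set_ptes(...);
  lazy_mmu_mode_disable();   /* outermost: arch_leave_lazy_mmu_mode() */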
>> #ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
>> static inline void lazy_mmu_mode_enable(void)
>> {
>> - arch_enter_lazy_mmu_mode();
>> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
>> +
>> + VM_WARN_ON_ONCE(state->nesting_level == U8_MAX);
>> + /* enable() must not be called while paused */
>> + VM_WARN_ON(state->nesting_level > 0 && !state->active);
>> +
>> + if (state->nesting_level++ == 0) {
> Hmm... for the arm64 case of calling this in an interrupt, is it safe?
>
> If a task is calling this function and gets interrupted here, nesting_level==1
> but active==false. The interrupt then calls this function and increments from 1
> to 2 but arch_enter_lazy_mmu_mode() hasn't been called.
>
> More dangerously (I think), when the interrupt handler calls
> lazy_mmu_mode_disable(), it will end up calling arch_flush_lazy_mmu_mode() which
> could be an issue because as far as the arch is concerned, it's not in lazy mode.
>
> The current arm64 implementation works because setting and clearing the thread
> flags is atomic.
>
> Perhaps you need to disable preemption around the if block?
As you found out, this is addressed in patch 12, but indeed I hadn't
realised that this patch leaves the generic API in an unsafe situation
w.r.t. interrupts. We at least need in_interrupt() checks in the
generic layer by the time we get to this patch.
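Concretely, I am thinking of something along these lines at the top of
each generic helper (a sketch only - patch 12 may end up structuring
this differently):

  static inline void lazy_mmu_mode_enable(void)
  {
  	/* Bail out in interrupt context until it is handled safely */
  	if (in_interrupt())
  		return;
  	...
  }

i.e. turn the helpers into no-ops in interrupt context so that the
nesting_level/active bookkeeping cannot be corrupted by a handler
interrupting a task mid-update.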
>> + state->active = true;
>> + arch_enter_lazy_mmu_mode();
>> + }
>> }
>>
>> static inline void lazy_mmu_mode_disable(void)
>> {
>> - arch_leave_lazy_mmu_mode();
>> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
>> +
>> + VM_WARN_ON_ONCE(state->nesting_level == 0);
>> + VM_WARN_ON(!state->active);
>> +
>> + if (--state->nesting_level == 0) {
>> + state->active = false;
>> + arch_leave_lazy_mmu_mode();
>> + } else {
>> + /* Exiting a nested section */
>> + arch_flush_lazy_mmu_mode();
>> + }
>> }
>>
>> static inline void lazy_mmu_mode_pause(void)
>> {
>> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
>> +
>> + VM_WARN_ON(state->nesting_level == 0 || !state->active);
> nit: do you need the first condition? I think when nesting_level==0, we expect
> to be !active?
Indeed, this should never happen - I was just being extra defensive.
Either way, David suggested allowing pause()/resume() to be called
outside of any section, so the next version will bail out on
nesting_level == 0.
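Roughly like this (my sketch of what the next version will do, subject
to change):

  static inline void lazy_mmu_mode_pause(void)
  {
  	struct lazy_mmu_state *state = &current->lazy_mmu_state;

  	/* Not in a lazy MMU section: nothing to pause */
  	if (state->nesting_level == 0)
  		return;

  	VM_WARN_ON(!state->active);
  	state->active = false;
  	arch_leave_lazy_mmu_mode();
  }

with resume() bailing out symmetrically on nesting_level == 0.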
>> +
>> + state->active = false;
>> arch_leave_lazy_mmu_mode();
>> }
>>
>> static inline void lazy_mmu_mode_resume(void)
>> {
>> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
>> +
>> + VM_WARN_ON(state->nesting_level == 0 || state->active);
> Similar argument?
>
>> +
>> + state->active = true;
>> arch_enter_lazy_mmu_mode();
>> }
>> #else
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index cbb7340c5866..11566d973f42 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -1441,6 +1441,10 @@ struct task_struct {
>>
>> struct page_frag task_frag;
>>
>> +#ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
>> + struct lazy_mmu_state lazy_mmu_state;
>> +#endif
>> +
>> #ifdef CONFIG_TASK_DELAY_ACCT
>> struct task_delay_info *delays;
>> #endif
>> @@ -1724,6 +1728,18 @@ static inline char task_state_to_char(struct task_struct *tsk)
>> return task_index_to_char(task_state_index(tsk));
>> }
>>
>> +#ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
>> +static inline bool in_lazy_mmu_mode(void)
>> +{
>> + return current->lazy_mmu_state.active;
>> +}
>> +#else
>> +static inline bool in_lazy_mmu_mode(void)
>> +{
>> + return false;
> Just pointing out that this isn't really a correct implementation:
>
> lazy_mmu_mode_enable()
> ASSERT(in_lazy_mmu_mode()) << triggers for arches without lazy mmu
> lazy_mmu_mode_disable()
>
> Although it probably doesn't matter in practice?
I'd say that the expectation is invalid - lazy MMU mode can only be
enabled if the architecture supports it. In fact, as you pointed out
above, the API may be called in interrupt context but will have no
effect, so this sequence would always fail in interrupt context.
Worth noting that in_lazy_mmu_mode() is only ever called from arch code
where lazy MMU is implemented. I added the fallback as a matter of
principle, but it isn't actually required.
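For reference, the typical arch-side pattern looks something like this
(function names hypothetical, loosely modelled on the arm64 usage):

  static inline void set_pte_and_maybe_barrier(pte_t *ptep, pte_t pte)
  {
  	__set_pte(ptep, pte);	/* hypothetical low-level write helper */
  	/* Defer the barrier while batching; it is emitted when the
  	 * mode is left or flushed. */
  	if (!in_lazy_mmu_mode())
  		emit_pte_barrier();	/* hypothetical */
  }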
- Kevin