linux-kernel - Re: [PATCH v4 07/12] mm: enable lazy

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <8f70692c-25a9-4bd0-94ab-43ab435e4b1b@arm.com>
Date: Tue, 11 Nov 2025 17:03:14 +0000
From: Ryan Roberts <ryan.roberts@....com>
To: Kevin Brodsky <kevin.brodsky@....com>, linux-mm@...ck.org
Cc: linux-kernel@...r.kernel.org, Alexander Gordeev <agordeev@...ux.ibm.com>,
 Andreas Larsson <andreas@...sler.com>,
 Andrew Morton <akpm@...ux-foundation.org>,
 Boris Ostrovsky <boris.ostrovsky@...cle.com>, Borislav Petkov
 <bp@...en8.de>, Catalin Marinas <catalin.marinas@....com>,
 Christophe Leroy <christophe.leroy@...roup.eu>,
 Dave Hansen <dave.hansen@...ux.intel.com>,
 David Hildenbrand <david@...hat.com>, "David S. Miller"
 <davem@...emloft.net>, David Woodhouse <dwmw2@...radead.org>,
 "H. Peter Anvin" <hpa@...or.com>, Ingo Molnar <mingo@...hat.com>,
 Jann Horn <jannh@...gle.com>, Juergen Gross <jgross@...e.com>,
 "Liam R. Howlett" <Liam.Howlett@...cle.com>,
 Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
 Madhavan Srinivasan <maddy@...ux.ibm.com>,
 Michael Ellerman <mpe@...erman.id.au>, Michal Hocko <mhocko@...e.com>,
 Mike Rapoport <rppt@...nel.org>, Nicholas Piggin <npiggin@...il.com>,
 Peter Zijlstra <peterz@...radead.org>, Suren Baghdasaryan
 <surenb@...gle.com>, Thomas Gleixner <tglx@...utronix.de>,
 Vlastimil Babka <vbabka@...e.cz>, Will Deacon <will@...nel.org>,
 Yeoreum Yun <yeoreum.yun@....com>, linux-arm-kernel@...ts.infradead.org,
 linuxppc-dev@...ts.ozlabs.org, sparclinux@...r.kernel.org,
 xen-devel@...ts.xenproject.org, x86@...nel.org
Subject: Re: [PATCH v4 07/12] mm: enable lazy_mmu sections to nest

On 11/11/2025 15:56, Kevin Brodsky wrote:
> On 11/11/2025 10:24, Ryan Roberts wrote:
>> [...]
>>
>>>>> +		state->active = true;
>>>>> +		arch_enter_lazy_mmu_mode();
>>>>> +	}
>>>>>  }
>>>>>  
>>>>>  static inline void lazy_mmu_mode_disable(void)
>>>>>  {
>>>>> -	arch_leave_lazy_mmu_mode();
>>>>> +	struct lazy_mmu_state *state = &current->lazy_mmu_state;
>>>>> +
>>>>> +	VM_WARN_ON_ONCE(state->nesting_level == 0);
>>>>> +	VM_WARN_ON(!state->active);
>>>>> +
>>>>> +	if (--state->nesting_level == 0) {
>>>>> +		state->active = false;
>>>>> +		arch_leave_lazy_mmu_mode();
>>>>> +	} else {
>>>>> +		/* Exiting a nested section */
>>>>> +		arch_flush_lazy_mmu_mode();
>>>>> +	}
>>>>>  }
>>>>>  
>>>>>  static inline void lazy_mmu_mode_pause(void)
>>>>>  {
>>>>> +	struct lazy_mmu_state *state = &current->lazy_mmu_state;
>>>>> +
>>>>> +	VM_WARN_ON(state->nesting_level == 0 || !state->active);
>>>> nit: do you need the first condition? I think when nesting_level==0, we expect
>>>> to be !active?
>>> I suppose this should never happen indeed - I was just being extra
>>> defensive.
>>>
>>> Either way David suggested allowing pause()/resume() to be called
>>> outside of any section so the next version will bail out on
>>> nesting_level == 0.
>> Ignoring my current opinion that we don't need pause/resume at all for now; Are
>> you suggesting that pause/resume will be completely independent of
>> enable/disable? I think that would be best. So enable/disable increment and
>> decrement the nesting_level counter regardless of whether we are paused.
>> nesting_level 0 => 1 enables if not paused. nesting_level 1 => 0 disables if not
>> paused. pause disables nesting_level >= 1, resume enables if nesting_level >= 1.
> 
> This is something else. Currently the rules are:
> 
> [A]
> 
> // pausing forbidden
> enable()
>     pause()
>     // pausing/enabling forbidden
>     resume()
> disable()
> 
> David suggested allowing:
> 
> [B]
> 
> pause()
> // pausing/enabling forbidden
> resume()
> 
> Your suggestion is also allowing:
> 
> [C]
> 
> pause()
>     // pausing forbidden
>     enable()
>     disable()
> resume()

I think the current kasan kasan_depopulate_vmalloc_pte() path will require [C]
if CONFIG_DEBUG_PAGEALLOC is enabled on arm64. It calls __free_page() while
paused. I guess CONFIG_DEBUG_PAGEALLOC will cause __free_page() ->
debug_pagealloc_unmap_pages() ->->-> update_range_prot() -> lazy_mmu_enable().

Arguably you could move the resume() to before the __free_page(). But it just
illustrates that it's all a bit brittle at the moment...

> 
>> Perhaps we also need nested pause/resume? Then you just end up with 2 counters;
>> enable_count and pause_count. Sorry if this has already been discussed.
> 
> And finally:
> 
> [D]
> 
> pause()
>     pause()
>         enable()
>         disable()
>     resume()
> resume()
> 
> I don't really mind either way, but I don't see an immediate use for [C]
> and [D] - the idea is that the paused section is short and controlled,
> not made up of arbitrary calls. 

If my thinking above is correct, then I've already demonstrated that this is not
the case. So I'd be inclined to go with [D] on the basis that it is the most robust.

Keeping 2 nesting counts (enable and pause) feels pretty elegant to me and gives
the fewest opportunities for surprises.

Thanks,
Ryan

> A potential downside of allowing [C] and
> [D] is that it makes it harder to detect unintended nesting (fewer
> VM_WARN assertions). Happy to implement it if this proves useful though.
> 
> OTOH the idea behind [B] is that it allows the caller of
> pause()/resume() not to care about whether lazy MMU is actually enabled
> or not - i.e. the kasan helpers would keep working even if
> apply_to_page_range() didn't use lazy MMU any more.
> 
>>>>> +
>>>>> +	state->active = false;
>>>>>  	arch_leave_lazy_mmu_mode();
>>>>>  }
>>>>>  
>>>>>  static inline void lazy_mmu_mode_resume(void)
>>>>>  {
>>>>> +	struct lazy_mmu_state *state = &current->lazy_mmu_state;
>>>>> +
>>>>> +	VM_WARN_ON(state->nesting_level == 0 || state->active);
>>>> Similar argument?
>>>>
>>>>> +
>>>>> +	state->active = true;
>>>>>  	arch_enter_lazy_mmu_mode();
>>>>>  }
>>>>>  #else
>>>>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>>>>> index cbb7340c5866..11566d973f42 100644
>>>>> --- a/include/linux/sched.h
>>>>> +++ b/include/linux/sched.h
>>>>> @@ -1441,6 +1441,10 @@ struct task_struct {
>>>>>  
>>>>>  	struct page_frag		task_frag;
>>>>>  
>>>>> +#ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
>>>>> +	struct lazy_mmu_state		lazy_mmu_state;
>>>>> +#endif
>>>>> +
>>>>>  #ifdef CONFIG_TASK_DELAY_ACCT
>>>>>  	struct task_delay_info		*delays;
>>>>>  #endif
>>>>> @@ -1724,6 +1728,18 @@ static inline char task_state_to_char(struct task_struct *tsk)
>>>>>  	return task_index_to_char(task_state_index(tsk));
>>>>>  }
>>>>>  
>>>>> +#ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
>>>>> +static inline bool in_lazy_mmu_mode(void)
>>>>> +{
>>>>> +	return current->lazy_mmu_state.active;
>>>>> +}
>>>>> +#else
>>>>> +static inline bool in_lazy_mmu_mode(void)
>>>>> +{
>>>>> +	return false;
>>>> Just pointing out that this isn't really a correct implementation:
>>>>
>>>> lazy_mmu_mode_enable()
>>>> ASSERT(in_lazy_mmu_mode()) << triggers for arches without lazy mmu
>>>> lazy_mmu_mode_disable()
>>>>
>>>> Although it probably doesn't matter in practice?
>>> I'd say that the expectation is invalid - lazy MMU mode can only be
>>> enabled if the architecture supports it. In fact as you pointed out
>>> above the API may be called in interrupt context but it will have no
>>> effect, so this sequence would always fail in interrupt context.
>> Yep, but previously there was no way to query the current state so it didn't
>> matter. Now you have a query API so it might matter if/when people come along
>> and use it in unexpected ways.
> 
> I suppose the best we can do is document it alongside those helpers
> (David has already suggested some documentation, see patch 11).
> 
>>> Worth nothing that in_lazy_mmu_mode() is only ever called from arch code
>>> where lazy MMU is implemented. I added the fallback as a matter of
>>> principle, but it isn't actually required.
>> Yes, I agree that's the intent. I'm just wondering if it's possible to enforce
>> that only arch code uses this. Perhaps add some docs to explain that it's only
>> intended for arches that implement lazy_mmu, and don't define it for arches that
>> don't, which would catch any generic users?
> 
> Yep sounds like the best option - a lot less risk of misuse if it can't
> be called from generic code :) The build would still succeed on arch's
> that implement it, but the kernel CI should catch such calls sooner or
> later.
> 
> - Kevin