Message-ID: <c5722693-1fd2-4dc6-b081-5e9b4299e486@lucifer.local>
Date: Fri, 30 May 2025 13:18:18 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Ryan Roberts <ryan.roberts@....com>
Cc: Dev Jain <dev.jain@....com>, akpm@...ux-foundation.org, david@...hat.com,
        catalin.marinas@....com, will@...nel.org, Liam.Howlett@...cle.com,
        vbabka@...e.cz, rppt@...nel.org, surenb@...gle.com, mhocko@...e.com,
        linux-mm@...ck.org, linux-kernel@...r.kernel.org,
        suzuki.poulose@....com, steven.price@....com, gshan@...hat.com,
        linux-arm-kernel@...ts.infradead.org
Subject: Re: [PATCH 3/3] mm/pagewalk: Add pre/post_pte_table callback for
 lazy MMU on arm64

On Fri, May 30, 2025 at 01:12:07PM +0100, Ryan Roberts wrote:
> On 30/05/2025 12:14, Lorenzo Stoakes wrote:
> > On Fri, May 30, 2025 at 02:34:07PM +0530, Dev Jain wrote:
> >> arm64 implements lazy_mmu_mode to allow deferral and batching of barriers
> >> when updating kernel PTEs, which provides a nice performance boost. arm64
> >> currently uses apply_to_page_range() to modify kernel PTE permissions,
> >> which runs inside lazy_mmu_mode. So to prevent a performance regression,
> >> let's add hooks to walk_page_range_novma() to allow continued use of
> >> lazy_mmu_mode.
> >>
> >> Signed-off-by: Dev Jain <dev.jain@....com>
> >> ---
> >> Credits to Ryan for the patch description.
> >>
> >>  arch/arm64/mm/pageattr.c | 12 ++++++++++++
> >>  include/linux/pagewalk.h |  2 ++
> >>  mm/pagewalk.c            |  6 ++++++
> >>  3 files changed, 20 insertions(+)
> >>
> >> diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
> >> index a5c829c64969..9163324b12a0 100644
> >> --- a/arch/arm64/mm/pageattr.c
> >> +++ b/arch/arm64/mm/pageattr.c
> >> @@ -75,11 +75,23 @@ static int pageattr_pte_entry(pte_t *pte, unsigned long addr,
> >>  	return 0;
> >>  }
> >>
> >> +static void pte_lazy_mmu_enter(void)
> >> +{
> >> +	arch_enter_lazy_mmu_mode();
> >> +}
> >
> > Hm am I missing something? I don't see this function or the leave version
> > defined in arch/arm64?
> >
> > Nor do I see __HAVE_ARCH_ENTER_LAZY_MMU_MODE?
>
> arm64 is starting to use lazy_mmu_mode from v6.16 - the changes are in Linus's tree.

OK, I did a pull and I see it now. It must have been merged there very recently!

Dev - when you rely on an upstream change that's going to land during the
merge window, I suggest holding off on sending the series until after the
merge window closes.

Because if this lands in mm-new it will, as things stand, not actually do
anything :) and presumably mm-new is the target.

Obviously this would be more egregious if you were relying upon something that
would result in a compile failure or similar, but it makes life easier if we
really make sure your dependencies are fulfilled.

>
> >
> >> +
> >> +static void pte_lazy_mmu_leave(void)
> >> +{
> >> +	arch_leave_lazy_mmu_mode();
> >> +}
> >
> > Are you absolutely sure you will never need to hook this stuff on higher level
> > page tables?
> >
> > If this relates to vmalloc, then we do have huge page mappings in vmalloc logic?
>
> The performance advantage (for arm64's usage at least) is really only
> (currently) beneficial in practice at the PTE level, since we can reduce the
> number of barriers by 512x. And apply_to_page_range() was only using lazy mmu
> for the PTE level anyway.

Right.
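
(Aside, for readers following along: the 512x figure comes from deferring the
per-PTE barriers and issuing them once for the whole table. Conceptually it is
something like the sketch below - the flag and helper names are invented, and
this is not the actual arm64 implementation:)

/*
 * Conceptual sketch only. With lazy MMU mode active, each kernel PTE
 * update skips its dsb/isb pair; a single barrier pair when leaving
 * the mode covers the whole batch (up to PTRS_PER_PTE == 512 updates
 * per table).
 */
static bool example_lazy_mmu_active;

static void example_set_kernel_pte(pte_t *ptep, pte_t pte)
{
	WRITE_ONCE(*ptep, pte);
	if (!example_lazy_mmu_active) {
		dsb(ishst);	/* make the update visible to the walker */
		isb();
	}
}

static void example_leave_lazy_mmu(void)
{
	example_lazy_mmu_active = false;
	dsb(ishst);		/* one barrier pair for the whole batch */
	isb();
}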

>
> But actually I think we can do better here...

Oh...

>
> >
> >> +
> >>  static const struct mm_walk_ops pageattr_ops = {
> >>  	.pud_entry	= pageattr_pud_entry,
> >>  	.pmd_entry	= pageattr_pmd_entry,
> >>  	.pte_entry	= pageattr_pte_entry,
> >>  	.walk_lock	= PGWALK_NOLOCK,
> >> +	.pre_pte_table	= pte_lazy_mmu_enter,
> >> +	.post_pte_table	= pte_lazy_mmu_leave,
> >
> > This is kind of horrid really - are we sure lazy mmu mode is valid for
> > everything that occurs within the loop? I suppose it's only simple logic
> > for the ops->pte_entry stuff.
> >
> > But it feels like hacking something in for this specific case.
> >
> > At the same time I don't want to get in the way of an optimisation. We could do
> > something in ops->pmd_entry, but then we'd not get to turn it off afterwards...
> >
> > Same for any higher level page table hm.
> >
> > Is this really the only way to achieve this? I guess it's not feasible to
> > have it just switched on for the whole operation...
>
> ...I think you're right. The only reason we traditionally confine lazy mmu
> mode to a single page table is that we want to enclose it within the PTL. But
> that requirement doesn't apply to kernel mappings. As long as the walker can
> guarantee that it doesn't allocate any memory (because with certain debug
> settings that can cause lazy mmu nesting) or try to sleep, then I think we
> can just bracket the whole walk_page_range_novma() call.
>
> So I think we can avoid these new callbacks and just do:
>
> arch_enter_lazy_mmu_mode()
> walk_page_range_novma()
> arch_leave_lazy_mmu_mode()
>
> That will even give us the benefit of optimizing at PMD/PUD levels.

Yeah, this would avoid the issues I am concerned about!
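
Roughly, that would look something like the below on the arm64 side. Just a
sketch, assuming the current walk_page_range_novma() signature and the
existing page_change_data / pageattr_ops naming; the wrapper name is invented:

static int update_range_prot(unsigned long start, unsigned long size,
			     pgprot_t set_mask, pgprot_t clear_mask)
{
	struct page_change_data data = {
		.set_mask = set_mask,
		.clear_mask = clear_mask,
	};
	int ret;

	arch_enter_lazy_mmu_mode();
	/*
	 * Nothing in the walk may allocate memory or sleep while lazy
	 * MMU mode is active.
	 */
	ret = walk_page_range_novma(&init_mm, start, start + size,
				    &pageattr_ops, NULL, &data);
	arch_leave_lazy_mmu_mode();

	return ret;
}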

>
>
> >
> > I just fear that we could end up populating these mm_walk_ops with every corner
> > case thing we think of.
> >
> >>  };
> >>
> >>  bool rodata_full __ro_after_init = IS_ENABLED(CONFIG_RODATA_FULL_DEFAULT_ENABLED);
> >> diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
> >> index 9bc8853ed3de..2157d345974c 100644
> >> --- a/include/linux/pagewalk.h
> >> +++ b/include/linux/pagewalk.h
> >> @@ -88,6 +88,8 @@ struct mm_walk_ops {
> >>  	int (*pre_vma)(unsigned long start, unsigned long end,
> >>  		       struct mm_walk *walk);
> >>  	void (*post_vma)(struct mm_walk *walk);
> >> +	void (*pre_pte_table)(void);
> >> +	void (*post_pte_table)(void);
>
> nit: If we did end up with this approach, I wonder if it would be better to
> generalize and call them pre_table() and post_table(), passing in a level
> param? In any case, you'll at least want to pass the walk structure.

I thought the same thing, but wondered if we could just avoid doing this
altogether, which would be my preference.

And from what you say it does seem feasible!
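
(For completeness, if the callbacks did survive, the generalized form would
look roughly like the below - purely illustrative, the level enum and hook
names are invented for the sketch:)

/* Illustrative only: a level-aware variant of the proposed hooks. */
enum page_walk_level {
	PGWALK_LEVEL_PGD,
	PGWALK_LEVEL_P4D,
	PGWALK_LEVEL_PUD,
	PGWALK_LEVEL_PMD,
	PGWALK_LEVEL_PTE,
};

/* In struct mm_walk_ops, replacing pre/post_pte_table: */
void (*pre_table)(enum page_walk_level level, struct mm_walk *walk);
void (*post_table)(enum page_walk_level level, struct mm_walk *walk);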

>
> >>  	int (*install_pte)(unsigned long addr, unsigned long next,
> >>  			   pte_t *ptep, struct mm_walk *walk);
> >>  	enum page_walk_lock walk_lock;
> >> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> >> index 9657cf4664b2..a441f5cbbc45 100644
> >> --- a/mm/pagewalk.c
> >> +++ b/mm/pagewalk.c
> >> @@ -33,6 +33,9 @@ static int walk_pte_range_inner(pte_t *pte, unsigned long addr,
> >>  	const struct mm_walk_ops *ops = walk->ops;
> >>  	int err = 0;
> >>
> >> +	if (walk->ops->pre_pte_table)
> >> +		walk->ops->pre_pte_table();
> >
> > NIT: you have 'ops' already, no need for walk-> :)
> >
> >> +
> >>  	for (;;) {
> >>  		if (ops->install_pte && pte_none(ptep_get(pte))) {
> >>  			pte_t new_pte;
> >> @@ -56,6 +59,9 @@ static int walk_pte_range_inner(pte_t *pte, unsigned long addr,
> >>  		addr += PAGE_SIZE;
> >>  		pte++;
> >>  	}
> >> +
> >> +	if (walk->ops->post_pte_table)
> >> +		walk->ops->post_pte_table();
> >
> > NIT: same as above.
> >
> >>  	return err;
> >>  }
> >>
> >> --
> >> 2.30.2
> >>
> >
> >
>

Cheers, Lorenzo
