Message-ID: <aXAJRx5JtTenw1Ou@e129823.arm.com>
Date: Tue, 20 Jan 2026 23:01:27 +0000
From: Yeoreum Yun <yeoreum.yun@....com>
To: Yang Shi <yang@...amperecomputing.com>
Cc: Ryan Roberts <ryan.roberts@....com>, Will Deacon <will@...nel.org>,
linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
linux-rt-devel@...ts.linux.dev, catalin.marinas@....com,
akpm@...ux-oundation.org, david@...nel.org, kevin.brodsky@....com,
quic_zhenhuah@...cinc.com, dev.jain@....com,
chaitanyas.prakash@....com, bigeasy@...utronix.de,
clrkwllms@...nel.org, rostedt@...dmis.org,
lorenzo.stoakes@...cle.com, ardb@...nel.org, jackmanb@...gle.com,
vbabka@...e.cz, mhocko@...e.com
Subject: Re: [PATCH v5 2/3] arm64: mmu: avoid allocating pages while
splitting the linear mapping
Hi Yang,
>
>
> On 1/20/26 1:29 AM, Yeoreum Yun wrote:
> > Hi Ryan
> > > On 19/01/2026 21:24, Yeoreum Yun wrote:
> > > > Hi Will,
> > > >
> > > > > On Mon, Jan 05, 2026 at 08:23:27PM +0000, Yeoreum Yun wrote:
> > > > > > +static int __init linear_map_prealloc_split_pgtables(void)
> > > > > > +{
> > > > > > + int ret, i;
> > > > > > + unsigned long lstart = _PAGE_OFFSET(vabits_actual);
> > > > > > + unsigned long lend = PAGE_END;
> > > > > > + unsigned long kstart = (unsigned long)lm_alias(_stext);
> > > > > > + unsigned long kend = (unsigned long)lm_alias(__init_begin);
> > > > > > +
> > > > > > + const struct mm_walk_ops collect_to_split_ops = {
> > > > > > + .pud_entry = collect_to_split_pud_entry,
> > > > > > + .pmd_entry = collect_to_split_pmd_entry
> > > > > > + };
> > > > > Why do we need to rewalk the page-table here instead of collating the
> > > > > number of block mappings we put down when creating the linear map in
> > > > > the first place?
> > > That's a good point; perhaps we can reuse the counters that this series introduces?
> > >
> > > https://lore.kernel.org/all/20260107002944.2940963-1-yang@os.amperecomputing.com/
>
> Yeah, good point. It seems feasible to me. The patch can count how many
> PUD/CONT_PMD/PMD mappings there are, and we can calculate how many page
> table pages need to be allocated based on those counters.
>
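For illustration (my sketch, not something from Yang's series): if such
counters existed, the number of pagetable pages needed to fully split the
linear map down to PTE level could be derived without any rewalk. Splitting
one PUD block takes one PMD table plus one PTE table per resulting PMD
entry; splitting one PMD block takes one PTE table; the CONT_* cases only
clear the contiguous bit, so they need no extra pages. The counter names
below are hypothetical:

static unsigned long nr_split_pgtable_pages(unsigned long nr_pud_blocks,
					    unsigned long nr_pmd_blocks)
{
	/* one PMD table per PUD block, one PTE table per PMD entry/block */
	return nr_pud_blocks * (1 + PTRS_PER_PMD) + nr_pmd_blocks;
}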
> > >
> > > > First, the linear alias of [_text, __init_begin) is not a target for
> > > > the split, and it also seems strange to me to add code inside
> > > > alloc_init_XXX() that both checks an address range and counts to get
> > > > the number of block mappings.
>
> IIUC, it should not be that hard to exclude kernel mappings. We know
> kernel_start and kernel_end, so you should be able to maintain a separate
> set of counters for the kernel, then subtract them when you calculate how
> many page table pages need to be allocated.
As you said, this is not difficult. However, what I meant was that
this collection would be done in alloc_init_XXX(), and in that case,
collecting the number of block mappings for the range
[kernel_start, kernel_end) and adding conditional logic in
alloc_init_XXX() seems a bit odd.
That said, for potential future use cases involving splitting specific ranges,
I don’t think having this kind of collection is necessarily a bad idea.
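To make that future case concrete, a rewalk-based collection naturally
takes an arbitrary [start, end). A minimal sketch, reusing
collect_to_split_ops from the patch (the helper itself is hypothetical,
not actual code):

/* Hypothetical: count pgtable pages needed to split [start, end) */
static int count_split_pgtables_for_range(unsigned long start,
					  unsigned long end)
{
	split_pgtables_idx = 0;
	split_pgtables_count = 0;

	return walk_kernel_page_table_range_lockless(start, end,
						     &collect_to_split_ops,
						     NULL, NULL);
}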
>
> > > >
> > > > Second, for a future feature,
> > > > I hope to add some code to split a "specific" area, e.g.
> > > > to set a specific pkey for a specific area.
> > > Could you give more detail on this? My working assumption is that either the
> > > system supports BBML2 or it doesn't. If it doesn't, we need to split the whole
> > > linear map. If it does, we already have logic to split parts of the linear map
> > > when needed.
> > This is not for the linear mapping case, but for the "kernel text area".
> > As a draft, I want to mark some kernel code as executable by
> > both the kernel and eBPF programs.
> > (I'm trying to prevent eBPF programs from directly executing kernel
> > code, using the POE feature.)
> > For this "executable area" shared by the kernel and eBPF programs
> > -- a typical example is the exception entry -- that specific range
> > needs to be split and marked with a special POE index.
>
> IIUC, you want to change POE attributes for some kernel area (mainly in
> the vmalloc address space). It sounds like you can do something like
> set_memory_rox(), but split the vmalloc address mapping instead of the
> linear mapping. Or do you need to preallocate page table pages in this
> case? Anyway, we can have a simpler way to count block mappings for
> splitting the linear mapping; it doesn't seem necessary to re-walk the
> page table again IMHO.
As I said, it is not only the vmalloc address mapping but also the
"kimage" mapping.
In this case, the mapping needs to be split to set a specific POE index
on the specific code area.
The preallocated pages are for splitting via stop_machine(),
since page table allocation with GFP_ATOMIC isn't possible in
stop_machine() under PREEMPT_RT.
Also, splitting the text-code area to set a specific POE index would be
done via stop_machine(), so the collection is required.
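To spell out the constraint, here is a minimal sketch of how the split
path could consume the preallocated pages inside stop_machine() instead
of allocating (names follow the quoted patch; the helper and the
INVALID_PHYS_ADDR error value are my assumptions, not the actual code):

/* Hypothetical allocator for the split path under stop_machine() */
static phys_addr_t split_pgtable_alloc_prealloc(void)
{
	struct ptdesc *ptdesc;

	/* no GFP_ATOMIC here: not allowed in stop_machine() on PREEMPT_RT */
	if (WARN_ON(split_pgtables_idx >= split_pgtables_count))
		return INVALID_PHYS_ADDR;

	ptdesc = split_pgtables[split_pgtables_idx++];
	return page_to_phys(ptdesc_page(ptdesc));
}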
>
> >
> > > > In this case, it's useful to rewalk the page table over the
> > > > specific range to get the number of block mappings.
> > > >
> > > > > > + split_pgtables_idx = 0;
> > > > > > + split_pgtables_count = 0;
> > > > > > +
> > > > > > + ret = walk_kernel_page_table_range_lockless(lstart, kstart,
> > > > > > + &collect_to_split_ops,
> > > > > > + NULL, NULL);
> > > > > > + if (!ret)
> > > > > > + ret = walk_kernel_page_table_range_lockless(kend, lend,
> > > > > > + &collect_to_split_ops,
> > > > > > + NULL, NULL);
> > > > > > + if (ret || !split_pgtables_count)
> > > > > > + goto error;
> > > > > > +
> > > > > > + ret = -ENOMEM;
> > > > > > +
> > > > > > + split_pgtables = kvmalloc(split_pgtables_count * sizeof(struct ptdesc *),
> > > > > > + GFP_KERNEL | __GFP_ZERO);
> > > > > > + if (!split_pgtables)
> > > > > > + goto error;
> > > > > > +
> > > > > > + for (i = 0; i < split_pgtables_count; i++) {
> > > > > > + /* The page table will be filled during splitting, so zeroing it is unnecessary. */
> > > > > > + split_pgtables[i] = pagetable_alloc(GFP_PGTABLE_KERNEL & ~__GFP_ZERO, 0);
> > > > > > + if (!split_pgtables[i])
> > > > > > + goto error;
> > > > > This looks potentially expensive on the boot path and only gets worse as
> > > > > the amount of memory grows. Maybe we should predicate this preallocation
> > > > > on preempt-rt?
> > > > Agreed. Then I'll apply pre-allocation for PREEMPT_RT only.
> > > I guess I'm missing something obvious but I don't understand the problem here...
> > > We are only deferring the allocation of all these pgtables, so the cost is
> > > neutral surely? Had we correctly guessed that the system doesn't support BBML2
> > > earlier, we would have had to allocate all these pgtables earlier.
> > >
> > > Another way to look at it is that we are still allocating the same number of
> > > pgtables in the existing fallback path, it's just that we are doing it inside
> > > the stop_machine().
> > >
> > > My vote would be _not_ to have a separate path for PREEMPT_RT, which will end up
> > > with significantly less testing...
> > IIUC, Will's point is the additional memory allocation for
> > "split_pgtables", where the pre-allocated page tables are saved.
> > As memory increases, this size would definitely increase the cost.
> >
> > And this cost need not burden !PREEMPT_RT, since
> > it can allocate memory in stop_machine() with GFP_ATOMIC.
> >
> > But I also agree that if the cost isn't that large,
> > it's convincing as well. Additionally, as I mentioned in another
> > thread, it would be good not to give the illusion that GFP_ATOMIC
> > is fine everywhere, even on PREEMPT_RT.
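For a rough sense of scale (my own estimate, not a figure from the
thread): with 4K pages, 1TB of linear map mapped entirely with 2MB PMD
blocks is 1TB / 2MB = 512K blocks. The split_pgtables pointer array is
then 512K * 8 bytes = 4MB, while the preallocated PTE tables themselves
are 512K * 4KB = 2GB. So the array bookkeeping is small next to the
tables, and the tables are pages the fallback path would have to
allocate anyway, just inside stop_machine().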
> >
> > --
> > Sincerely,
> > Yeoreum Yun
>
--
Sincerely,
Yeoreum Yun