Message-ID: <2a18acfc-7de5-4ff8-bcce-14a3212cef75@os.amperecomputing.com>
Date: Tue, 20 Jan 2026 16:43:18 -0800
From: Yang Shi <yang@...amperecomputing.com>
To: Yeoreum Yun <yeoreum.yun@....com>
Cc: Ryan Roberts <ryan.roberts@....com>, Will Deacon <will@...nel.org>,
 linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
 linux-rt-devel@...ts.linux.dev, catalin.marinas@....com,
 akpm@...ux-foundation.org, david@...nel.org, kevin.brodsky@....com,
 quic_zhenhuah@...cinc.com, dev.jain@....com, chaitanyas.prakash@....com,
 bigeasy@...utronix.de, clrkwllms@...nel.org, rostedt@...dmis.org,
 lorenzo.stoakes@...cle.com, ardb@...nel.org, jackmanb@...gle.com,
 vbabka@...e.cz, mhocko@...e.com
Subject: Re: [PATCH v5 2/3] arm64: mmu: avoid allocating pages while splitting
 the linear mapping



On 1/20/26 3:01 PM, Yeoreum Yun wrote:
> Hi Yang,
>>
>> On 1/20/26 1:29 AM, Yeoreum Yun wrote:
>>> Hi Ryan
>>>> On 19/01/2026 21:24, Yeoreum Yun wrote:
>>>>> Hi Will,
>>>>>
>>>>>> On Mon, Jan 05, 2026 at 08:23:27PM +0000, Yeoreum Yun wrote:
>>>>>>> +static int __init linear_map_prealloc_split_pgtables(void)
>>>>>>> +{
>>>>>>> +	int ret, i;
>>>>>>> +	unsigned long lstart = _PAGE_OFFSET(vabits_actual);
>>>>>>> +	unsigned long lend = PAGE_END;
>>>>>>> +	unsigned long kstart = (unsigned long)lm_alias(_stext);
>>>>>>> +	unsigned long kend = (unsigned long)lm_alias(__init_begin);
>>>>>>> +
>>>>>>> +	const struct mm_walk_ops collect_to_split_ops = {
>>>>>>> +		.pud_entry	= collect_to_split_pud_entry,
>>>>>>> +		.pmd_entry	= collect_to_split_pmd_entry
>>>>>>> +	};
>>>>>> Why do we need to rewalk the page-table here instead of collating the
>>>>>> number of block mappings we put down when creating the linear map in
>>>>>> the first place?
>>>> That's a good point; perhaps we can reuse the counters that this series introduces?
>>>>
>>>> https://lore.kernel.org/all/20260107002944.2940963-1-yang@os.amperecomputing.com/
>> Yeah, good point. It seems feasible to me. That patch counts how many
>> PUD/CONT_PMD/PMD mappings there are, so we can calculate how many page
>> table pages need to be allocated from those counters.
>>
>>>>> First, the linear alias of [_text, __init_begin) is not a target for
>>>>> the split, and it also seems strange to me to add code inside
>>>>> alloc_init_XXX() that both checks an address range and counts block
>>>>> mappings.
>> IIUC, it should not be that hard to exclude kernel mappings. We know
>> kernel_start and kernel_end, so you should be able to maintain a separate
>> set of counters for the kernel range, then subtract them when calculating
>> how many page table pages need to be allocated.
> As you said, this is not difficult. However, what I meant was that
> this collection would be done in alloc_init_XXX(), and in that case,
> collecting the number of block mappings for the range
> [kernel_start, kernel_end) and adding conditional logic in
> alloc_init_XXX() seems a bit odd.
> That said, for potential future use cases involving splitting specific ranges,
> I don’t think having this kind of collection is necessarily a bad idea.

I'm not sure whether we are on the same page. IIUC the point is that 
collecting the counts of PUD/CONT_PMD/PMD entries by re-walking the page 
table is suboptimal and unnecessary for this use case (repainting the 
linear mapping). We can simply record the counts at linear mapping 
creation time.

I don't mean it is a bad idea for your future projects if it is necessary.

Thanks,
Yang

>
>>>>> Second, for a future feature, I hope to add some code that splits a
>>>>> "specific" area, e.g. to set a specific pkey for a specific region.
>>>> Could you give more detail on this? My working assumption is that either the
>>>> system supports BBML2 or it doesn't. If it doesn't, we need to split the whole
>>>> linear map. If it does, we already have logic to split parts of the linear map
>>>> when needed.
>>> This is not for the linear mapping case, but for the "kernel text area".
>>> As a draft, I want to mark some kernel code as executable by both the
>>> kernel and eBPF programs.
>>> (I'm trying to prevent eBPF programs from executing kernel code directly,
>>> using the POE feature.)
>>> For this "executable area" shared by the kernel and eBPF programs
>>> -- a typical example is the exception entry -- we need to split that
>>> specific range and mark it with a special POE index.
>> IIUC, you want to change POE attributes for some kernel areas (mainly in
>> the vmalloc address space). It sounds like you could do something like
>> set_memory_rox(), but splitting the vmalloc address mapping instead of
>> the linear mapping. Or do you need to preallocate page table pages in
>> this case as well? Anyway, we can have a simpler way to count block
>> mappings for splitting the linear mapping; re-walking the page table
>> again seems unnecessary IMHO.
> As I said, it is not only the vmalloc address mapping but also the
> "kimage" mapping.
> In that case, the mapping needs to be split to set the specific code
> area to a specific POE index.
>
> The page preallocation is for splitting via stop_machine(), since page
> table allocation with GFP_ATOMIC cannot be done inside stop_machine()
> in the PREEMPT_RT case.
>
> Also, splitting the text area to set a specific POE index would be done
> via stop_machine(), so the collection is required.
>
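
For illustration, the consumption side inside stop_machine() could then 
be a simple pop from the preallocated array, roughly (an untested 
sketch; split_pgtable_pop() is a made-up name, while split_pgtables and 
split_pgtables_idx are from your patch):

/* Runs inside stop_machine(): only consumes, never allocates. */
static phys_addr_t split_pgtable_pop(void)
{
	struct ptdesc *ptdesc;

	if (WARN_ON(split_pgtables_idx >= split_pgtables_count))
		return 0;

	ptdesc = split_pgtables[split_pgtables_idx++];
	return page_to_phys(ptdesc_page(ptdesc));
}

That way there is no GFP_ATOMIC allocation in the stop_machine() 
callback, which is the part that is not safe under PREEMPT_RT.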
>>>>> In this case, it's useful to rewalk the page table over the specific
>>>>> range to get the number of block mappings.
>>>>>
>>>>>>> +	split_pgtables_idx = 0;
>>>>>>> +	split_pgtables_count = 0;
>>>>>>> +
>>>>>>> +	ret = walk_kernel_page_table_range_lockless(lstart, kstart,
>>>>>>> +						    &collect_to_split_ops,
>>>>>>> +						    NULL, NULL);
>>>>>>> +	if (!ret)
>>>>>>> +		ret = walk_kernel_page_table_range_lockless(kend, lend,
>>>>>>> +							    &collect_to_split_ops,
>>>>>>> +							    NULL, NULL);
>>>>>>> +	if (ret || !split_pgtables_count)
>>>>>>> +		goto error;
>>>>>>> +
>>>>>>> +	ret = -ENOMEM;
>>>>>>> +
>>>>>>> +	split_pgtables = kvmalloc(split_pgtables_count * sizeof(struct ptdesc *),
>>>>>>> +				  GFP_KERNEL | __GFP_ZERO);
>>>>>>> +	if (!split_pgtables)
>>>>>>> +		goto error;
>>>>>>> +
>>>>>>> +	for (i = 0; i < split_pgtables_count; i++) {
>>>>>>> +		/* The page table will be filled during splitting, so zeroing it is unnecessary. */
>>>>>>> +		split_pgtables[i] = pagetable_alloc(GFP_PGTABLE_KERNEL & ~__GFP_ZERO, 0);
>>>>>>> +		if (!split_pgtables[i])
>>>>>>> +			goto error;
>>>>>> This looks potentially expensive on the boot path and only gets worse as
>>>>>> the amount of memory grows. Maybe we should predicate this preallocation
>>>>>> on preempt-rt?
>>>>> Agreed. Then I'll apply the preallocation for PREEMPT_RT only.
>>>> I guess I'm missing something obvious but I don't understand the problem here...
>>>> We are only deferring the allocation of all these pgtables, so the cost is
>>>> neutral surely? Had we correctly guessed that the system doesn't support BBML2
>>>> earlier, we would have had to allocate all these pgtables earlier.
>>>>
>>>> Another way to look at it is that we are still allocating the same number of
>>>> pgtables in the existing fallback path, it's just that we are doing it inside
>>>> the stop_machine().
>>>>
>>>> My vote would be _not_ to have a separate path for PREEMPT_RT, which will end up
>>>> with significantly less testing...
>>> IIUC, Will's point is the additional memory allocation for
>>> "split_pgtables", where the preallocated page tables are saved.
>>> As memory grows, this size would definitely increase the cost.
>>>
>>> And this cost need not burden !PREEMPT_RT, since it can allocate
>>> memory in stop_machine() with GFP_ATOMIC.
>>>
>>> But I also agree that if the cost is not that large, a single path is
>>> convincing too. Additionally, as I mentioned in another thread, it
>>> would be good not to give the false impression that GFP_ATOMIC is fine
>>> everywhere, even under PREEMPT_RT.
>>>
>>> --
>>> Sincerely,
>>> Yeoreum Yun
> --
> Sincerely,
> Yeoreum Yun

