Message-ID: <0bb112c7-1ed0-4ee1-a1df-6a7d4b224fb6@os.amperecomputing.com>
Date: Wed, 12 Nov 2025 14:24:09 -0800
From: Yang Shi <yang@...amperecomputing.com>
To: Ryan Roberts <ryan.roberts@....com>, cl@...two.org,
catalin.marinas@....com, will@...nel.org
Cc: linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org
Subject: Re: [v2 PATCH] arm64: mm: show direct mapping use in /proc/meminfo
On 11/12/25 2:16 AM, Ryan Roberts wrote:
> Hi Yang,
>
>
> On 23/10/2025 22:52, Yang Shi wrote:
>> Since commit a166563e7ec3 ("arm64: mm: support large block mapping when
>> rodata=full"), the direct mapping may be split on some machines instead
>> of staying static since boot. This makes it more useful than before to
>> show the direct mapping use in /proc/meminfo.
>> This patch makes /proc/meminfo show the direct mapping use as below
>> (4K base page size):
>> DirectMap4K: 94792 kB
>> DirectMap64K: 134208 kB
>> DirectMap2M: 1173504 kB
>> DirectMap32M: 5636096 kB
>> DirectMap1G: 529530880 kB
> I have a long-term aspiration to enable "per-process page size", where each user
> space process can use a different page size. The first step is to be able to
> emulate a page size to the process which is larger than the kernel's. For that
> reason, I really dislike introducing new ABI that exposes the geometry of the
> kernel page tables to user space. I'd really like to be clear on what use case
> benefits from this sort of information before we add it.
Thanks for the information. I'm not sure what "per-process page size"
exactly is. But isn't it just a user space thing? I have a hard time
understanding how exposing the kernel page table geometry would impact it.
The direct map use information is quite useful for tracking direct map
fragmentation, which may have a negative impact on performance, and it
helps diagnose and debug such issues quickly.
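For example, a monitoring tool only needs to scan a handful of lines from
/proc/meminfo to get a simple fragmentation indicator. A rough userspace
sketch, just for illustration (assuming the 4K-page field names shown in
the changelog; not part of the patch):

#include <stdio.h>
#include <string.h>

/*
 * Sum the DirectMap* counters from /proc/meminfo. The share of memory
 * still mapped with the smallest granule is a simple fragmentation
 * indicator.
 */
int main(void)
{
	char line[256];
	unsigned long kb, smallest = 0, total = 0;
	FILE *f = fopen("/proc/meminfo", "r");

	if (!f)
		return 1;
	while (fgets(line, sizeof(line), f)) {
		if (strncmp(line, "DirectMap", 9))
			continue;
		if (sscanf(line + 9, "%*[^:]: %lu", &kb) != 1)
			continue;
		total += kb;
		if (!strncmp(line + 9, "4K:", 3))
			smallest += kb;
	}
	fclose(f);
	if (total)
		printf("DirectMap4K share: %lu%%\n", smallest * 100 / total);
	return 0;
}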
>
> nit: arm64 tends to use the term "linear map" not "direct map". I'm not sure why
> or what the history is. Given this is arch-specific should we be aligning on the
> architecture's terminology here? I don't know...
I actually didn't notice that. They are basically interchangeable; I just
tried to keep consistency with other architectures, for example x86. Users
may have arm64 and x86 machines deployed at the same time, and they would
prefer as little churn as possible when maintaining multiple architectures.
>
>> Although only machines which support BBML2_NOABORT can split the
>> direct mapping, show it on all machines regardless of BBML2_NOABORT so
>> that users have a consistent view and to avoid confusion.
>>
>> Although ptdump can also tell the direct map use, it needs to dump
>> the whole kernel page table, which is costly and overkill. It is also
>> in debugfs, which may not be enabled by all distros. So showing direct
>> map use in /proc/meminfo seems more convenient and has less overhead.
>>
>> Signed-off-by: Yang Shi <yang@...amperecomputing.com>
>> ---
>> arch/arm64/mm/mmu.c | 86 +++++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 86 insertions(+)
>>
>> v2: * Counted in size instead of the number of entries per Ryan
>> * Removed shift array per Ryan
>> * Use lower case "k" per Ryan
>> * Fixed a couple of build warnings reported by kernel test robot
>> * Fixed a couple of potential miscounts
>>
>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>> index b8d37eb037fc..7207b55d0046 100644
>> --- a/arch/arm64/mm/mmu.c
>> +++ b/arch/arm64/mm/mmu.c
>> @@ -29,6 +29,7 @@
>> #include <linux/mm_inline.h>
>> #include <linux/pagewalk.h>
>> #include <linux/stop_machine.h>
>> +#include <linux/proc_fs.h>
>>
>> #include <asm/barrier.h>
>> #include <asm/cputype.h>
>> @@ -51,6 +52,17 @@
>>
>> DEFINE_STATIC_KEY_FALSE(arm64_ptdump_lock_key);
>>
>> +enum direct_map_type {
>> + PTE,
>> + CONT_PTE,
>> + PMD,
>> + CONT_PMD,
>> + PUD,
>> + NR_DIRECT_MAP_TYPE,
>> +};
>> +
>> +static unsigned long direct_map_size[NR_DIRECT_MAP_TYPE];
> I wonder if you should wrap all the adds and subtracts into a helper function,
> which can then be defined as a nop when !CONFIG_PROC_FS. It means we only need
> direct_map_size[] when PROC_FS is enabled too.
>
> e.g.
>
> #ifdef CONFIG_PROC_FS
> static unsigned long direct_map_size[NR_DIRECT_MAP_TYPE];
>
> static inline void direct_map_meminfo_add(unsigned long size,
> enum direct_map_type type)
> {
> direct_map_size[type] += size;
> }
>
> static inline void direct_map_meminfo_sub(unsigned long size,
> enum direct_map_type type)
> {
> direct_map_size[type] -= size;
> }
> #else
> static inline void direct_map_meminfo_add(unsigned long size,
> enum direct_map_type type) {}
> static inline void direct_map_meminfo_sub(unsigned long size,
> enum direct_map_type type) {}
> #endif
>
> Then use it like this:
> direct_map_meminfo_sub(next - addr, PMD);
> direct_map_meminfo_add(next - addr, to_cont ? CONT_PTE : PTE);
Thanks for the suggestion. It seems good, and it should also make solving
the over-accounting problem mentioned below easier.
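Just to double-check my understanding, split_contpte() would then become
something like this sketch (using your proposed helpers as-is, before any
linear-map filtering):

static void split_contpte(pte_t *ptep)
{
	int i;

	direct_map_meminfo_sub(CONT_PTE_SIZE, CONT_PTE);

	ptep = PTR_ALIGN_DOWN(ptep, sizeof(*ptep) * CONT_PTES);
	for (i = 0; i < CONT_PTES; i++, ptep++)
		__set_pte(ptep, pte_mknoncont(__ptep_get(ptep)));

	direct_map_meminfo_add(CONT_PTE_SIZE, PTE);
}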
>
>> +
>> u64 kimage_voffset __ro_after_init;
>> EXPORT_SYMBOL(kimage_voffset);
>>
>> @@ -171,6 +183,45 @@ static void init_clear_pgtable(void *table)
>> dsb(ishst);
>> }
>>
>> +#ifdef CONFIG_PROC_FS
>> +void arch_report_meminfo(struct seq_file *m)
>> +{
>> + char *size[NR_DIRECT_MAP_TYPE];
>> +
>> +#if defined(CONFIG_ARM64_4K_PAGES)
>> + size[PTE] = "4k";
>> + size[CONT_PTE] = "64k";
>> + size[PMD] = "2M";
>> + size[CONT_PMD] = "32M";
>> + size[PUD] = "1G";
>> +#elif defined(CONFIG_ARM64_16K_PAGES)
>> + size[PTE] = "16k";
>> + size[CONT_PTE] = "2M";
>> + size[PMD] = "32M";
>> + size[CONT_PMD] = "1G";
>> +#elif defined(CONFIG_ARM64_64K_PAGES)
>> + size[PTE] = "64k";
>> + size[CONT_PTE] = "2M";
>> + size[PMD] = "512M";
>> + size[CONT_PMD] = "16G";
>> +#endif
>> +
>> + seq_printf(m, "DirectMap%s: %8lu kB\n",
>> + size[PTE], direct_map_size[PTE] >> 10);
>> + seq_printf(m, "DirectMap%s: %8lu kB\n",
>> + size[CONT_PTE],
>> + direct_map_size[CONT_PTE] >> 10);
>> + seq_printf(m, "DirectMap%s: %8lu kB\n",
>> + size[PMD], direct_map_size[PMD] >> 10);
>> + seq_printf(m, "DirectMap%s: %8lu kB\n",
>> + size[CONT_PMD],
>> + direct_map_size[CONT_PMD] >> 10);
>> + if (pud_sect_supported())
>> + seq_printf(m, "DirectMap%s: %8lu kB\n",
>> + size[PUD], direct_map_size[PUD] >> 10);
>> +}
>> +#endif
>> +
>> static void init_pte(pte_t *ptep, unsigned long addr, unsigned long end,
>> phys_addr_t phys, pgprot_t prot)
>> {
>> @@ -234,6 +285,11 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
>>
>> init_pte(ptep, addr, next, phys, __prot);
>>
>> + if (pgprot_val(__prot) & PTE_CONT)
>> + direct_map_size[CONT_PTE] += next - addr;
>> + else
>> + direct_map_size[PTE] += next - addr;
>> +
>> ptep += pte_index(next) - pte_index(addr);
>> phys += next - addr;
>> } while (addr = next, addr != end);
>> @@ -262,6 +318,17 @@ static void init_pmd(pmd_t *pmdp, unsigned long addr, unsigned long end,
>> (flags & NO_BLOCK_MAPPINGS) == 0) {
>> pmd_set_huge(pmdp, phys, prot);
>>
>> + /*
>> + * It is possible to have mappings allow cont mapping
>> + * but disallow block mapping. For example,
>> + * map_entry_trampoline().
>> + * So we have to increase CONT_PMD and PMD size here
>> + * to avoid double counting.
>> + */
>> + if (pgprot_val(prot) & PTE_CONT)
>> + direct_map_size[CONT_PMD] += next - addr;
>> + else
>> + direct_map_size[PMD] += next - addr;
>> /*
>> * After the PMD entry has been populated once, we
>> * only allow updates to the permission attributes.
>> @@ -368,6 +435,7 @@ static void alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,
>> (flags & NO_BLOCK_MAPPINGS) == 0) {
>> pud_set_huge(pudp, phys, prot);
>>
>> + direct_map_size[PUD] += next - addr;
> I think this (and all the lower levels) are likely over-accounting. For example,
> __kpti_install_ng_mappings() and map_entry_trampoline() reuse the infra to
> create separate pgtables. Then you have fixmap, which uses
> create_mapping_noalloc(), efi which uses create_pgd_mapping() and
> update_mapping_prot() used to change permissions for various parts of the kernel
> image. They all reuse the infra too.
Yes, thanks for catching this.
>
>> /*
>> * After the PUD entry has been populated once, we
>> * only allow updates to the permission attributes.
>> @@ -532,9 +600,13 @@ static void split_contpte(pte_t *ptep)
>> {
>> int i;
>>
>> + direct_map_size[CONT_PTE] -= CONT_PTE_SIZE;
>> +
>> ptep = PTR_ALIGN_DOWN(ptep, sizeof(*ptep) * CONT_PTES);
>> for (i = 0; i < CONT_PTES; i++, ptep++)
>> __set_pte(ptep, pte_mknoncont(__ptep_get(ptep)));
>> +
>> + direct_map_size[PTE] += CONT_PTE_SIZE;
> Similar issue: we aspire to reuse this split_* infra for regions other than the
> linear map - e.g. vmalloc. So I don't like the idea of baking in an assumption
> that any split is definitely targeting the linear map.
Yeah, this needs to tell whether it is splitting the linear map or not.
>
> I guess if you pass the start and end VA to the add/subtract function, it could
> filter based on whether the region is within the linear map region?
I think it could. It seems OK for kpti, tramp and efi too, because their
virtual addresses are not in the linear map range IIUC. And it should be
able to exclude update_mapping_prot() as well, because update_mapping_prot()
is just called on the kernel text and data segments, whose virtual addresses
are not in the linear map range either.
And it seems using the start address alone is good enough? I don't think
the kernel installs a page table mapping that crosses virtual address space
areas. So the add/sub ops would look like:

static inline void direct_map_meminfo_add(unsigned long start,
					  unsigned long size,
					  enum direct_map_type type)
{
	if (is_linear_map_addr(start))
		direct_map_size[type] += size;
}
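And a fuller sketch of the pair, combining this with the CONFIG_PROC_FS
wrapper you suggested above (I'm assuming the existing __is_lm_address()
helper is suitable as the is_linear_map_addr() check, but I still need to
confirm that):

#ifdef CONFIG_PROC_FS
static unsigned long direct_map_size[NR_DIRECT_MAP_TYPE];

static inline void direct_map_meminfo_add(unsigned long start,
					  unsigned long size,
					  enum direct_map_type type)
{
	/* Only account for mappings within the linear map region */
	if (__is_lm_address(start))
		direct_map_size[type] += size;
}

static inline void direct_map_meminfo_sub(unsigned long start,
					  unsigned long size,
					  enum direct_map_type type)
{
	if (__is_lm_address(start))
		direct_map_size[type] -= size;
}
#else
static inline void direct_map_meminfo_add(unsigned long start,
					  unsigned long size,
					  enum direct_map_type type) {}
static inline void direct_map_meminfo_sub(unsigned long start,
					  unsigned long size,
					  enum direct_map_type type) {}
#endif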
>
> Overall, I'm personally not a huge fan of adding this capability. I'd need to
> understand the use case to change my mind. But I'm not the maintainer so perhaps
> my opinion isn't all that important ;-)
Understood. I think this is quite helpful IMHO :-) Thanks for the
valuable input.
Thanks,
Yang
>
> Thanks,
> Ryan
>
>> }
>>
>> static int split_pmd(pmd_t *pmdp, pmd_t pmd, gfp_t gfp, bool to_cont)
>> @@ -559,8 +631,13 @@ static int split_pmd(pmd_t *pmdp, pmd_t pmd, gfp_t gfp, bool to_cont)
>> if (to_cont)
>> prot = __pgprot(pgprot_val(prot) | PTE_CONT);
>>
>> + direct_map_size[PMD] -= PMD_SIZE;
>> for (i = 0; i < PTRS_PER_PTE; i++, ptep++, pfn++)
>> __set_pte(ptep, pfn_pte(pfn, prot));
>> + if (to_cont)
>> + direct_map_size[CONT_PTE] += PMD_SIZE;
>> + else
>> + direct_map_size[PTE] += PMD_SIZE;
>>
>> /*
>> * Ensure the pte entries are visible to the table walker by the time
>> @@ -576,9 +653,13 @@ static void split_contpmd(pmd_t *pmdp)
>> {
>> int i;
>>
>> + direct_map_size[CONT_PMD] -= CONT_PMD_SIZE;
>> +
>> pmdp = PTR_ALIGN_DOWN(pmdp, sizeof(*pmdp) * CONT_PMDS);
>> for (i = 0; i < CONT_PMDS; i++, pmdp++)
>> set_pmd(pmdp, pmd_mknoncont(pmdp_get(pmdp)));
>> +
>> + direct_map_size[PMD] += CONT_PMD_SIZE;
>> }
>>
>> static int split_pud(pud_t *pudp, pud_t pud, gfp_t gfp, bool to_cont)
>> @@ -604,8 +685,13 @@ static int split_pud(pud_t *pudp, pud_t pud, gfp_t gfp, bool to_cont)
>> if (to_cont)
>> prot = __pgprot(pgprot_val(prot) | PTE_CONT);
>>
>> + direct_map_size[PUD] -= PUD_SIZE;
>> for (i = 0; i < PTRS_PER_PMD; i++, pmdp++, pfn += step)
>> set_pmd(pmdp, pfn_pmd(pfn, prot));
>> + if (to_cont)
>> + direct_map_size[CONT_PMD] += PUD_SIZE;
>> + else
>> + direct_map_size[PMD] += PUD_SIZE;
>>
>> /*
>> * Ensure the pmd entries are visible to the table walker by the time