lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <97d01dfe-df9e-4780-8069-b6dfb0e8323d@arm.com>
Date: Wed, 15 Oct 2025 16:37:57 +0100
From: Ryan Roberts <ryan.roberts@....com>
To: Yang Shi <yang@...amperecomputing.com>, cl@...two.org,
 catalin.marinas@....com, will@...nel.org
Cc: linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] arm64: mm: show direct mapping use in /proc/meminfo

On 14/10/2025 21:49, Yang Shi wrote:
> 
> 
> On 10/14/25 1:43 AM, Ryan Roberts wrote:
>> On 14/10/2025 00:51, Yang Shi wrote:
>>> Since commit a166563e7ec3 ("arm64: mm: support large block mapping when
>>> rodata=full"), the direct mapping may be split on some machines instead
>>> of staying static since boot. It makes more sense to show the direct mapping
>>> use in /proc/meminfo than before.
>> It's possible to divine this information from
>> /sys/kernel/debug/kernel_page_tables. Dev even has a script to post-process that
>> to provide summary stats on how much memory is mapped by each block size.
>>
>> Admittedly, this file is in debugfs and is not usually enabled for production
>> builds, but Dev's patch Commit fa93b45fd397 ("arm64: Enable vmalloc-huge with
>> ptdump"), merged for v6.18-rc1 gives us a path to enabling by default I think?
> 
> First of all, it is in debugfs, not all distros actually enable debugfs. IIRC,
> Android has debugfs disabled.
> 
> Secondly, dumping the kernel page table is quite time consuming and costly. If the
> users just want to know the direct mapping use, it is overkill.
> 
>>
>> But I can see the benefits of having this easily available, and I notice that
>> other arches are already doing this. So I guess it makes sense to be consistent.
> 
> Yeah, some other architectures show this too.
> 
>>
>>> This patch will make /proc/meminfo show the direct mapping use like the
>>> below (4K base page size):
>>> DirectMap4K:       94792 kB
>>> DirectMap64K:      134208 kB
>>> DirectMap2M:     1173504 kB
>>> DirectMap32M:     5636096 kB
>>> DirectMap1G:    529530880 kB
>> nit: x86, s390 and powerpc use a lower case "k" for DirectMap4k; perhaps we
>> should follow suit? (despite being ugly). Although I think that's usually used
>> to denote base-10 (i.e. 1000 bytes)? That's not the case here, we want base-2.
> 
> We should follow the lower case "k" so that the script used on them can work on
> arm64 too.

Thinking about this some more, I really don't love the idea of encoding the
page/block sizes in the label name. This is yet another source of potential
compat issues when running SW with 4K/16K/64K page size. Could we consider using
PTE/CONTPTE/PMD/CONTPMD/PUD instead?

> 
> Do you mean kB should be base-2?

No I meant that kB (lower case k) often means 1000x and KiB (upper case K) often
means 1024x. So having a lower case k here is technically wrong since we are
talking about 4x1024 bytes, not 4x1000 bytes.

But it's a moot point - if we go for these label names, we should use the lower
case k to be compatible.

> 
>>
>>> Although just the machines which support BBML2_NOABORT can split the
>>> direct mapping, show it on all machines regardless of BBML2_NOABORT so
>>> that the users have consistent view in order to avoid confusion.
>>>
>>> Signed-off-by: Yang Shi <yang@...amperecomputing.com>
>>> ---
>>>   arch/arm64/mm/mmu.c | 93 ++++++++++++++++++++++++++++++++++++++++++++-
>>>   1 file changed, 91 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>>> index b8d37eb037fc..e5da0f521e58 100644
>>> --- a/arch/arm64/mm/mmu.c
>>> +++ b/arch/arm64/mm/mmu.c
>>> @@ -29,6 +29,7 @@
>>>   #include <linux/mm_inline.h>
>>>   #include <linux/pagewalk.h>
>>>   #include <linux/stop_machine.h>
>>> +#include <linux/proc_fs.h>
>>>     #include <asm/barrier.h>
>>>   #include <asm/cputype.h>
>>> @@ -51,6 +52,17 @@
>>>     DEFINE_STATIC_KEY_FALSE(arm64_ptdump_lock_key);
>>>   +enum direct_map_type {
>>> +    PTE,
>>> +    CONT_PTE,
>>> +    PMD,
>>> +    CONT_PMD,
>>> +    PUD,
>>> +    NR_DIRECT_MAP_TYPE,
>>> +};
>>> +
>>> +unsigned long direct_map_cnt[NR_DIRECT_MAP_TYPE];
>> I wonder if it would be cleaner to store this in bytes rather than blocks? Then
>> in the code you can just add PAGE_SIZE, CONT_PTE_SIZE, PMD_SIZE, etc as
>> appropriate, then the reporting function becomes simpler; everything is just
>> shifted by 10 to get kB; no need for the shift array.
> 
> Yeah, good idea.
> 
>>
>>> +
>>>   u64 kimage_voffset __ro_after_init;
>>>   EXPORT_SYMBOL(kimage_voffset);
>>>   @@ -171,6 +183,60 @@ static void init_clear_pgtable(void *table)
>>>       dsb(ishst);
>>>   }
>>>   +void arch_report_meminfo(struct seq_file *m)
>>> +{
>>> +    char *size[NR_DIRECT_MAP_TYPE];
>>> +    unsigned int shift[NR_DIRECT_MAP_TYPE];
>>> +
>>> +#if defined(CONFIG_ARM64_4K_PAGES)
>>> +    size[PTE] = "4K";
>>> +    size[CONT_PTE] = "64K";
>>> +    size[PMD] = "2M";
>>> +    size[CONT_PMD] = "32M";
>>> +    size[PUD] = "1G";
>>> +
>>> +    shift[PTE] = 2;
>>> +    shift[CONT_PTE] = 6;
>>> +    shift[PMD] = 11;
>>> +    shift[CONT_PMD] = 15;
>>> +    shift[PUD] = 20;
>>> +#elif defined(CONFIG_ARM64_16K_PAGES)
>>> +    size[PTE] = "16K";
>>> +    size[CONT_PTE] = "2M";
>>> +    size[PMD] = "32M";
>>> +    size[CONT_PMD] = "1G";
>>> +
>>> +    shift[PTE] = 4;
>>> +    shift[CONT_PTE] = 11;
>>> +    shift[PMD] = 15;
>>> +    shift[CONT_PMD] = 20;
>>> +#elif defined(CONFIG_ARM64_64K_PAGES)
>>> +    size[PTE] = "64K";
>>> +    size[CONT_PTE] = "2M";
>>> +    size[PMD] = "512M";
>>> +    size[CONT_PMD] = "16G";
>>> +
>>> +    shift[PTE] = 6;
>>> +    shift[CONT_PTE] = 11;
>>> +    shift[PMD] = 19;
>>> +    shift[CONT_PMD] = 24;
>>> +#endif
>> The ifdeffery is quite ugly. I think we can get rid of the shift array as per
>> above. I was hoping there might be a kernel function that we could pass
>> PAGE_SIZE, PMD_SIZE, etc to and it would give us an appropriate string but I
>> can't find anything. I guess keeping the ifdef for size[] is the most pragmatic.
> 
> Yes, I agree this is the most pragmatic way.
> 
>>
>>> +
>>> +    seq_printf(m, "DirectMap%s:    %8lu kB\n",
>>> +            size[PTE], direct_map_cnt[PTE] << shift[PTE]);
>>> +    seq_printf(m, "DirectMap%s:    %8lu kB\n",
>>> +            size[CONT_PTE],
>>> +            direct_map_cnt[CONT_PTE] << shift[CONT_PTE]);
>>> +    seq_printf(m, "DirectMap%s:    %8lu kB\n",
>>> +            size[PMD], direct_map_cnt[PMD] << shift[PMD]);
>>> +    seq_printf(m, "DirectMap%s:    %8lu kB\n",
>>> +            size[CONT_PMD],
>>> +            direct_map_cnt[CONT_PMD] << shift[CONT_PMD]);
>>> +    if (pud_sect_supported())
>>> +        seq_printf(m, "DirectMap%s:    %8lu kB\n",
>>> +            size[PUD], direct_map_cnt[PUD] << shift[PUD]);
>>> +}
>>> +
>>>   static void init_pte(pte_t *ptep, unsigned long addr, unsigned long end,
>>>                phys_addr_t phys, pgprot_t prot)
>>>   {
>>> @@ -183,6 +249,9 @@ static void init_pte(pte_t *ptep, unsigned long addr,
>>> unsigned long end,
>>>            */
>>>           __set_pte_nosync(ptep, pfn_pte(__phys_to_pfn(phys), prot));
>>>   +        if (!(pgprot_val(prot) & PTE_CONT))
>>> +            direct_map_cnt[PTE]++;
>> If adding size in bytes, you could just always add PAGE_SIZE here, but select
>> the bucket based on PTE_CONT?
> 
> You mean:
> 
> if (pgprot_val(prot) & PTE_CONT)
>     direct_map_cnt[PTE_CONT] += PAGE_SIZE;
> else
>     direct_map_cnt[PTE] += PAGE_SIZE;
> 
> I don't think this is efficient for PTE_CONT, because I can just do
> "direct_map_cnt[PTE_CONT] += CONT_PTE_SIZE" once in alloc_init_cont_pte()
> instead of adding PAGE_SIZE CONT_PTES times, right?

Well you could make the same argument for PTEs. you just need to calculate the
size. Then you don't need any code per-pte at all.

Thanks,
Ryan

> 
> This applies to the below comments for PMD/CONT_PMD too.
> 
> Thanks,
> Yang
> 
>>
>>> +
>>>           /*
>>>            * After the PTE entry has been populated once, we
>>>            * only allow updates to the permission attributes.
>>> @@ -229,8 +298,10 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned
>>> long addr,
>>>             /* use a contiguous mapping if the range is suitably aligned */
>>>           if ((((addr | next | phys) & ~CONT_PTE_MASK) == 0) &&
>>> -            (flags & NO_CONT_MAPPINGS) == 0)
>>> +            (flags & NO_CONT_MAPPINGS) == 0) {
>>>               __prot = __pgprot(pgprot_val(prot) | PTE_CONT);
>>> +            direct_map_cnt[CONT_PTE]++;
>> Then you don't need this.
>>
>>> +        }
>>>             init_pte(ptep, addr, next, phys, __prot);
>>>   @@ -262,6 +333,9 @@ static void init_pmd(pmd_t *pmdp, unsigned long addr,
>>> unsigned long end,
>>>               (flags & NO_BLOCK_MAPPINGS) == 0) {
>>>               pmd_set_huge(pmdp, phys, prot);
>>>   +            if (!(pgprot_val(prot) & PTE_CONT))
>>> +                direct_map_cnt[PMD]++;
>> Same here...
>>
>>> +
>>>               /*
>>>                * After the PMD entry has been populated once, we
>>>                * only allow updates to the permission attributes.
>>> @@ -317,8 +391,10 @@ static void alloc_init_cont_pmd(pud_t *pudp, unsigned
>>> long addr,
>>>             /* use a contiguous mapping if the range is suitably aligned */
>>>           if ((((addr | next | phys) & ~CONT_PMD_MASK) == 0) &&
>>> -            (flags & NO_CONT_MAPPINGS) == 0)
>>> +            (flags & NO_CONT_MAPPINGS) == 0) {
>>>               __prot = __pgprot(pgprot_val(prot) | PTE_CONT);
>>> +            direct_map_cnt[CONT_PMD]++;
>> Then this can go too.
>>
>>> +        }
>>>             init_pmd(pmdp, addr, next, phys, __prot, pgtable_alloc, flags);
>>>   @@ -368,6 +444,7 @@ static void alloc_init_pud(p4d_t *p4dp, unsigned long
>>> addr, unsigned long end,
>>>               (flags & NO_BLOCK_MAPPINGS) == 0) {
>>>               pud_set_huge(pudp, phys, prot);
>>>   +            direct_map_cnt[PUD]++;
>>>               /*
>>>                * After the PUD entry has been populated once, we
>>>                * only allow updates to the permission attributes.
>>> @@ -532,9 +609,13 @@ static void split_contpte(pte_t *ptep)
>>>   {
>>>       int i;
>>>   +    direct_map_cnt[CONT_PTE]--;
>>> +
>>>       ptep = PTR_ALIGN_DOWN(ptep, sizeof(*ptep) * CONT_PTES);
>>>       for (i = 0; i < CONT_PTES; i++, ptep++)
>>>           __set_pte(ptep, pte_mknoncont(__ptep_get(ptep)));
>>> +
>>> +    direct_map_cnt[PTE] += CONT_PTES;
>>>   }
>>>     static int split_pmd(pmd_t *pmdp, pmd_t pmd, gfp_t gfp, bool to_cont)
>>> @@ -559,8 +640,10 @@ static int split_pmd(pmd_t *pmdp, pmd_t pmd, gfp_t gfp,
>>> bool to_cont)
>>>       if (to_cont)
>>>           prot = __pgprot(pgprot_val(prot) | PTE_CONT);
>>>   +    direct_map_cnt[PMD]--;
>>>       for (i = 0; i < PTRS_PER_PTE; i++, ptep++, pfn++)
>>>           __set_pte(ptep, pfn_pte(pfn, prot));
>>> +    direct_map_cnt[CONT_PTE] += PTRS_PER_PTE / CONT_PTES;
>>>         /*
>>>        * Ensure the pte entries are visible to the table walker by the time
>>> @@ -576,9 +659,13 @@ static void split_contpmd(pmd_t *pmdp)
>>>   {
>>>       int i;
>>>   +    direct_map_cnt[CONT_PMD]--;
>>> +
>>>       pmdp = PTR_ALIGN_DOWN(pmdp, sizeof(*pmdp) * CONT_PMDS);
>>>       for (i = 0; i < CONT_PMDS; i++, pmdp++)
>>>           set_pmd(pmdp, pmd_mknoncont(pmdp_get(pmdp)));
>>> +
>>> +    direct_map_cnt[PMD] += CONT_PMDS;
>>>   }
>>>     static int split_pud(pud_t *pudp, pud_t pud, gfp_t gfp, bool to_cont)
>>> @@ -604,8 +691,10 @@ static int split_pud(pud_t *pudp, pud_t pud, gfp_t gfp,
>>> bool to_cont)
>>>       if (to_cont)
>>>           prot = __pgprot(pgprot_val(prot) | PTE_CONT);
>>>   +    direct_map_cnt[PUD]--;
>>>       for (i = 0; i < PTRS_PER_PMD; i++, pmdp++, pfn += step)
>>>           set_pmd(pmdp, pfn_pmd(pfn, prot));
>>> +    direct_map_cnt[CONT_PMD] += PTRS_PER_PMD/CONT_PMDS;
>>>         /*
>>>        * Ensure the pmd entries are visible to the table walker by the time
> 


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ