Message-ID: <c68d0449-b52a-42b7-aa5c-194af5e531e1@os.amperecomputing.com>
Date: Tue, 20 Jan 2026 16:17:40 -0800
From: Yang Shi <yang@...amperecomputing.com>
To: Will Deacon <will@...nel.org>
Cc: catalin.marinas@....com, ryan.roberts@....com, cl@...two.org,
linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org
Subject: Re: [v5 PATCH] arm64: mm: show direct mapping use in /proc/meminfo
Hi Will,
Gentle ping: do the proposed changes solve your concerns? If they do, I
can send v6; hopefully it can be merged in the coming merge window.
Thanks,
Yang
On 1/13/26 4:36 PM, Yang Shi wrote:
>
>
> On 1/13/26 6:36 AM, Will Deacon wrote:
>> On Tue, Jan 06, 2026 at 04:29:44PM -0800, Yang Shi wrote:
>>> Since commit a166563e7ec3 ("arm64: mm: support large block mapping when
>>> rodata=full"), the direct mapping may be split on some machines instead
>>> of staying static since boot. This makes it more worthwhile than before
>>> to show the direct mapping use in /proc/meminfo.
>>> This patch will make /proc/meminfo show the direct mapping use like the
>>> below (4K base page size):
>>> DirectMap4k: 94792 kB
>>> DirectMap64k: 134208 kB
>>> DirectMap2M: 1173504 kB
>>> DirectMap32M: 5636096 kB
>>> DirectMap1G: 529530880 kB
>>>
>>> Although only machines which support BBML2_NOABORT can split the
>>> direct mapping, show it on all machines regardless of BBML2_NOABORT so
>>> that users have a consistent view and confusion is avoided.
>>>
>>> Although ptdump can also tell the direct map use, it needs to dump the
>>> whole kernel page table, which is costly and overkill. It also lives in
>>> debugfs, which may not be enabled by all distros. So showing direct
>>> map use in /proc/meminfo is more convenient and has less overhead.
>>>
>>> Signed-off-by: Yang Shi <yang@...amperecomputing.com>
>>> ---
>>> v5: * Rebased to v6.19-rc4
>>> * Fixed the build error for !CONFIG_PROC_FS
>>> v4: * Used PAGE_END instead of _PAGE_END(VA_BITS_MIN) per Ryan
>>> * Used shorter name for the helpers and variables per Ryan
>>> * Fixed accounting for memory hotunplug
>>> v3: * Fixed the over-accounting problems per Ryan
>>> * Introduced helpers for add/sub direct map use and #ifdef them with
>>> CONFIG_PROC_FS per Ryan
>>> * v3 is a fix patch on top of v2
>>> v2: * Counted in size instead of the number of entries per Ryan
>>> * Removed shift array per Ryan
>>> * Used lower case "k" per Ryan
>>> * Fixed a couple of build warnings reported by kernel test robot
>>> * Fixed a couple of potential miscounts
>>>
>>> arch/arm64/mm/mmu.c | 202 +++++++++++++++++++++++++++++++++++++++-----
>>> 1 file changed, 181 insertions(+), 21 deletions(-)
>>>
>>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>>> index 8e1d80a7033e..422441c9a992 100644
>>> --- a/arch/arm64/mm/mmu.c
>>> +++ b/arch/arm64/mm/mmu.c
>>> @@ -29,6 +29,7 @@
>>> #include <linux/mm_inline.h>
>>> #include <linux/pagewalk.h>
>>> #include <linux/stop_machine.h>
>>> +#include <linux/proc_fs.h>
>>> #include <asm/barrier.h>
>>> #include <asm/cputype.h>
>>> @@ -171,6 +172,85 @@ static void init_clear_pgtable(void *table)
>>> dsb(ishst);
>>> }
>>> +enum dm_type {
>>> + PTE,
>>> + CONT_PTE,
>>> + PMD,
>>> + CONT_PMD,
>>> + PUD,
>>> + NR_DM_TYPE,
>>> +};
>>> +
>>> +#ifdef CONFIG_PROC_FS
>>> +static unsigned long dm_meminfo[NR_DM_TYPE];
>>> +
>>> +void arch_report_meminfo(struct seq_file *m)
>>> +{
>>> + char *size[NR_DM_TYPE];
>> const?
>
> Yeah, it can be const.
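> I.e. the declaration would become:
>
> 	const char *size[NR_DM_TYPE];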
>
>>
>>> +
>>> +#if defined(CONFIG_ARM64_4K_PAGES)
>>> + size[PTE] = "4k";
>>> + size[CONT_PTE] = "64k";
>>> + size[PMD] = "2M";
>>> + size[CONT_PMD] = "32M";
>>> + size[PUD] = "1G";
>>> +#elif defined(CONFIG_ARM64_16K_PAGES)
>>> + size[PTE] = "16k";
>>> + size[CONT_PTE] = "2M";
>>> + size[PMD] = "32M";
>>> + size[CONT_PMD] = "1G";
>>> +#elif defined(CONFIG_ARM64_64K_PAGES)
>>> + size[PTE] = "64k";
>>> + size[CONT_PTE] = "2M";
>>> + size[PMD] = "512M";
>>> + size[CONT_PMD] = "16G";
>>> +#endif
>>> +
>>> + seq_printf(m, "DirectMap%s: %8lu kB\n",
>>> + size[PTE], dm_meminfo[PTE] >> 10);
>>> + seq_printf(m, "DirectMap%s: %8lu kB\n",
>>> + size[CONT_PTE],
>>> + dm_meminfo[CONT_PTE] >> 10);
>>> + seq_printf(m, "DirectMap%s: %8lu kB\n",
>>> + size[PMD], dm_meminfo[PMD] >> 10);
>>> + seq_printf(m, "DirectMap%s: %8lu kB\n",
>>> + size[CONT_PMD],
>>> + dm_meminfo[CONT_PMD] >> 10);
>>> + if (pud_sect_supported())
>>> + seq_printf(m, "DirectMap%s: %8lu kB\n",
>>> + size[PUD], dm_meminfo[PUD] >> 10);
>> This seems a bit brittle to me. If somebody adds support for L1 block
>> mappings for !4k pages in future, they will forget to update this and
>> we'll end up returning kernel stack data in /proc/meminfo afaict.
>
> I can initialize size[PUD] to "NON_SUPPORT" by default. If that case
> happens, /proc/meminfo will just show "DirectMapNON_SUPPORT", so we will
> notice something is missing, but no kernel stack data will be leaked.
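>
> Something like this (an untested sketch; defaulting every level rather
> than just PUD is my generalization of the idea):
>
> 	const char *size[NR_DM_TYPE];
> 	int i;
>
> 	/* catch any level a future config forgets to fill in */
> 	for (i = 0; i < NR_DM_TYPE; i++)
> 		size[i] = "NON_SUPPORT";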
>
>>
>>> +static inline bool is_dm_addr(unsigned long addr)
>>> +{
>>> + return (addr >= PAGE_OFFSET) && (addr < PAGE_END);
>>> +}
>>> +
>>> +static inline void dm_meminfo_add(unsigned long addr, unsigned long size,
>>> + enum dm_type type)
>>> +{
>>> + if (is_dm_addr(addr))
>>> + dm_meminfo[type] += size;
>>> +}
>>> +
>>> +static inline void dm_meminfo_sub(unsigned long addr, unsigned long size,
>>> + enum dm_type type)
>>> +{
>>> + if (is_dm_addr(addr))
>>> + dm_meminfo[type] -= size;
>>> +}
>>> +#else
>>> +static inline void dm_meminfo_add(unsigned long addr, unsigned long size,
>>> + enum dm_type type)
>>> +{
>>> +}
>>> +
>>> +static inline void dm_meminfo_sub(unsigned long addr, unsigned long size,
>>> + enum dm_type type)
>>> +{
>>> +}
>>> +#endif
>>> +
>>> static void init_pte(pte_t *ptep, unsigned long addr, unsigned long end,
>>> phys_addr_t phys, pgprot_t prot)
>>> {
>>> @@ -236,6 +316,11 @@ static int alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
>>> init_pte(ptep, addr, next, phys, __prot);
>>> + if (pgprot_val(__prot) & PTE_CONT)
>>> + dm_meminfo_add(addr, (next - addr), CONT_PTE);
>>> + else
>>> + dm_meminfo_add(addr, (next - addr), PTE);
>>> +
>>> ptep += pte_index(next) - pte_index(addr);
>>> phys += next - addr;
>>> } while (addr = next, addr != end);
>>> @@ -266,6 +351,17 @@ static int init_pmd(pmd_t *pmdp, unsigned long addr, unsigned long end,
>>> (flags & NO_BLOCK_MAPPINGS) == 0) {
>>> pmd_set_huge(pmdp, phys, prot);
>>> + /*
>>> + * It is possible for a mapping to allow cont mappings but
>>> + * disallow block mappings, for example map_entry_trampoline().
>>> + * So we have to account the CONT_PMD and PMD sizes here to
>>> + * avoid double counting.
>>> + */
>>> + if (pgprot_val(prot) & PTE_CONT)
>>> + dm_meminfo_add(addr, (next - addr), CONT_PMD);
>>> + else
>>> + dm_meminfo_add(addr, (next - addr), PMD);
>> I don't understand the comment you're adding here. If somebody passes
>> NO_BLOCK_MAPPINGS then that also prevents contiguous entries except at
>> level 3.
>
> The comment may be misleading. I meant that if we had the accounting
> code for CONT_PMD in alloc_init_cont_pmd() instead, for example,
>
> @@ -433,6 +433,11 @@ static int alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
> if (ret)
> goto out;
>
> + if (pgprot_val(prot) & PTE_CONT)
> + dm_meminfo_add(addr, (next - addr), CONT_PMD);
>
> pmdp += pmd_index(next) - pmd_index(addr);
> phys += next - addr;
> } while (addr = next, addr != end);
>
> If the described case happens, we would actually miscount CONT_PMD. So
> I need to check whether it is CONT in init_pmd() instead. If the
> comment is confusing, I can just remove it.
>
>> It also doesn't look like you handle the error case properly when the
>> mapping fails.
>
> I don't quite get which failure you mean. pmd_set_huge() doesn't fail.
> Or did you mean hotplug failing? If so, hot unplug, which is called in
> the error handling path, will decrease the counters.
>
>>
>>> -static void unmap_hotplug_pte_range(pmd_t *pmdp, unsigned long addr,
>>> +static void unmap_hotplug_pte_range(pte_t *ptep, unsigned long addr,
>>> unsigned long end, bool free_mapped,
>>> struct vmem_altmap *altmap)
>>> {
>>> - pte_t *ptep, pte;
>>> + pte_t pte;
>>> do {
>>> - ptep = pte_offset_kernel(pmdp, addr);
>>> pte = __ptep_get(ptep);
>>> if (pte_none(pte))
>>> continue;
>>> WARN_ON(!pte_present(pte));
>>> __pte_clear(&init_mm, addr, ptep);
>>> + dm_meminfo_sub(addr, PAGE_SIZE, PTE);
>>> flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
>>> if (free_mapped)
>>> free_hotplug_page_range(pte_page(pte),
>>> PAGE_SIZE, altmap);
>> Is the existing code correct for contiguous entries here? I'd have
>> thought that we'd need to make the range non-contiguous before knocking
>> out the TLB.
>
> Thanks for pointing this out. I didn't pay too much attention to such
> details of the existing code. Actually, I did notice the hot unplug
> code doesn't handle contiguous mappings, so I added
> unmap_hotplug_cont_{pmd|pte}_range() in this patch in order to
> maintain the counters correctly. I'm not sure whether that is intended
> (or maybe just unnecessary) or an oversight.
>
> Are you concerned this may result in a misprogramming issue? TBH, I'm
> a little bit confused by the "misprogramming of the Contiguous bit"
> described in the Arm ARM. In this case, the TLB flush should remove
> the large contiguous TLB entry, so there should be no overlapping
> entries in the TLB. Or is it still a problem because some entries have
> the contiguous bit set while some don't? But when we change the
> contiguous bit for a range of entries, there is always a moment when
> some have the contiguous bit set and some don't. My understanding is
> that it is fine as long as there are no overlapping entries in the
> TLB. Anyway, I may be misunderstanding it.
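>
> To make my understanding concrete, here is roughly the ordering I have
> in mind when tearing down one contiguous PTE block (a simplified
> sketch, not the exact patch code; CONT_PTES and CONT_PTE_SIZE are the
> existing arm64 constants, and ptep/addr are as in
> unmap_hotplug_pte_range()):
>
> 	int i;
>
> 	/* clear every entry in the contiguous block first */
> 	for (i = 0; i < CONT_PTES; i++)
> 		__pte_clear(&init_mm, addr + i * PAGE_SIZE, ptep + i);
>
> 	/*
> 	 * A single flush over the whole span removes the large
> 	 * contiguous TLB entry along with any per-page entries,
> 	 * so no overlapping entries should remain.
> 	 */
> 	flush_tlb_kernel_range(addr, addr + CONT_PTE_SIZE);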
>
> Thanks,
> Yang
>
>>
>> Will
>