linux-kernel - Re: [PATCH] arm64: Enable vmalloc-huge with ptdump

Message-ID: <20250530123527.GA30463@willie-the-truck>
Date: Fri, 30 May 2025 13:35:27 +0100
From: Will Deacon <will@...nel.org>
To: Ryan Roberts <ryan.roberts@....com>
Cc: Dev Jain <dev.jain@....com>, catalin.marinas@....com,
	anshuman.khandual@....com, quic_zhenhuah@...cinc.com,
	kevin.brodsky@....com, yangyicong@...ilicon.com, joey.gouly@....com,
	linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
	david@...hat.com
Subject: Re: [PATCH] arm64: Enable vmalloc-huge with ptdump

On Fri, May 30, 2025 at 12:50:40PM +0100, Ryan Roberts wrote:
> On 30/05/2025 10:14, Dev Jain wrote:
> > 
> > On 30/05/25 2:10 pm, Ryan Roberts wrote:
> >> On 30/05/2025 09:20, Dev Jain wrote:
> >>> arm64 disables vmalloc-huge when kernel page table dumping is enabled,
> >>> because an intermediate table may be removed, potentially causing the
> >>> ptdump code to dereference an invalid address. We want to be able to
> >>> analyze block vs page mappings for kernel mappings with ptdump, so to
> >>> enable vmalloc-huge with ptdump, synchronize between page table removal in
> >>> pmd_free_pte_page()/pud_free_pmd_page() and ptdump pagetable walking. We
> >>> use mmap_read_lock and not write lock because we don't need to synchronize
> >>> between two different vm_structs; two vmalloc objects running this same
> >>> code path will point to different page tables, hence there is no race.
> > 
> > My "correction" from race->no problem was incorrect after all :) There is
> > no race either, since the vm_struct object has exclusive access to whatever
> > table it is clearing.
> > 
> >>>
> >>> Signed-off-by: Dev Jain <dev.jain@....com>
> >>> ---
> >>>   arch/arm64/include/asm/vmalloc.h | 6 ++----
> >>>   arch/arm64/mm/mmu.c              | 7 +++++++
> >>>   2 files changed, 9 insertions(+), 4 deletions(-)
> >>>
> >>> diff --git a/arch/arm64/include/asm/vmalloc.h b/arch/arm64/include/asm/vmalloc.h
> >>> index 38fafffe699f..28b7173d8693 100644
> >>> --- a/arch/arm64/include/asm/vmalloc.h
> >>> +++ b/arch/arm64/include/asm/vmalloc.h
> >>> @@ -12,15 +12,13 @@ static inline bool arch_vmap_pud_supported(pgprot_t prot)
> >>>       /*
> >>>        * SW table walks can't handle removal of intermediate entries.
> >>>        */
> >>> -    return pud_sect_supported() &&
> >>> -           !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
> >>> +    return pud_sect_supported();
> >>>   }
> >>>
> >>>   #define arch_vmap_pmd_supported arch_vmap_pmd_supported
> >>>   static inline bool arch_vmap_pmd_supported(pgprot_t prot)
> >>>   {
> >>> -    /* See arch_vmap_pud_supported() */
> >>> -    return !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
> >>> +    return true;
> >>>   }
> >>>
> >>>   #endif
> >>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> >>> index ea6695d53fb9..798cebd9e147 100644
> >>> --- a/arch/arm64/mm/mmu.c
> >>> +++ b/arch/arm64/mm/mmu.c
> >>> @@ -1261,7 +1261,11 @@ int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
> >>>       }
> >>>
> >>>       table = pte_offset_kernel(pmdp, addr);
> >>> +
> >>> +    /* Synchronize against ptdump_walk_pgd() */
> >>> +    mmap_read_lock(&init_mm);
> >>>       pmd_clear(pmdp);
> >>> +    mmap_read_unlock(&init_mm);
> >> So this works because ptdump_walk_pgd() takes the write_lock (which is mutually
> >> exclusive with any read_lock holders) for the duration of the table walk, so it
> >> will either consistently see the pgtables before or after this removal. It will
> >> never disappear during the walk, correct?
> >>
> >> I guess there is a risk of this showing up as contention with other init_mm
> >> write_lock holders. But I expect that pmd_free_pte_page()/pud_free_pmd_page()
> >> are called sufficiently rarely that the risk is very small. Let's fix any perf
> >> problem if/when we see it.
> > 
> > We can avoid all of that by my initial approach - to wrap the lock around
> > CONFIG_PTDUMP_DEBUGFS.
> > I don't have a strong opinion, just putting it out there.
> 
> (I wrote then failed to send earlier):
> 
> It's ugly though. Personally I'd prefer to keep it simple unless we have clear
> evidence that it's needed. I was just laying out my justification for not needing
> to do the conditional wrapping in this comment.

I really don't think we should be adding unconditional locking overhead
to core mm routines purely to facilitate a rarely used debug option.

Instead, can we either adopt something like the RCU-like walk used by
fast GUP or stick the locking behind a static key that's only enabled
when a ptdump walk is in progress (a bit like how
hugetlb_vmemmap_optimize_folio() manipulates hugetlb_optimize_vmemmap_key)?

Will