linux-kernel - Re: [PATCH] arm64: Enable vmalloc-huge with ptdump

Message-ID: <4202a03d-dacd-429c-91e6-81a5d05726a4@arm.com>
Date: Fri, 30 May 2025 12:50:40 +0100
From: Ryan Roberts <ryan.roberts@....com>
To: Dev Jain <dev.jain@....com>, catalin.marinas@....com, will@...nel.org
Cc: anshuman.khandual@....com, quic_zhenhuah@...cinc.com,
 kevin.brodsky@....com, yangyicong@...ilicon.com, joey.gouly@....com,
 linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
 david@...hat.com
Subject: Re: [PATCH] arm64: Enable vmalloc-huge with ptdump

On 30/05/2025 10:14, Dev Jain wrote:
> 
> On 30/05/25 2:10 pm, Ryan Roberts wrote:
>> On 30/05/2025 09:20, Dev Jain wrote:
>>> arm64 disables vmalloc-huge when kernel page table dumping is enabled,
>>> because an intermediate table may be removed, potentially causing the
>>> ptdump code to dereference an invalid address. We want to be able to
>>> analyze block vs page mappings for kernel mappings with ptdump, so to
>>> enable vmalloc-huge with ptdump, synchronize between page table removal in
>>> pmd_free_pte_page()/pud_free_pmd_page() and ptdump pagetable walking. We
>>> use mmap_read_lock and not write lock because we don't need to synchronize
>>> between two different vm_structs; two vmalloc objects running this same
>>> code path will point to different page tables, hence there is no race.
> 
> My "correction" from race->no problem was incorrect after all :) There will
> be no race too since the vm_struct object has exclusive access to whatever
> table it is clearing.
> 
>>>
>>> Signed-off-by: Dev Jain <dev.jain@....com>
>>> ---
>>>   arch/arm64/include/asm/vmalloc.h | 6 ++----
>>>   arch/arm64/mm/mmu.c              | 7 +++++++
>>>   2 files changed, 9 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/arch/arm64/include/asm/vmalloc.h b/arch/arm64/include/asm/vmalloc.h
>>> index 38fafffe699f..28b7173d8693 100644
>>> --- a/arch/arm64/include/asm/vmalloc.h
>>> +++ b/arch/arm64/include/asm/vmalloc.h
>>> @@ -12,15 +12,13 @@ static inline bool arch_vmap_pud_supported(pgprot_t prot)
>>>       /*
>>>        * SW table walks can't handle removal of intermediate entries.
>>>        */
>>> -    return pud_sect_supported() &&
>>> -           !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
>>> +    return pud_sect_supported();
>>>   }
>>>
>>>   #define arch_vmap_pmd_supported arch_vmap_pmd_supported
>>>   static inline bool arch_vmap_pmd_supported(pgprot_t prot)
>>>   {
>>> -    /* See arch_vmap_pud_supported() */
>>> -    return !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
>>> +    return true;
>>>   }
>>>
>>>   #endif
>>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>>> index ea6695d53fb9..798cebd9e147 100644
>>> --- a/arch/arm64/mm/mmu.c
>>> +++ b/arch/arm64/mm/mmu.c
>>> @@ -1261,7 +1261,11 @@ int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
>>>       }
>>>
>>>       table = pte_offset_kernel(pmdp, addr);
>>> +
>>> +    /* Synchronize against ptdump_walk_pgd() */
>>> +    mmap_read_lock(&init_mm);
>>>       pmd_clear(pmdp);
>>> +    mmap_read_unlock(&init_mm);
>> So this works because ptdump_walk_pgd() takes the write_lock (which is mutually
>> exclusive with any read_lock holders) for the duration of the table walk, so it
>> will either consistently see the pgtables before or after this removal. It will
>> never disappear during the walk, correct?
>>
>> I guess there is a risk of this showing up as contention with other init_mm
>> write_lock holders. But I expect that pmd_free_pte_page()/pud_free_pmd_page()
>> are called sufficiently rarely that the risk is very small. Let's fix any perf
>> problem if/when we see it.
> 
> We can avoid all of that by my initial approach - to wrap the lock around
> CONFIG_PTDUMP_DEBUGFS.
> I don't have a strong opinion, just putting it out there.

(I wrote then failed to send earlier):

It's ugly though. Personally I'd prefer to keep it simple unless we have clear
evidence that it's needed. I was just laying out my justification for not needing
to do the conditional wrapping in this comment.

> 
>>
>>>       __flush_tlb_kernel_pgtable(addr);
>> And the tlbi doesn't need to be serialized because there is no security issue.
>> The walker can be trusted to only dereference memory that it sees as it walks
>> the pgtable (obviously).
>>
>>>       pte_free_kernel(NULL, table);
>>>       return 1;
>>> @@ -1289,7 +1293,10 @@ int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
>>>           pmd_free_pte_page(pmdp, next);
>>>       } while (pmdp++, next += PMD_SIZE, next != end);
>>>
>>> +    /* Synchronize against ptdump_walk_pgd() */
>>> +    mmap_read_lock(&init_mm);
>>>       pud_clear(pudp);
>>> +    mmap_read_unlock(&init_mm);
>> Hmm, so pud_free_pmd_page() is now going to cause us to acquire and release the
>> lock up to 513 times (for a 4K kernel). I wonder if there is an argument for
>> clearing the pud first (under the lock), then the pmds can all be cleared
>> without a lock, since the walker won't be able to see the pmds once the pud is
>> cleared.
> 
> Yes, we can isolate the PMD table in case the caller of pmd_free_pte_page is
> pud_free_pmd_page. In this case, vm_struct_1 has exclusive access to the entire
> pmd page, hence no race will occur. But, in case of vmap_try_huge_pmd() being the
> caller, we cannot drop the locks around pmd_free_pte_page. So we can have something
> like
> 
> #ifdef CONFIG_PTDUMP_DEBUGFS
> static inline void ptdump_synchronize_lock(bool flag)
> {
>     if (flag)
>         mmap_read_lock(&init_mm);
> }
> #endif
> 
> and pass false when the caller is pud_free_pmd_page.
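
The lock counts being discussed can be checked with a small userspace model.
This is only a sketch, not the kernel implementation: it assumes 4K pages with
8-byte table entries (so 512 entries per pmd table), assumes every pmd entry
points at a pte table (the worst case behind "up to 513"), and replaces the
real lock with a counter. All names prefixed model_ are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_PAGE_SIZE     4096u   /* assumption: 4K granule */
#define MODEL_ENTRY_SIZE    8u      /* 64-bit table entries */
#define MODEL_PTRS_PER_PMD  (MODEL_PAGE_SIZE / MODEL_ENTRY_SIZE)

static_assert(MODEL_PTRS_PER_PMD == 512, "4K pages hold 512 entries per table");

static unsigned int model_lock_count;

/* Stand-in for the conditional lock helper: count only when asked to lock. */
static void model_synchronize_lock(bool take_lock)
{
    if (take_lock)
        model_lock_count++;
}

/* Models pmd_free_pte_page(): optionally lock around clearing one pmd entry. */
static void model_pmd_free_pte_page(bool take_lock)
{
    model_synchronize_lock(take_lock);
    /* pmd_clear() and the matching unlock would happen here */
}

/* Order used by the patch: lock for each pmd entry, then once for the pud. */
unsigned int model_locks_pmds_then_pud(void)
{
    model_lock_count = 0;
    for (unsigned int i = 0; i < MODEL_PTRS_PER_PMD; i++)
        model_pmd_free_pte_page(true);
    model_synchronize_lock(true);   /* around pud_clear() */
    return model_lock_count;
}

/* Suggested order: clear the pud under the lock, then free pmds unlocked. */
unsigned int model_locks_pud_first(void)
{
    model_lock_count = 0;
    model_synchronize_lock(true);   /* around pud_clear() */
    for (unsigned int i = 0; i < MODEL_PTRS_PER_PMD; i++)
        model_pmd_free_pte_page(false);
    return model_lock_count;
}
```

Under these assumptions model_locks_pmds_then_pud() yields 513 acquisitions
and model_locks_pud_first() yields 1, which is the saving the reordering is
after. A direct caller such as vmap_try_huge_pmd() would still pass true.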
> 
>>
>> Thanks,
>> Ryan
>>
>>>       __flush_tlb_kernel_pgtable(addr);
>>>       pmd_free(NULL, table);
>>>       return 1;