Message-ID: <ed2df0cc-e02c-4376-af7a-7deac6efa9b4@arm.com>
Date: Mon, 16 Jun 2025 22:20:29 +0100
From: Ryan Roberts <ryan.roberts@....com>
To: Dev Jain <dev.jain@....com>, catalin.marinas@....com, will@...nel.org
Cc: anshuman.khandual@....com, quic_zhenhuah@...cinc.com,
kevin.brodsky@....com, yangyicong@...ilicon.com, joey.gouly@....com,
linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
david@...hat.com
Subject: Re: [PATCH v3] arm64: Enable vmalloc-huge with ptdump
On 16/06/2025 19:07, Ryan Roberts wrote:
> On 16/06/2025 11:33, Dev Jain wrote:
>> arm64 disables vmalloc-huge when kernel page table dumping is enabled,
>> because an intermediate table may be removed, potentially causing the
>> ptdump code to dereference an invalid address. We want to be able to
>> analyze block vs page mappings for kernel mappings with ptdump, so to
>> enable vmalloc-huge with ptdump, synchronize between page table removal in
>> pmd_free_pte_page()/pud_free_pmd_page() and ptdump pagetable walking. We
>> use mmap_read_lock and not write lock because we don't need to synchronize
>> between two different vm_structs; two vmalloc objects running this same
>> code path will point to different page tables, hence there is no race.
>>
>> For pud_free_pmd_page(), we isolate the PMD table to avoid taking the lock
>> 512 times again via pmd_free_pte_page().
>>
>> We implement the locking mechanism using static keys, since the chance
>> of a race is very small. Observe that the synchronization is needed
>> to avoid the following race:
>>
>> CPU1                                  CPU2
>>                                       take reference of PMD table
>> pud_clear()
>> pte_free_kernel()
>>                                       walk freed PMD table
>>
>> and similar race between pmd_free_pte_page and ptdump_walk_pgd.
>>
>> Therefore, there are two cases: if ptdump sees the cleared PUD, then
>> we are safe. If not, then the patched-in read and write locks help us
>> avoid the race.
>>
>> To implement the mechanism, the static key must be accessible from both
>> mmu.c and ptdump.c. Note that when CONFIG_PTDUMP_DEBUGFS is disabled,
>> ptdump.o is not a target in the Makefile, so we cannot initialize the key
>> there, as is done, for example, in the hugetlb-vmemmap static key
>> implementation. Therefore, include asm/cpufeature.h, which already pulls in
>> the jump_label mechanism, declare the key in that header, and define it to
>> false in mmu.c.
>>
>> No issues were observed with mm-selftests. No issues were observed while
>> running test_vmalloc.sh in parallel with dumping the kernel pagetable
>> through sysfs in a loop.
>>
>> v2->v3:
>> - Use static key mechanism
>>
>> v1->v2:
>> - Take lock only when CONFIG_PTDUMP_DEBUGFS is on
>> - In case of pud_free_pmd_page(), isolate the PMD table to avoid taking
>> the lock 512 times again via pmd_free_pte_page()
>>
>> Signed-off-by: Dev Jain <dev.jain@....com>
>> ---
>>  arch/arm64/include/asm/cpufeature.h |  1 +
>>  arch/arm64/mm/mmu.c                 | 51 ++++++++++++++++++++++++++---
>>  arch/arm64/mm/ptdump.c              |  5 +++
>>  3 files changed, 53 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
>> index c4326f1cb917..3e386563b587 100644
>> --- a/arch/arm64/include/asm/cpufeature.h
>> +++ b/arch/arm64/include/asm/cpufeature.h
>> @@ -26,6 +26,7 @@
>>  #include <linux/kernel.h>
>>  #include <linux/cpumask.h>
>>
>> +DECLARE_STATIC_KEY_FALSE(ptdump_lock_key);
>>  /*
>>   * CPU feature register tracking
>>   *
>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>> index 8fcf59ba39db..e242ba428820 100644
>> --- a/arch/arm64/mm/mmu.c
>> +++ b/arch/arm64/mm/mmu.c
>> @@ -41,11 +41,14 @@
>>  #include <asm/tlbflush.h>
>>  #include <asm/pgalloc.h>
>>  #include <asm/kfence.h>
>> +#include <asm/cpufeature.h>
>>
>>  #define NO_BLOCK_MAPPINGS	BIT(0)
>>  #define NO_CONT_MAPPINGS	BIT(1)
>>  #define NO_EXEC_MAPPINGS	BIT(2)	/* assumes FEAT_HPDS is not used */
>>
>> +DEFINE_STATIC_KEY_FALSE(ptdump_lock_key);
>> +
>>  enum pgtable_type {
>>  	TABLE_PTE,
>>  	TABLE_PMD,
>> @@ -1267,8 +1270,9 @@ int pmd_clear_huge(pmd_t *pmdp)
>>  	return 1;
>>  }
>>
>> -int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
>> +static int __pmd_free_pte_page(pmd_t *pmdp, unsigned long addr, bool lock)
>>  {
>> +	bool lock_taken = false;
>>  	pte_t *table;
>>  	pmd_t pmd;
>>
>> @@ -1279,15 +1283,29 @@ int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
>>  		return 1;
>>  	}
>>
>> +	/* See comment in pud_free_pmd_page for static key logic */
>>  	table = pte_offset_kernel(pmdp, addr);
>>  	pmd_clear(pmdp);
>>  	__flush_tlb_kernel_pgtable(addr);
>> +	if (static_branch_unlikely(&ptdump_lock_key) && lock) {
>> +		mmap_read_lock(&init_mm);
>> +		lock_taken = true;
>> +	}
>> +	if (unlikely(lock_taken))
>> +		mmap_read_unlock(&init_mm);
>> +
>>  	pte_free_kernel(NULL, table);
>>  	return 1;
>>  }
>>
>> +int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
>> +{
>> +	return __pmd_free_pte_page(pmdp, addr, true);
>> +}
>> +
>>  int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
>>  {
>> +	bool lock_taken = false;
>>  	pmd_t *table;
>>  	pmd_t *pmdp;
>>  	pud_t pud;
>> @@ -1301,15 +1319,40 @@ int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
>>  	}
>>
>>  	table = pmd_offset(pudp, addr);
>> +	/*
>> +	 * Isolate the PMD table; in case of race with ptdump, this helps
>> +	 * us to avoid taking the lock in __pmd_free_pte_page().
>> +	 *
>> +	 * Static key logic:
>> +	 *
>> +	 * Case 1: If ptdump does static_branch_enable(), and after that we
>> +	 * execute the if block, then this patches in the read lock, ptdump has
>> +	 * the write lock patched in, therefore ptdump will never read from
>> +	 * a potentially freed PMD table.
>> +	 *
>> +	 * Case 2: If the if block starts executing before ptdump's
>> +	 * static_branch_enable(), then no locking synchronization
>> +	 * will be done. However, pud_clear() + the dsb() in
>> +	 * __flush_tlb_kernel_pgtable will ensure that ptdump observes an
>> +	 * empty PUD. Thus, it will never walk over a potentially freed
>> +	 * PMD table.
>> +	 */
>> +	pud_clear(pudp);
>
> How can this possibly be correct? You're clearing the pud without any
> synchronisation. So you could have this situation:
>
> CPU1 (vmalloc)                        CPU2 (ptdump)
>
>                                       static_branch_enable()
>                                       mmap_write_lock()
>                                       pud = pudp_get()
> pud_free_pmd_page()
>   pud_clear()
>                                       access the table pointed to by pud
>                                       BANG!
>
> Surely the logic needs to be:
>
> 	if (static_branch_unlikely(&ptdump_lock_key)) {
> 		mmap_read_lock(&init_mm);
> 		lock_taken = true;
> 	}
> 	pud_clear(pudp);
> 	if (unlikely(lock_taken))
> 		mmap_read_unlock(&init_mm);
>
> That fixes your first case, I think? But doesn't fix your second case. You could
> still have:
>
> CPU1 (vmalloc)                        CPU2 (ptdump)
>
> pud_free_pmd_page()
> <ptdump_lock_key=FALSE>
>                                       static_branch_enable()
>                                       mmap_write_lock()
>                                       pud = pudp_get()
> pud_clear()
>                                       access the table pointed to by pud
>                                       BANG!
>
> I think what you need is some sort of RCU read-side critical section in the
> vmalloc side that you can then synchronize on in the ptdump side. But you would
> need to be in the read-side critical section when you sample the static key, but
> you can't sleep waiting for the mmap lock while in the critical section. This
> feels solvable, and there is almost certainly a well-used pattern, but I'm not
> quite sure what the answer is. Perhaps others can help...
Just taking a step back here, I found the "percpu rw semaphore". From the
documentation:
"""
Percpu rw semaphores is a new read-write semaphore design that is
optimized for locking for reading.

The problem with traditional read-write semaphores is that when multiple
cores take the lock for reading, the cache line containing the semaphore
is bouncing between L1 caches of the cores, causing performance
degradation.

Locking for reading is very fast, it uses RCU and it avoids any atomic
instruction in the lock and unlock path. On the other hand, locking for
writing is very expensive, it calls synchronize_rcu() that can take
hundreds of milliseconds.
Perhaps this provides the properties we are looking for? Could just define one
of these and lock it in read mode around pXd_clear() on the vmalloc side. Then
lock it in write mode around ptdump_walk_pgd() on the ptdump side. No need for
static key or other hoops. Given it's a dedicated lock, there is no risk of
accidental contention because no other code is using it.
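
Very roughly, and completely untested, I'm imagining something like the below
(ptdump_pgtable_sem is just a name I've made up for illustration):

	/* arch/arm64/mm/mmu.c: sketch only */
	#include <linux/percpu-rwsem.h>

	/* would need a declaration in a shared header so ptdump.c can see it */
	DEFINE_PERCPU_RWSEM(ptdump_pgtable_sem);

	int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
	{
		...
		/*
		 * Read side is cheap; no atomics and no contention between
		 * concurrent unmappers. Held around the clear so that ptdump
		 * either sees the empty pud or finishes its walk before the
		 * table is freed below.
		 */
		percpu_down_read(&ptdump_pgtable_sem);
		pud_clear(pudp);
		percpu_up_read(&ptdump_pgtable_sem);
		__flush_tlb_kernel_pgtable(addr);
		...
	}

	/* arch/arm64/mm/ptdump.c: sketch only */
	void ptdump_walk(struct seq_file *s, struct ptdump_info *info)
	{
		...
		/* Write side is expensive (synchronize_rcu()), but this is a debug path. */
		percpu_down_write(&ptdump_pgtable_sem);
		ptdump_walk_pgd(&st.ptdump, info->mm, NULL);
		percpu_up_write(&ptdump_pgtable_sem);
	}

ptdump_check_wx() would take the write lock in the same way, and
pmd_free_pte_page() would wrap pmd_clear() with the read lock.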
>
> Thanks,
> Ryan
>
>
>> +	__flush_tlb_kernel_pgtable(addr);
>> +	if (static_branch_unlikely(&ptdump_lock_key)) {
>> +		mmap_read_lock(&init_mm);
>> +		lock_taken = true;
>> +	}
>> +	if (unlikely(lock_taken))
>> +		mmap_read_unlock(&init_mm);
>> +
>>  	pmdp = table;
>>  	next = addr;
>>  	end = addr + PUD_SIZE;
>>  	do {
>> -		pmd_free_pte_page(pmdp, next);
>> +		__pmd_free_pte_page(pmdp, next, false);
>>  	} while (pmdp++, next += PMD_SIZE, next != end);
>>
>> -	pud_clear(pudp);
>> -	__flush_tlb_kernel_pgtable(addr);
>>  	pmd_free(NULL, table);
>>  	return 1;
>>  }
>> diff --git a/arch/arm64/mm/ptdump.c b/arch/arm64/mm/ptdump.c
>> index 421a5de806c6..f75e12a1d068 100644
>> --- a/arch/arm64/mm/ptdump.c
>> +++ b/arch/arm64/mm/ptdump.c
>> @@ -25,6 +25,7 @@
>>  #include <asm/pgtable-hwdef.h>
>>  #include <asm/ptdump.h>
>>
>> +#include <asm/cpufeature.h>
>>
>>  #define pt_dump_seq_printf(m, fmt, args...)	\
>>  ({	\
>> @@ -311,7 +312,9 @@ void ptdump_walk(struct seq_file *s, struct ptdump_info *info)
>>  		}
>>  	};
>>
>> +	static_branch_enable(&ptdump_lock_key);
>>  	ptdump_walk_pgd(&st.ptdump, info->mm, NULL);
>> +	static_branch_disable(&ptdump_lock_key);
>>  }
>>
>>  static void __init ptdump_initialize(void)
>> @@ -353,7 +356,9 @@ bool ptdump_check_wx(void)
>>  		}
>>  	};
>>
>> +	static_branch_enable(&ptdump_lock_key);
>>  	ptdump_walk_pgd(&st.ptdump, &init_mm, NULL);
>> +	static_branch_disable(&ptdump_lock_key);
>>
>>  	if (st.wx_pages || st.uxn_pages) {
>>  		pr_warn("Checked W+X mappings: FAILED, %lu W+X pages found, %lu non-UXN pages found\n",
>