Message-ID: <9169f4a1-1ad3-4ef0-8d16-eabebfa64cf0@arm.com>
Date: Thu, 31 Jul 2025 12:42:21 +0530
From: Dev Jain <dev.jain@....com>
To: Catalin Marinas <catalin.marinas@....com>
Cc: will@...nel.org, anshuman.khandual@....com, quic_zhenhuah@...cinc.com,
ryan.roberts@....com, kevin.brodsky@....com, yangyicong@...ilicon.com,
joey.gouly@....com, linux-arm-kernel@...ts.infradead.org,
linux-kernel@...r.kernel.org, david@...hat.com, mark.rutland@....com,
urezki@...il.com
Subject: Re: [RESEND PATCH v5] arm64: Enable vmalloc-huge with ptdump
On 30/07/25 10:30 pm, Catalin Marinas wrote:
> On Wed, Jul 23, 2025 at 09:48:27PM +0530, Dev Jain wrote:
>> diff --git a/arch/arm64/include/asm/vmalloc.h b/arch/arm64/include/asm/vmalloc.h
>> index 12f534e8f3ed..e835fd437ae0 100644
>> --- a/arch/arm64/include/asm/vmalloc.h
>> +++ b/arch/arm64/include/asm/vmalloc.h
>> @@ -12,15 +12,13 @@ static inline bool arch_vmap_pud_supported(pgprot_t prot)
>> /*
>> * SW table walks can't handle removal of intermediate entries.
>> */
>> - return pud_sect_supported() &&
>> - !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
>> + return pud_sect_supported();
> Does the "SW table walks..." comment still make sense?
Anshuman pointed that out; I missed dropping that comment.
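With both gone, the helper presumably reduces to something like this (just a
sketch of where the hunk ends up):

static inline bool arch_vmap_pud_supported(pgprot_t prot)
{
	return pud_sect_supported();
}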
>
>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>> index 00ab1d648db6..49932c1dd34e 100644
>> --- a/arch/arm64/mm/mmu.c
>> +++ b/arch/arm64/mm/mmu.c
> [...]
>> @@ -1301,16 +1319,76 @@ int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
>> }
>>
>> table = pmd_offset(pudp, addr);
>> +
>> + /*
>> + * Our objective is to prevent ptdump from reading a PMD table which has
>> + * been freed. Assume that ptdump_walk_pgd() (call this thread T1)
>> + * executes completely on CPU1 and pud_free_pmd_page() (call this thread
>> + * T2) executes completely on CPU2. Let the region sandwiched by the
> I think "executes completely on CPU*" is confusing. Just talk about
> threads as they can be migrated between CPUs and the memory model is
> preserved.
Okay.
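Something along these lines, maybe (purely an illustrative rewording of the
comment's opening, not the final text):

	/*
	 * Our objective is to prevent ptdump from reading a PMD table which
	 * has been freed.  Let T1 be a thread executing ptdump_walk_pgd()
	 * and T2 a thread executing pud_free_pmd_page(); the threads may
	 * migrate between CPUs, and the reasoning below relies only on the
	 * memory model, not on which CPU either thread runs on.  Let the
	 * region sandwiched by the mmap_write_lock/unlock in T1 be called
	 * CS (the critical section).
	 */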
>
>> + * mmap_write_lock/unlock in T1 be called CS (the critical section).
>> + *
>> + * Claim: The CS of T1 will never operate on a freed PMD table.
>> + *
>> + * Proof:
>> + *
>> + * Case 1: The static branch is visible to T2.
>> + *
>> + * Case 1 (a): T1 acquires the lock before T2 can.
>> + * T2 will block until T1 drops the lock, so pmd_free() will only be
>> + * executed after T1 exits CS.
> This assumes that there is some ordering between unlock and pmd_free()
> (e.g. some poisoning of the old page). The unlock only gives us release
> semantics, not acquire. It just happens that we have an atomic
> dec-and-test down the __free_pages() path but I'm not convinced we
> should rely on it unless free_pages() has clear semantics on ordering
> related to prior memory writes.
I replied in the other mail to Ryan; hopefully that reasoning is satisfactory.
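For reference, the alternative would be to make that ordering explicit rather
than lean on the dec-and-test inside __free_pages(); a rough sketch of the tail
of pud_free_pmd_page() (the structure and the key name are lifted from the
hunks above and from your litmus test further down, so treat the names as
illustrative):

	pud_clear(pudp);
	__flush_tlb_kernel_pgtable(addr);

	if (static_branch_unlikely(&ptdump_lock_key)) {
		mmap_read_lock(&init_mm);
		mmap_read_unlock(&init_mm);
	}

	/*
	 * Hypothetical: order the unlock above (and everything before it)
	 * against the freeing below explicitly, instead of relying on the
	 * atomic dec-and-test down the __free_pages() path.
	 */
	smp_mb();
	pmd_free(NULL, table);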
>
>> + *
>> + * Case 1 (b): T2 acquires the lock before T1 can.
>> + * The sequence of barriers issued in __flush_tlb_kernel_pgtable()
>> + * ensures that an empty PUD (via pud_clear()) is visible to T1 before
>> + * T1 can enter CS, therefore it is impossible for the CS to get hold
>> + * of the address of the isolated PMD table.
> Is this the effect of the barriers in the TLB flushing or the release
> semantics of the unlock?
Yeah, you are right :) The release semantics of the unlock guarantee that
the empty PUD will be observed before the unlock, and the unlock has to
happen before T1 can enter CS; transitivity leads us to the conclusion.
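If it helps, that chain can be written in the same herd7 style as your test
further down; a rough sketch (the done flag is only there to select the
executions where the freeing side really does win the lock first, i.e.
Case 1 (b)):

C ptdump_case1b

{
	pud = 1;
	done = 0;
}

// pud_free_pmd_page(), static branch already observed as true
P0(spinlock_t *init_mm, int *pud, int *done)
{
	WRITE_ONCE(*pud, 0);		// pud_clear()
	smp_mb();			// __flush_tlb_kernel_pgtable() (approx.)
	spin_lock(init_mm);		// mmap_read_lock()
	WRITE_ONCE(*done, 1);		// marker: this CS ran first
	spin_unlock(init_mm);		// mmap_read_unlock(), release
}

// ptdump_walk_pgd() critical section
P1(spinlock_t *init_mm, int *pud, int *done)
{
	int r1;
	int val;

	spin_lock(init_mm);		// mmap_write_lock(), acquire
	r1 = READ_ONCE(*done);
	val = READ_ONCE(*pud);		// must not see a live PUD if r1 = 1
	spin_unlock(init_mm);
}

exists(1:r1=1 /\ 1:val=1)

If the release/acquire chain is what saves us here, the exists clause should
not be reachable.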
>
>> + *
>> + * Case 2: The static branch is not visible to T2.
>> + *
>> + * Since static_branch_enable() (via dmb(ish)) and mmap_write_lock()
>> + * have acquire semantics, it is guaranteed that the static branch
> Does static_branch_disable() have release semantics? I think it does via
> some atomic_cmpxchg() but it's worth spelling it out.
Yes, it *should* have, but this proof is already getting complicated, so I
thought it best not to mention anything that is not required for the
reasoning of the proof :)
>
>> + * will be visible to all CPUs before T1 can enter CS. The static
> As I mentioned on a previous version, this visibility is not absolute.
> You could say that it will be observed by other CPUs before they observe
> the write_lock.
Hmm. I probably am bad at explaining all of this in English :)
>
>> + * branch not being visible to T2 therefore guarantees that T1 has
>> + * not yet entered CS .... (i)
>> + * The sequence of barriers via __flush_tlb_kernel_pgtable() in T2
>> + * implies that if the invisibility of the static branch has been
>> + * observed by T2 (i.e static_branch_unlikely() is observed as false),
>> + * then all CPUs will have observed an empty PUD ... (ii)
>> + * Combining (i) and (ii), we conclude that T1 observes an empty PUD
>> + * before entering CS => it is impossible for the CS to get hold of
>> + * the address of the isolated PMD table. Q.E.D
> OK, I nearly got lost. That's not a straightforward memory ordering, as
> we have instruction updates with ISB as part of the TLB flushing. I
> concluded last time I looked that this branch patching part works
> because we also have kick_all_cpus_sync() as part of the static branch
> update.
>
> Stepping back to a simpler model, let's say that the static key is a
> variable. I wrote this quick test for herd7 and the Linux kernel memory
> model (fairly easy to play with):
>
> -------------------------------------8<---------------------------------------
> C pte_free_ptdump
>
> (*
> * $ cd tools/memory-model
> * $ herd7 -conf linux-kernel.cfg path/to/pte_free_ptdump.litmus
> *)
>
> {
> pmd = 1; (* allocated pmd *)
> pte_page = 1; (* valid pte page pointed at by the pmd *)
> }
>
> // pmd_free_pte_page()
> P0(spinlock_t *init_mm, int *ptdump_lock_key, int *pmd, int *pte_page)
> {
> WRITE_ONCE(*pmd, 0); // pmd_clear()
> smp_mb();
> if (READ_ONCE(*ptdump_lock_key)) { // static_branch() approximation
> spin_lock(init_mm); // mmap_read_lock()
> spin_unlock(init_mm); // mmap_read_unlock()
> }
> smp_mb();
> WRITE_ONCE(*pte_page, 0); // pte_free_kernel()
> }
>
> // ptdump_walk_pgd()
> P1(spinlock_t *init_mm, int *ptdump_lock_key, int *pmd, int *pte_page)
> {
> int val;
> int page;
>
> WRITE_ONCE(*ptdump_lock_key, 1); // static_branch_enable()
> smp_mb();
> spin_lock(init_mm); // mmap_write_lock()
> val = READ_ONCE(*pmd);
> page = READ_ONCE(*pte_page); // freed pte page?
> spin_unlock(init_mm); // mmap_write_unlock()
> smp_mb();
> WRITE_ONCE(*ptdump_lock_key, 0); // static_branch_disable()
> }
>
> exists(1:val=1 /\ 1:page=0)
> -------------------------------------8<---------------------------------------
>
> I sprinkled some necessary smp_mb() but in most cases we have
> release/acquire semantics. It does show that we need a barrier before
> the page freeing. We also need to acquire for enabling the static key
> and release for disabling it.
>
> Next step is to assess that replacing the static key variable read/write
> with code patching preserves the model.
Thanks for the litmus test!
I ran it and I get the same result. But I still cannot figure out where the hole
in my proof is. I think you are raising an objection to Case 2:
T2 (your P0, the freeing side):              T1 (your P1, ptdump):

WRITE_ONCE(*pmd, 0)             : t1         WRITE_ONCE(*ptdump_lock_key, 1) : t3
smp_mb() // flush_tlb_kernel_pgtable         cmpxchg (acquire write lock)
                                             smp_mb()
READ_ONCE(*ptdump_lock_key) = 0 : t2         val = READ_ONCE(*pmd)           : t4
WRITE_ONCE(*pte_page, 0)                     page = READ_ONCE(*pte_page)
                                             smp_mb()
                                             cmpxchg (release write lock)
Let t_i be the global timestamps of the execution of the labelled instructions
(note that each labelled instruction is atomic so assigning a timestamp makes sense).
The fact that we read ptdump_lock_key as 0 implies that
t2 < t3 ... (i)
Now,
t1 < t2 because of the barrier ... (ii)
t3 < t4 because of the barrier ... (iii)
Combining (i), (ii) and (iii): t1 < t4, so val will be observed as 0.
Alright I am horribly confused now :)
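FWIW, the table above can be fed to herd7 too; a rough sketch along the lines
of your test, with the lock elided on the freeing side since in Case 2 it
observes the key as 0 and skips it:

C ptdump_case2

{
	pmd = 1;
	pte_page = 1;
	ptdump_lock_key = 0;
}

// pmd_free_pte_page(), static branch observed as false
P0(int *ptdump_lock_key, int *pmd, int *pte_page)
{
	int key;

	WRITE_ONCE(*pmd, 0);			// pmd_clear() : t1
	smp_mb();				// __flush_tlb_kernel_pgtable() (approx.)
	key = READ_ONCE(*ptdump_lock_key);	// : t2
	WRITE_ONCE(*pte_page, 0);		// pte_free_kernel()
}

// ptdump_walk_pgd()
P1(spinlock_t *init_mm, int *ptdump_lock_key, int *pmd, int *pte_page)
{
	int val;
	int page;

	WRITE_ONCE(*ptdump_lock_key, 1);	// static_branch_enable() : t3
	smp_mb();
	spin_lock(init_mm);			// mmap_write_lock()
	val = READ_ONCE(*pmd);			// : t4
	page = READ_ONCE(*pte_page);
	spin_unlock(init_mm);
}

exists(0:key=0 /\ 1:val=1)

If the Case 2 argument holds, that exists clause should not be reachable;
whether it still holds once the smp_mb()s are replaced by what the code
actually does (code patching plus kick_all_cpus_sync()) is the part I have
not convinced myself about.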
>