Message-ID: <9169f4a1-1ad3-4ef0-8d16-eabebfa64cf0@arm.com>
Date: Thu, 31 Jul 2025 12:42:21 +0530
From: Dev Jain <dev.jain@....com>
To: Catalin Marinas <catalin.marinas@....com>
Cc: will@...nel.org, anshuman.khandual@....com, quic_zhenhuah@...cinc.com,
ryan.roberts@....com, kevin.brodsky@....com, yangyicong@...ilicon.com,
joey.gouly@....com, linux-arm-kernel@...ts.infradead.org,
linux-kernel@...r.kernel.org, david@...hat.com, mark.rutland@....com,
urezki@...il.com
Subject: Re: [RESEND PATCH v5] arm64: Enable vmalloc-huge with ptdump
On 30/07/25 10:30 pm, Catalin Marinas wrote:
> On Wed, Jul 23, 2025 at 09:48:27PM +0530, Dev Jain wrote:
>> diff --git a/arch/arm64/include/asm/vmalloc.h b/arch/arm64/include/asm/vmalloc.h
>> index 12f534e8f3ed..e835fd437ae0 100644
>> --- a/arch/arm64/include/asm/vmalloc.h
>> +++ b/arch/arm64/include/asm/vmalloc.h
>> @@ -12,15 +12,13 @@ static inline bool arch_vmap_pud_supported(pgprot_t prot)
>> /*
>> * SW table walks can't handle removal of intermediate entries.
>> */
>> - return pud_sect_supported() &&
>> - !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
>> + return pud_sect_supported();
> Does the "SW table walks..." comment still make sense?
Anshuman pointed that out; I missed dropping that comment.
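With both gone, the helper presumably reduces to something like this (just a
sketch of where the hunk ends up):

static inline bool arch_vmap_pud_supported(pgprot_t prot)
{
	return pud_sect_supported();
}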
>
>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>> index 00ab1d648db6..49932c1dd34e 100644
>> --- a/arch/arm64/mm/mmu.c
>> +++ b/arch/arm64/mm/mmu.c
> [...]
>> @@ -1301,16 +1319,76 @@ int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
>> }
>>
>> table = pmd_offset(pudp, addr);
>> +
>> + /*
>> + * Our objective is to prevent ptdump from reading a PMD table which has
>> + * been freed. Assume that ptdump_walk_pgd() (call this thread T1)
>> + * executes completely on CPU1 and pud_free_pmd_page() (call this thread
>> + * T2) executes completely on CPU2. Let the region sandwiched by the
> I think "executes completely on CPU*" is confusing. Just talk about
> threads as they can be migrated between CPUs and the memory model is
> preserved.
Okay.
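Something along these lines, maybe (purely an illustrative rewording of the
comment's opening, not the final text):

	/*
	 * Our objective is to prevent ptdump from reading a PMD table which
	 * has been freed.  Let T1 be a thread executing ptdump_walk_pgd()
	 * and T2 a thread executing pud_free_pmd_page(); the threads may
	 * migrate between CPUs, and the reasoning below relies only on the
	 * memory model, not on which CPU either thread runs on.  Let the
	 * region sandwiched by the mmap_write_lock/unlock in T1 be called
	 * CS (the critical section).
	 */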
>
>> + * mmap_write_lock/unlock in T1 be called CS (the critical section).
>> + *
>> + * Claim: The CS of T1 will never operate on a freed PMD table.
>> + *
>> + * Proof:
>> + *
>> + * Case 1: The static branch is visible to T2.
>> + *
>> + * Case 1 (a): T1 acquires the lock before T2 can.
>> + * T2 will block until T1 drops the lock, so pmd_free() will only be
>> + * executed after T1 exits CS.
> This assumes that there is some ordering between unlock and pmd_free()
> (e.g. some poisoning of the old page). The unlock only gives us release
> semantics, not acquire. It just happens that we have an atomic
> dec-and-test down the __free_pages() path but I'm not convinced we
> should rely on it unless free_pages() has clear semantics on ordering
> related to prior memory writes.
I replied in the other mail to Ryan; hopefully that reasoning is satisfactory.
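For reference, the alternative would be to make that ordering explicit rather
than lean on the dec-and-test inside __free_pages(); a rough sketch of the tail
of pud_free_pmd_page() (the structure and the key name are lifted from the
hunks above and from your litmus test further down, so treat the names as
illustrative):

	pud_clear(pudp);
	__flush_tlb_kernel_pgtable(addr);

	if (static_branch_unlikely(&ptdump_lock_key)) {
		mmap_read_lock(&init_mm);
		mmap_read_unlock(&init_mm);
	}

	/*
	 * Hypothetical: order the unlock above (and everything before it)
	 * against the freeing below explicitly, instead of relying on the
	 * atomic dec-and-test down the __free_pages() path.
	 */
	smp_mb();
	pmd_free(NULL, table);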
>
>> + *
>> + * Case 1 (b): T2 acquires the lock before T1 can.
>> + * The sequence of barriers issued in __flush_tlb_kernel_pgtable()
>> + * ensures that an empty PUD (via pud_clear()) is visible to T1 before
>> + * T1 can enter CS, therefore it is impossible for the CS to get hold
>> + * of the address of the isolated PMD table.
> Is this the effect of the barriers in the TLB flushing or the release
> semantics of the unlock?
Yeah, you are right :) The release semantics of the unlock guarantee that
the empty PUD will be observed before the unlock, and the unlock has to
happen before T1 can enter CS; transitivity leads us to the conclusion.
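If it helps, that chain can be written in the same herd7 style as your test
further down; a rough sketch (the done flag is only there to select the
executions where the freeing side really does win the lock first, i.e.
Case 1 (b)):

C ptdump_case1b

{
	pud = 1;
	done = 0;
}

// pud_free_pmd_page(), static branch already observed as true
P0(spinlock_t *init_mm, int *pud, int *done)
{
	WRITE_ONCE(*pud, 0);		// pud_clear()
	smp_mb();			// __flush_tlb_kernel_pgtable() (approx.)
	spin_lock(init_mm);		// mmap_read_lock()
	WRITE_ONCE(*done, 1);		// marker: this CS ran first
	spin_unlock(init_mm);		// mmap_read_unlock(), release
}

// ptdump_walk_pgd() critical section
P1(spinlock_t *init_mm, int *pud, int *done)
{
	int r1;
	int val;

	spin_lock(init_mm);		// mmap_write_lock(), acquire
	r1 = READ_ONCE(*done);
	val = READ_ONCE(*pud);		// must not see a live PUD if r1 = 1
	spin_unlock(init_mm);
}

exists(1:r1=1 /\ 1:val=1)

If the release/acquire chain is what saves us here, the exists clause should
not be reachable.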
>
>> + *
>> + * Case 2: The static branch is not visible to T2.
>> + *
>> + * Since static_branch_enable() (via dmb(ish)) and mmap_write_lock()
>> + * have acquire semantics, it is guaranteed that the static branch
> Does static_branch_disable() have release semantics? I think it does via
> some atomic_cmpxchg() but it's worth spelling it out.
Yes, it *should* have, but this proof is already getting complicated, so I
thought it best not to mention anything that is not required for the
reasoning of the proof :)
>
>> + * will be visible to all CPUs before T1 can enter CS. The static
> As I mentioned on a previous version, this visibility is not absolute.
> You could say that it will be observed by other CPUs before they observe
> the write_lock.
Hmm. I probably am bad at explaining all of this in English :)
>
>> + * branch not being visible to T2 therefore guarantees that T1 has
>> + * not yet entered CS .... (i)
>> + * The sequence of barriers via __flush_tlb_kernel_pgtable() in T2
>> + * implies that if the invisibility of the static branch has been
>> + * observed by T2 (i.e static_branch_unlikely() is observed as false),
>> + * then all CPUs will have observed an empty PUD ... (ii)
>> + * Combining (i) and (ii), we conclude that T1 observes an empty PUD
>> + * before entering CS => it is impossible for the CS to get hold of
>> + * the address of the isolated PMD table. Q.E.D
> OK, I nearly got lost. That's not a straightforward memory ordering, as
> we have instruction updates with ISB as part of the TLB flushing. I
> concluded last time I looked that this branch patching part works
> because we also have kick_all_cpus_sync() as part of the static branch
> update.
>
> Stepping back to a simpler model, let's say that the static key is a
> variable. I wrote this quick test for herd7 and the Linux kernel memory
> model (fairly easy to play with):
>
> -------------------------------------8<---------------------------------------
> C pte_free_ptdump
>
> (*
> * $ cd tools/memory-model
> * $ herd7 -conf linux-kernel.cfg path/to/pte_free_ptdump.litmus
> *)
>
> {
> pmd = 1; (* allocated pmd *)
> pte_page = 1; (* valid pte page pointed at by the pmd *)
> }
>
> // pmd_free_pte_page()
> P0(spinlock_t *init_mm, int *ptdump_lock_key, int *pmd, int *pte_page)
> {
> WRITE_ONCE(*pmd, 0); // pmd_clear()
> smp_mb();
> if (READ_ONCE(*ptdump_lock_key)) { // static_branch() approximation
> spin_lock(init_mm); // mmap_read_lock()
> spin_unlock(init_mm); // mmap_read_unlock()
> }
> smp_mb();
> WRITE_ONCE(*pte_page, 0); // pte_free_kernel()
> }
>
> // ptdump_walk_pgd()
> P1(spinlock_t *init_mm, int *ptdump_lock_key, int *pmd, int *pte_page)
> {
> int val;
> int page;
>
> WRITE_ONCE(*ptdump_lock_key, 1); // static_branch_enable()
> smp_mb();
> spin_lock(init_mm); // mmap_write_lock()
> val = READ_ONCE(*pmd);
> page = READ_ONCE(*pte_page); // freed pte page?
> spin_unlock(init_mm); // mmap_write_unlock()
> smp_mb();
> WRITE_ONCE(*ptdump_lock_key, 0); // static_branch_disable()
> }
>
> exists(1:val=1 /\ 1:page=0)
> -------------------------------------8<---------------------------------------
>
> I sprinkled some necessary smp_mb() but in most cases we have
> release/acquire semantics. It does show that we need a barrier before
> the page freeing. We also need to acquire for enabling the static key
> and release for disabling it.
>
> Next step is to assess that replacing the static key variable read/write
> with code patching preserves the model.
Thanks for the litmus test!
I ran it and I get the same result. But I still cannot figure out where the hole
in my proof is. I think you are raising an objection to Case 2:
T2 (your P0, the freeing side):              T1 (your P1, ptdump):

WRITE_ONCE(*pmd, 0)             : t1         WRITE_ONCE(*ptdump_lock_key, 1) : t3
smp_mb() // flush_tlb_kernel_pgtable         cmpxchg (acquire write lock)
                                             smp_mb()
READ_ONCE(*ptdump_lock_key) = 0 : t2         val = READ_ONCE(*pmd)           : t4
WRITE_ONCE(*pte_page, 0)                     page = READ_ONCE(*pte_page)
                                             smp_mb()
                                             cmpxchg (release write lock)
Let t_i be the global timestamps of the execution of the labelled instructions
(note that each labelled instruction is atomic so assigning a timestamp makes sense).
The fact that we read ptdump_lock_key as 0 implies that
t2 < t3 ... (i)
Now,
t1 < t2 because of the barrier ... (ii)
t3 < t4 because of the barrier ... (iii)
Combining (i), (ii) and (iii): t1 < t4, so val will be observed as 0.
Alright I am horribly confused now :)
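FWIW, the table above can be fed to herd7 too; a rough sketch along the lines
of your test, with the lock elided on the freeing side since in Case 2 it
observes the key as 0 and skips it:

C ptdump_case2

{
	pmd = 1;
	pte_page = 1;
	ptdump_lock_key = 0;
}

// pmd_free_pte_page(), static branch observed as false
P0(int *ptdump_lock_key, int *pmd, int *pte_page)
{
	int key;

	WRITE_ONCE(*pmd, 0);			// pmd_clear() : t1
	smp_mb();				// __flush_tlb_kernel_pgtable() (approx.)
	key = READ_ONCE(*ptdump_lock_key);	// : t2
	WRITE_ONCE(*pte_page, 0);		// pte_free_kernel()
}

// ptdump_walk_pgd()
P1(spinlock_t *init_mm, int *ptdump_lock_key, int *pmd, int *pte_page)
{
	int val;
	int page;

	WRITE_ONCE(*ptdump_lock_key, 1);	// static_branch_enable() : t3
	smp_mb();
	spin_lock(init_mm);			// mmap_write_lock()
	val = READ_ONCE(*pmd);			// : t4
	page = READ_ONCE(*pte_page);
	spin_unlock(init_mm);
}

exists(0:key=0 /\ 1:val=1)

If the Case 2 argument holds, that exists clause should not be reachable;
whether it still holds once the smp_mb()s are replaced by what the code
actually does (code patching plus kick_all_cpus_sync()) is the part I have
not convinced myself about.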
>