Message-ID: <c53ec40c-1fe4-4092-a876-61d5b37d8b02@arm.com>
Date: Fri, 1 Aug 2025 17:45:53 +0530
From: Dev Jain <dev.jain@....com>
To: Catalin Marinas <catalin.marinas@....com>
Cc: will@...nel.org, anshuman.khandual@....com, quic_zhenhuah@...cinc.com,
ryan.roberts@....com, kevin.brodsky@....com, yangyicong@...ilicon.com,
joey.gouly@....com, linux-arm-kernel@...ts.infradead.org,
linux-kernel@...r.kernel.org, david@...hat.com, mark.rutland@....com,
urezki@...il.com
Subject: Re: [RESEND PATCH v5] arm64: Enable vmalloc-huge with ptdump
On 31/07/25 10:36 pm, Catalin Marinas wrote:
> On Thu, Jul 31, 2025 at 12:42:21PM +0530, Dev Jain wrote:
>> On 30/07/25 10:30 pm, Catalin Marinas wrote:
>>> On Wed, Jul 23, 2025 at 09:48:27PM +0530, Dev Jain wrote:
>>>> + * Case 2: The static branch is not visible to T2.
>>>> + *
>>>> + * Since static_branch_enable() (via dmb(ish)) and mmap_write_lock()
>>>> + * have acquire semantics, it is guaranteed that the static branch
>>>> + * will be visible to all CPUs before T1 can enter CS. The static
>>> Does static_branch_disable() have release semantics? I think it does via
>>> some atomic_cmpxchg() but it's worth spelling it out.
>> Yes, it *should* have, but this proof is already getting complicated, so
>> I thought: let's not mention anything that is not required in the
>> reasoning of the proof :)
> The static branch observability relative to the mmap_lock is essential,
> both when enabling and disabling the static key (i.e. you don't want the
> static branch to be seen disabled while T1 is in the critical section).
> That's why it's worth mentioning.
>
> On the static_branch_enable() path, we have a kick_all_cpus_sync(), so
> the T1 update of mmap lock will become visible after the branch update
> (of course, also subject to correct ordering on T2, see more at the end
> of the email).
>
> On the static_branch_disable() path, we need it to have release
> semantics _prior_ to disabling the key. We get this via the
> atomic_cmpxchg() in static_key_disable_cpuslocked().
>
> So I think it's all good here from the T1 perspective, just capture this
> in a comment so that we remember in a couple of months time.
Got it, thanks.
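Something along these lines, perhaps (only a sketch of what the final
comment could say, wording to be refined):

/*
 * Ordering sketch:
 *
 * - static_branch_enable(): the kick_all_cpus_sync() in the static key
 *   machinery guarantees that the patched branch is visible to all CPUs
 *   before T1 takes the mmap write lock, i.e. the lock update becomes
 *   visible only after the branch update.
 *
 * - static_branch_disable(): the atomic_cmpxchg() in
 *   static_key_disable_cpuslocked() provides release semantics prior to
 *   disabling the key, so the branch cannot be observed disabled while
 *   T1 is still in the critical section.
 */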
>
>>> As I mentioned on a previous version, this visibility is not absolute.
>>> You could say that it will be observed by other CPUs before they observe
>>> the write_lock.
>> Hmm. I probably am bad at explaining all of this in English :)
> It's the other way around - I don't think English is appropriate for
> explaining this ;).
>
>>>> + * branch not being visible to T2 therefore guarantees that T1 has
>>>> + * not yet entered CS .... (i)
>>>> + * The sequence of barriers via __flush_tlb_kernel_pgtable() in T2
>>>> + * implies that if the invisibility of the static branch has been
>>>> + * observed by T2 (i.e static_branch_unlikely() is observed as false),
>>>> + * then all CPUs will have observed an empty PUD ... (ii)
>>>> + * Combining (i) and (ii), we conclude that T1 observes an empty PUD
>>>> + * before entering CS => it is impossible for the CS to get hold of
>>>> + * the address of the isolated PMD table. Q.E.D
>>> OK, I nearly got lost. That's not straightforward memory ordering, as
>>> we have instruction updates with ISB as part of the TLB flushing. I
>>> concluded last time I looked that this branch patching part works
>>> because we also have kick_all_cpus_sync() as part of the static branch
>>> update.
>>>
>>> Stepping back to a simpler model, let's say that the static key is a
>>> variable. I wrote this quick test for herd7 and the Linux kernel memory
>>> model (fairly easy to play with):
>>>
>>> -------------------------------------8<---------------------------------------
>>> C pte_free_ptdump
>>>
>>> (*
>>> * $ cd tools/memory-model
>>> * $ herd7 -conf linux-kernel.cfg path/to/pte_free_ptdump.litmus
>>> *)
>>>
>>> {
>>> 	pmd = 1;	(* allocated pmd *)
>>> 	pte_page = 1;	(* valid pte page pointed at by the pmd *)
>>> }
>>>
>>> // pmd_free_pte_page()
>>> P0(spinlock_t *init_mm, int *ptdump_lock_key, int *pmd, int *pte_page)
>>> {
>>> 	WRITE_ONCE(*pmd, 0);			// pmd_clear()
>>> 	smp_mb();
>>> 	if (READ_ONCE(*ptdump_lock_key)) {	// static_branch() approximation
>>> 		spin_lock(init_mm);		// mmap_read_lock()
>>> 		spin_unlock(init_mm);		// mmap_read_unlock()
>>> 	}
>>> 	smp_mb();
>>> 	WRITE_ONCE(*pte_page, 0);		// pte_free_kernel()
>>> }
>>>
>>> // ptdump_walk_pgd()
>>> P1(spinlock_t *init_mm, int *ptdump_lock_key, int *pmd, int *pte_page)
>>> {
>>> 	int val;
>>> 	int page;
>>>
>>> 	WRITE_ONCE(*ptdump_lock_key, 1);	// static_branch_enable()
>>> 	smp_mb();
>>> 	spin_lock(init_mm);			// mmap_write_lock()
>>> 	val = READ_ONCE(*pmd);
>>> 	page = READ_ONCE(*pte_page);		// freed pte page?
>>> 	spin_unlock(init_mm);			// mmap_write_unlock()
>>> 	smp_mb();
>>> 	WRITE_ONCE(*ptdump_lock_key, 0);	// static_branch_disable()
>>> }
>>>
>>> exists(1:val=1 /\ 1:page=0)
>>> -------------------------------------8<---------------------------------------
>>>
>>> I sprinkled in the necessary smp_mb() barriers, but in most cases we
>>> have release/acquire semantics. It does show that we need a barrier before
>>> the page freeing. We also need to acquire for enabling the static key
>>> and release for disabling it.
> The smp_mb() before the write to *pte_page in the model above is only
> needed on the static key disabled path. As Ryan pointed out, on the
> other path we take the mmap_lock and we have acquire semantics.
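>
> I.e., in the model, P0 could equivalently carry the barrier only on
> that path; a sketch (not run through herd as-is):
>
> 	WRITE_ONCE(*pmd, 0);			// pmd_clear()
> 	smp_mb();
> 	if (READ_ONCE(*ptdump_lock_key)) {
> 		spin_lock(init_mm);		// ordering via the mmap_lock
> 		spin_unlock(init_mm);
> 	} else {
> 		smp_mb();			// no lock taken on this path
> 	}
> 	WRITE_ONCE(*pte_page, 0);		// pte_free_kernel()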
>
>>> Next step is to assess that replacing the static key variable read/write
>>> with code patching preserves the model.
>> Thanks for the litmus test!
>> I ran it and I get the same result. But I still cannot figure out where
>> the hole in my proof is.
> TBH, I couldn't fully follow the proof. The arm memory model is based on
> relations between accesses (e.g. read-from, coherence order) rather than
> timestamps. They may be equivalent but that's a whole new (and complex)
> proof to write.
>
> So I'm happy with a simple English explanation and a formal model along
> the lines of the one I posted as the equivalent of a proof.
>
>> I think you raise an objection to Case 2:
> Let's follow your approach; you are missing some timestamps, so I'm
> adding them below:
>
>   T1:                                      T2:
>
>   WRITE_ONCE(*pmd, 0) : t1                 WRITE_ONCE(*ptdump_lock_key, 1) : t3
>   smp_mb() // flush_tlb_kernel_pgtable     cmpxchg(acquire write lock) : t3'
>                                            smp_mb()
>   READ_ONCE(*ptdump_lock_key) = 0 : t2     val = READ_ONCE(*pmd) : t4
>   WRITE_ONCE(*pte_page, 0) : t2'           page = READ_ONCE(*pte_page) : t4'
>                                            smp_mb()
>                                            cmpxchg(release write lock) : t5
>                                            smp_mb()
>                                            WRITE_ONCE(*ptdump_lock_key, 0) : t6
>
>> Let t_i be the global timestamps of the execution of the labelled instructions
>> (note that each labelled instruction is atomic so assigning a timestamp makes sense).
>>
>> The fact that we read ptdump_lock_key as 0 implies that
>> t2 < t3 ... (i)
> Or that t2 > t6, a plausible scenario in the litmus test above.
>
>> Now,
>> t1 < t2 because of barrier (ii)
>> t3 < t4 because of barrier (iii)
>>
>> So we conclude that t1 < t4, so val will be observed as 0.
> Without a barrier between t2 and t2', there's no guarantee that t2' >
> t2. If you remove the last smp_mb() on P0 in my litmus test, you'll see
> it failing (run herd7 with something like "-view evince -show prop" for
> a visualisation of the transitions and relations; you need graphviz
> installed as well).
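>
> In full, the command would look something like this (file name as in
> the test header above):
>
>   $ herd7 -conf linux-kernel.cfg -view evince -show prop pte_free_ptdump.litmus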
>
> The way it fails is that t4 sees the pmd=1 value from before t1, t2
> reads from the t6 write, and t4' reads pte_page=0 from t2' (since
> t2' < t2 is possible without a barrier).
>
> A control dependency would work as well without a barrier, i.e.:
>
> 	if (READ_ONCE(*ptdump_lock_key)) {
> 		mmap_lock();
> 		mmap_unlock();
> 		WRITE_ONCE(*pte_page, 0);
> 	} else {
> 		WRITE_ONCE(*pte_page, 0);
> 	}
>
> but the compiler is probably free to only issue a single WRITE_ONCE()
> irrespective of the ptdump_lock_key check.
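>
> That would be the transformation memory-barriers.txt warns about for
> identical stores on both legs; roughly, the compiler may emit:
>
> 	tmp = READ_ONCE(*ptdump_lock_key);
> 	if (tmp) {
> 		mmap_lock();
> 		mmap_unlock();
> 	}
> 	WRITE_ONCE(*pte_page, 0);	// hoisted, no longer ordered by the branch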
>
> Of course, using smp_load_acquire(ptdump_lock_key) would also work.
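>
> I.e., on the pmd_free_pte_page() side, roughly (still treating the key
> as a plain variable):
>
> 	if (smp_load_acquire(ptdump_lock_key)) {
> 		spin_lock(init_mm);
> 		spin_unlock(init_mm);
> 	}
> 	// the acquire load orders the key read before this write on
> 	// both paths, so no extra smp_mb() is needed
> 	WRITE_ONCE(*pte_page, 0);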
>
> However, things get fuzzier as we don't have a classic load from the
> ptdump_lock_key but rather a patched instruction. We need to guarantee
> that t2' is issued after the t2 branch when the instruction is patched.
> The kick_all_cpus_sync() on the static key disable path doesn't help us
> since P0 (T2 in your description) may see the patched branch/nop and go
> straight to the WRITE_ONCE(*pte_page). Not sure what barrier helps here
> (after more sleep, I may have a better idea tomorrow).
Got it. The hole in my proof is not in Case 2 but in Case 1: the
reasoning assumes that the pmd_free() will be observed after the
patched-in read lock/unlock. That only holds once the patching itself
has been observed, i.e. we need the branch to be observed before the
pmd_free(), and that isn't guaranteed: there is no barrier between the
if block and the pmd_free(), nor is there any control dependency of the
kind you describe above. So, in pud_free_pmd_page, the entire block
from "pmdp = table" to "pmd_free()" can be observed before the branch
is. Reading tools/memory-model/Documentation/control-dependencies.txt,
I understand that the compiler is also free to hoist the WRITE_ONCE()
out of the control block, and then we have the same problem, BUT I
tested with herd and the test passes :)
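My guess is that this is because herd evaluates the litmus source as
written and does not model compiler transformations, so the control
dependency survives in the tool even though a compiler may destroy it.
I.e., a P0 written along these lines (store duplicated into both
branches) would be expected to pass:

	// control dependency from the key read to both stores; herd
	// honours this as written, a real compiler may merge the stores
	if (READ_ONCE(*ptdump_lock_key)) {
		spin_lock(init_mm);		// mmap_read_lock()
		spin_unlock(init_mm);		// mmap_read_unlock()
		WRITE_ONCE(*pte_page, 0);	// pte_free_kernel()
	} else {
		WRITE_ONCE(*pte_page, 0);	// pte_free_kernel()
	}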
>