Message-ID: <c53ec40c-1fe4-4092-a876-61d5b37d8b02@arm.com>
Date: Fri, 1 Aug 2025 17:45:53 +0530
From: Dev Jain <dev.jain@....com>
To: Catalin Marinas <catalin.marinas@....com>
Cc: will@...nel.org, anshuman.khandual@....com, quic_zhenhuah@...cinc.com,
ryan.roberts@....com, kevin.brodsky@....com, yangyicong@...ilicon.com,
joey.gouly@....com, linux-arm-kernel@...ts.infradead.org,
linux-kernel@...r.kernel.org, david@...hat.com, mark.rutland@....com,
urezki@...il.com
Subject: Re: [RESEND PATCH v5] arm64: Enable vmalloc-huge with ptdump
On 31/07/25 10:36 pm, Catalin Marinas wrote:
> On Thu, Jul 31, 2025 at 12:42:21PM +0530, Dev Jain wrote:
>> On 30/07/25 10:30 pm, Catalin Marinas wrote:
>>> On Wed, Jul 23, 2025 at 09:48:27PM +0530, Dev Jain wrote:
>>>> + * Case 2: The static branch is not visible to T2.
>>>> + *
>>>> + * Since static_branch_enable() (via dmb(ish)) and mmap_write_lock()
>>>> + * have acquire semantics, it is guaranteed that the static branch
>>>> + * will be visible to all CPUs before T1 can enter CS. The static
>>> Does static_branch_disable() have release semantics? I think it does via
>>> some atomic_cmpxchg() but it's worth spelling it out.
>> Yes, it *should* have, but this proof is already getting complicated, so
>> I thought: let's not mention anything that is not required in the
>> reasoning of the proof :)
> The static branch observability relative to the mmap_lock is essential,
> both when enabling and disabling the static key (i.e. you don't want the
> static branch to be seen disabled while T1 is in the critical section).
> That's why it's worth mentioning.
>
> On the static_branch_enable() path, we have a kick_all_cpus_sync(), so
> the T1 update of mmap lock will become visible after the branch update
> (of course, also subject to correct ordering on T2, see more at the end
> of the email).
>
> On the static_branch_disable() path, we need it to have release
> semantics _prior_ to disabling the key. We get this via the
> atomic_cmpxchg() in static_key_disable_cpuslocked().
>
> So I think it's all good here from the T1 perspective, just capture this
> in a comment so that we remember in a couple of months time.
Got it, thanks.
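Something along these lines, perhaps (only a sketch of what the final
comment could say, wording to be refined):

/*
 * Ordering sketch:
 *
 * - static_branch_enable(): the kick_all_cpus_sync() in the static key
 *   machinery guarantees that the patched branch is visible to all CPUs
 *   before T1 takes the mmap write lock, i.e. the lock update becomes
 *   visible only after the branch update.
 *
 * - static_branch_disable(): the atomic_cmpxchg() in
 *   static_key_disable_cpuslocked() provides release semantics prior to
 *   disabling the key, so the branch cannot be observed disabled while
 *   T1 is still in the critical section.
 */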
>
>>> As I mentioned on a previous version, this visibility is not absolute.
>>> You could say that it will be observed by other CPUs before they observe
>>> the write_lock.
>> Hmm. I probably am bad at explaining all of this in English :)
> It's the other way around - I don't think English is appropriate for
> explaining this ;).
>
>>>> + * branch not being visible to T2 therefore guarantees that T1 has
>>>> + * not yet entered CS .... (i)
>>>> + * The sequence of barriers via __flush_tlb_kernel_pgtable() in T2
>>>> + * implies that if the invisibility of the static branch has been
>>>> + * observed by T2 (i.e static_branch_unlikely() is observed as false),
>>>> + * then all CPUs will have observed an empty PUD ... (ii)
>>>> + * Combining (i) and (ii), we conclude that T1 observes an empty PUD
>>>> + * before entering CS => it is impossible for the CS to get hold of
>>>> + * the address of the isolated PMD table. Q.E.D
>>> OK, I nearly got lost. That's not straightforward memory ordering, as
>>> we have instruction updates with ISB as part of the TLB flushing. I
>>> concluded last time I looked that this branch patching part works
>>> because we also have kick_all_cpus_sync() as part of the static branch
>>> update.
>>>
>>> Stepping back to a simpler model, let's say that the static key is a
>>> variable. I wrote this quick test for herd7 and the Linux kernel memory
>>> model (fairly easy to play with):
>>>
>>> -------------------------------------8<---------------------------------------
>>> C pte_free_ptdump
>>>
>>> (*
>>> * $ cd tools/memory-model
>>> * $ herd7 -conf linux-kernel.cfg path/to/pte_free_ptdump.litmus
>>> *)
>>>
>>> {
>>> 	pmd = 1;	(* allocated pmd *)
>>> 	pte_page = 1;	(* valid pte page pointed at by the pmd *)
>>> }
>>>
>>> // pmd_free_pte_page()
>>> P0(spinlock_t *init_mm, int *ptdump_lock_key, int *pmd, int *pte_page)
>>> {
>>> 	WRITE_ONCE(*pmd, 0);			// pmd_clear()
>>> 	smp_mb();
>>> 	if (READ_ONCE(*ptdump_lock_key)) {	// static_branch() approximation
>>> 		spin_lock(init_mm);		// mmap_read_lock()
>>> 		spin_unlock(init_mm);		// mmap_read_unlock()
>>> 	}
>>> 	smp_mb();
>>> 	WRITE_ONCE(*pte_page, 0);		// pte_free_kernel()
>>> }
>>>
>>> // ptdump_walk_pgd()
>>> P1(spinlock_t *init_mm, int *ptdump_lock_key, int *pmd, int *pte_page)
>>> {
>>> 	int val;
>>> 	int page;
>>>
>>> 	WRITE_ONCE(*ptdump_lock_key, 1);	// static_branch_enable()
>>> 	smp_mb();
>>> 	spin_lock(init_mm);			// mmap_write_lock()
>>> 	val = READ_ONCE(*pmd);
>>> 	page = READ_ONCE(*pte_page);		// freed pte page?
>>> 	spin_unlock(init_mm);			// mmap_write_unlock()
>>> 	smp_mb();
>>> 	WRITE_ONCE(*ptdump_lock_key, 0);	// static_branch_disable()
>>> }
>>>
>>> exists(1:val=1 /\ 1:page=0)
>>> -------------------------------------8<---------------------------------------
>>>
>>> I sprinkled in the necessary smp_mb() barriers, but in most cases we
>>> have release/acquire semantics. It does show that we need a barrier before
>>> the page freeing. We also need to acquire for enabling the static key
>>> and release for disabling it.
> The smp_mb() before the write to *pte_page in the model above is only
> needed on the static key disabled path. As Ryan pointed out, on the
> other path we take the mmap_lock and we have acquire semantics.
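>
> I.e., in the model, P0 could equivalently carry the barrier only on
> that path; a sketch (not run through herd as-is):
>
> 	WRITE_ONCE(*pmd, 0);			// pmd_clear()
> 	smp_mb();
> 	if (READ_ONCE(*ptdump_lock_key)) {
> 		spin_lock(init_mm);		// ordering via the mmap_lock
> 		spin_unlock(init_mm);
> 	} else {
> 		smp_mb();			// no lock taken on this path
> 	}
> 	WRITE_ONCE(*pte_page, 0);		// pte_free_kernel()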
>
>>> Next step is to assess that replacing the static key variable read/write
>>> with code patching preserves the model.
>> Thanks for the litmus test!
>> I ran it and I get the same result. But I still cannot figure out where
>> the hole in my proof is.
> TBH, I couldn't fully follow the proof. The arm memory model is based on
> relations between accesses (e.g. read-from, coherence order) rather than
> timestamps. They may be equivalent but that's a whole new (and complex)
> proof to write.
>
> So I'm happy with a simple English explanation and a formal model along
> the lines of the one I posted as the equivalent of a proof.
>
>> I think you raise an objection to Case 2:
> Let's follow your approach; you are missing some timestamps, so I'm
> adding them below:
>
>   T1:                                      T2:
>
>   WRITE_ONCE(*pmd, 0) : t1                 WRITE_ONCE(*ptdump_lock_key, 1) : t3
>   smp_mb() // flush_tlb_kernel_pgtable     cmpxchg(acquire write lock) : t3'
>                                            smp_mb()
>   READ_ONCE(*ptdump_lock_key) = 0 : t2     val = READ_ONCE(*pmd) : t4
>   WRITE_ONCE(*pte_page, 0) : t2'           page = READ_ONCE(*pte_page) : t4'
>                                            smp_mb()
>                                            cmpxchg(release write lock) : t5
>                                            smp_mb()
>                                            WRITE_ONCE(*ptdump_lock_key, 0) : t6
>
>> Let t_i be the global timestamps of the execution of the labelled instructions
>> (note that each labelled instruction is atomic so assigning a timestamp makes sense).
>>
>> The fact that we read ptdump_lock_key as 0 implies that
>> t2 < t3 ... (i)
> Or that t2 > t6, a plausible scenario in the litmus test above.
>
>> Now,
>> t1 < t2 because of barrier (ii)
>> t3 < t4 because of barrier (iii)
>>
>> So we conclude that t1 < t4, so val will be observed as 0.
> Without a barrier between t2 and t2', there's no guarantee that t2' >
> t2. If you remove the last smp_mb() on P0 in my litmus test, you'll see
> it failing (run herd7 with something like "-view evince -show prop" for
> a visualisation of the transitions and relations; you need graphviz
> installed as well).
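>
> In full, the command would look something like this (file name as in
> the test header above):
>
>   $ herd7 -conf linux-kernel.cfg -view evince -show prop pte_free_ptdump.litmus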
>
> The way it fails is that t4 sees the pmd=1 value from before t1, t2
> reads from the t6 write, and t4' reads pte_page=0 from t2' (since
> t2' < t2 is possible without a barrier).
>
> A control dependency would work as well without a barrier, i.e.:
>
> 	if (READ_ONCE(*ptdump_lock_key)) {
> 		mmap_lock();
> 		mmap_unlock();
> 		WRITE_ONCE(*pte_page, 0);
> 	} else {
> 		WRITE_ONCE(*pte_page, 0);
> 	}
>
> but the compiler is probably free to only issue a single WRITE_ONCE()
> irrespective of the ptdump_lock_key check.
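>
> That would be the transformation memory-barriers.txt warns about for
> identical stores on both legs; roughly, the compiler may emit:
>
> 	tmp = READ_ONCE(*ptdump_lock_key);
> 	if (tmp) {
> 		mmap_lock();
> 		mmap_unlock();
> 	}
> 	WRITE_ONCE(*pte_page, 0);	// hoisted, no longer ordered by the branch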
>
> Of course, using smp_load_acquire(ptdump_lock_key) would also work.
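>
> I.e., on the pmd_free_pte_page() side, roughly (still treating the key
> as a plain variable):
>
> 	if (smp_load_acquire(ptdump_lock_key)) {
> 		spin_lock(init_mm);
> 		spin_unlock(init_mm);
> 	}
> 	// the acquire load orders the key read before this write on
> 	// both paths, so no extra smp_mb() is needed
> 	WRITE_ONCE(*pte_page, 0);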
>
> However, things get fuzzier as we don't have a classic load from the
> ptdump_lock_key but rather a patched instruction. We need to guarantee
> that t2' is issued after the t2 branch when the instruction is patched.
> The kick_all_cpus_sync() on the static key disable path doesn't help us
> since P0 (T2 in your description) may see the patched branch/nop and go
> straight to the WRITE_ONCE(*pte_page). Not sure what barrier helps here
> (after more sleep, I may have a better idea tomorrow).
Got it. The hole in my proof is not in Case 2 but in Case 1: the
reasoning assumes that the pmd_free() will be observed after the
patched-in read lock/unlock. That only holds once the patching itself
has been observed, i.e. we need the branch to be observed before the
pmd_free(), and that isn't guaranteed: there is no barrier between the
if block and the pmd_free(), nor is there any control dependency of the
kind you describe above. So, in pud_free_pmd_page, the entire block
from "pmdp = table" to "pmd_free()" can be observed before the branch
is. Reading tools/memory-model/Documentation/control-dependencies.txt,
I understand that the compiler is also free to hoist the WRITE_ONCE()
out of the control block, and then we have the same problem, BUT I
tested with herd and the test passes :)
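My guess is that this is because herd evaluates the litmus source as
written and does not model compiler transformations, so the control
dependency survives in the tool even though a compiler may destroy it.
I.e., a P0 written along these lines (store duplicated into both
branches) would be expected to pass:

	// control dependency from the key read to both stores; herd
	// honours this as written, a real compiler may merge the stores
	if (READ_ONCE(*ptdump_lock_key)) {
		spin_lock(init_mm);		// mmap_read_lock()
		spin_unlock(init_mm);		// mmap_read_unlock()
		WRITE_ONCE(*pte_page, 0);	// pte_free_kernel()
	} else {
		WRITE_ONCE(*pte_page, 0);	// pte_free_kernel()
	}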
>