[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <7eb0fe4d-a36d-42b2-bef7-88a9d9236428@arm.com>
Date: Mon, 14 Jul 2025 20:57:31 +0530
From: Dev Jain <dev.jain@....com>
To: Catalin Marinas <catalin.marinas@....com>
Cc: will@...nel.org, anshuman.khandual@....com, quic_zhenhuah@...cinc.com,
ryan.roberts@....com, kevin.brodsky@....com, yangyicong@...ilicon.com,
joey.gouly@....com, linux-arm-kernel@...ts.infradead.org,
linux-kernel@...r.kernel.org, david@...hat.com
Subject: Re: [PATCH v4] arm64: Enable vmalloc-huge with ptdump
On 14/07/25 2:56 pm, Dev Jain wrote:
>
> On 07/07/25 6:14 am, Catalin Marinas wrote:
>> On Fri, Jul 04, 2025 at 05:12:13PM +0530, Dev Jain wrote:
>>> On 04/07/25 4:52 pm, Catalin Marinas wrote:
>>>> On Thu, Jun 26, 2025 at 10:55:24AM +0530, Dev Jain wrote:
>>>>> @@ -1301,16 +1314,39 @@ int pud_free_pmd_page(pud_t *pudp,
>>>>> unsigned long addr)
>>>>> }
>>>>> table = pmd_offset(pudp, addr);
>>>>> + /*
>>>>> + * Isolate the PMD table; in case of race with ptdump, this
>>>>> helps
>>>>> + * us to avoid taking the lock in __pmd_free_pte_page().
>>>>> + *
>>>>> + * Static key logic:
>>>>> + *
>>>>> + * Case 1: If ptdump does static_branch_enable(), and after
>>>>> that we
>>>>> + * execute the if block, then this patches in the read lock,
>>>>> ptdump has
>>>>> + * the write lock patched in, therefore ptdump will never
>>>>> read from
>>>>> + * a potentially freed PMD table.
>>>>> + *
>>>>> + * Case 2: If the if block starts executing before ptdump's
>>>>> + * static_branch_enable(), then no locking synchronization
>>>>> + * will be done. However, pud_clear() + the dsb() in
>>>>> + * __flush_tlb_kernel_pgtable will ensure that ptdump
>>>>> observes an
>> [...]
>>>> I don't get case 2. You want to ensure pud_clear() is observed by the
>>>> ptdump code before the pmd_free(). The DSB in the TLB flushing code
>>>> ensures some ordering between the pud_clear() and presumably something
>>>> that the ptdump code can observe as well. Would that be the mmap
>>>> semaphore? However, the read_lock would only be attempted if this code
>>>> is seeing the static branch update, which is not guaranteed. I don't
>>>> think it even matters since the lock may be released anyway before the
>>>> write_lock in ptdump.
>>>>
>>>> For example, you do a pud_clear() above, skip the whole static branch.
>>>> The ptdump comes along on another CPU but does not observe the
>>>> pud_clear() since there's no synchronisation. It goes ahead and
>>>> dereferences it while this CPU does a pmd_free().
>>> The objective is: ptdump must not dereference a freed pagetable.
>>> So for your example, if the static branch is not observed by
>>> pud_free_pmd_page, this means that ptdump will take the write lock
>>> after the execution of flush_tlb_kernel_pagetable completes (for if
>>> ptdump takes
>>> the write lock before execution of flush_tlb_kernel_pagetable
>>> completes, we have
>>> executed static_branch_enable(), contradiction).
>> I don't see why the write lock matters since pud_free_pmd_page() doesn't
>
> True.
>
>> take the read lock in the second scenario. What we need is acquire
>> semantics after the static branch update on the ptdump path but we get
>> it before we even attempt the write lock.
>>
>> For simplicity, ignoring the observability of instruction writes and
>> considering the static branch a variable, if pud_free_pmd_page() did not
>> observe the static branch update, is the ptdump guaranteed to see the
>> cleared pud subsequently?
>>
>> With initial state pud=1 (non-zero), stb=0 (static branch):
>>
>> P0 (pud_free_pmd_page) P1 (ptdump)
>>
>> W_pud=0 W_stb=1
>> DSB barrier/acq
>> R_stb=0 R_pud=?
>>
>> The write to the static branch on P1 will be ordered after the read of
>> the branch on P0, so the pud will be seen as 0. It's not even worth
>> mentioning the semaphore here as the static branch update has enough
>> barriers for cache flushing and kick_all_cpus_sync().
>>
>>
>> The other scenario is P0 (pud_free_pmd_page) observing the write to the
>> static branch (that's case 1 in your comment). This doesn't say anything
>> about whether P1 (ptdump) sees a clear or valid pud. What we do know is
>> that P0 will try to acquire (and release) the lock. If P1 already
>> acquired the write lock, P0 will wait and the state of the pud is
>> irrelevant (no freeing). Similarly if P1 already completed by the time
>> P0 takes the lock.
>>
>> If P0 takes the lock first, the lock release guarantees that the
>> pud_clear() is seen by the ptdump code _after_ it acquired the lock.
>>
>>
>> W.r.t. the visibility of the branch update vs pud access, the
>> combinations of DSB+ISB (part of the TLB flushing) on P0 and cache
>> maintenance to PoU together with kick_all_cpus_sync() on P1 should
>> suffice.
>>
>> I think the logic works (though I'm heavily jetlagged and writing from a
>> plane) but the comments need to be improved. As described above, case 1
>> has two sub-cases depending on when P0 runs in relation to the write
>> lock (before or during/after). And the write lock doesn't matter for
>> case 2.
>>
>>>> And I can't get my head around memory ordering, it doesn't look sound.
>>>> static_branch_enable() I don't think has acquire semantics, at
>>>> least not
>>>> in relation to the actual branch update. The static_branch_unlikely()
>>>> test, again, not sure what guarantees it has (I don't think it has any
>>>> in relation to memory loads). Maybe you have worked it all out and is
>>>> fine but it needs a better explanation and ideally some simple formal
>>>> model. Show it's correct with a global variable first and then we can
>>>> optimise with static branches.
>>> What do you suggest? As in what do you mean by showing its correct with
>>> a global variable first...and, for the formal model thingy, do you
>>> want mathematical rigor similar to what you explain in [1] :P,
>>> because unfortunately
>>> (and quite embarrassingly) I didn't have formal methods in college : )
>> Neither did I ;) (mostly analogue electronics). I was thinking something
>> like our cat/herd tools where you can write some simple assembly. It's a
>> good investment if you want to give it a try.
>
>
> Will the following proof work -
>
> Proof of correctness: The below diagram represents pud_free_pmd_page
> executing on P0 and ptdump executing on P1. Note that, we can ignore
> the situation when processes migrate to another CPU, since we will
> have extra barriers because of switch_to(), and all of the embedded
> barriers that are used in the reasoning of the proof below apply
> to the inner shareable domain, and therefore will be observed by
> all CPUs, therefore it suffices to prove for the case where
> pud_free_pmd_page executes completely on P0 and ptdump
> executes completely on P1.
>
> Let t_i, 0 <= i <= 8 denote the *global* timestamp taken for the
> corresponding
> instruction to complete. Therefore from here on we do not need to use
> the term
> "observe" in a relative context. Let t_i' (t_i dash) denote the global
> timestamp
> for the corresponding instruction to start. That is, an instruction
> labelled
> with t_i implies that it started at t_i' and finished at t_i.
>
>
> P0 P1:
>
> W_PUD = 0: t0 x = 1: t2
>
> if (x == 1) {: t7 write lock: t3
> read lock: t6 R_PUD = 1: t4
> read unlock: t8 write unlock: t5
> }
>
> Free PUD: t1
>
> We need to prove that ptdump completely finishes before
> we free the PUD. Since write unlock has release semantics,
> if the write unlock finishes, it is guaranteed that ptdump
> has finished => it suffices to prove that t5 < t1'.
>
>
> R_PUD = 1 => t4 < t0 .... (i)
>
> Because of acquire semantics of down_write(rw_semaphore lock),
> t3 < t4' .... (ii)
>
> (i) and (ii) (and t4' < t4) => t3 < t0 ... (iii)
>
> ptdump is executed on a single kernel thread, which implies
> that the transition x = 1 -> x = 1 will never happen; that is,
> when static_branch_enable() is executed, then x was 0, which
> means that the call chain static_key_enable ->
> static_key_enable_cpuslocked
> -> jump_label_update -> jump_label_can_update/
> arch_jump_label_transform_apply
> -> kick_all_cpus_sync -> smp_mb -> __smp_mb -> dmb(ish) will always be
> followed.
> The emission of dmb(ish) => t2 < t3 ... (iv)
>
> (iii) and (iv) => t2 < t0, also, flush_tlb_kernel_pgtable -> dsb(ish)
> ensures that t0 < t7'
> => t2 < t7' => the static branch is observed by P0 always => t6 and t8
> exist.
>
> Now, t0 < t6' because of flush_tlb_kernel_pgtable; combining with
> (iii), this gives
> us t3 < t6' => the write lock is observed first => t5 < t6 (the read
> lock cannot
> be taken before the write unlock finishes) ...(v)
>
> Release semantics of read unlock => t8 < t1' ...(vi)
> Also, trivially t6 < t8...(vii)
>
> Combining v, vi and vii, t5 < t1'. Q.E.D
>
>
For the sake of rigor, t_i should be defined w.r.t loads and stores
rather than
start and end of instruction, since theoretically, for example, t5 > t6
is possible,
if t5 has a lot of, let's say, statistics updation operations. So, the
relevant store
of write_unlock will finish, and the read lock will completely execute
before
we actually exit the write_unlock due to the instructions embedded in
write_unlock
which are not relevant to us.
>>
>
Powered by blists - more mailing lists