linux-kernel - Re: [PATCH v4] arm64: Enable vmalloc-huge with ptdump

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <7eb0fe4d-a36d-42b2-bef7-88a9d9236428@arm.com>
Date: Mon, 14 Jul 2025 20:57:31 +0530
From: Dev Jain <dev.jain@....com>
To: Catalin Marinas <catalin.marinas@....com>
Cc: will@...nel.org, anshuman.khandual@....com, quic_zhenhuah@...cinc.com,
 ryan.roberts@....com, kevin.brodsky@....com, yangyicong@...ilicon.com,
 joey.gouly@....com, linux-arm-kernel@...ts.infradead.org,
 linux-kernel@...r.kernel.org, david@...hat.com
Subject: Re: [PATCH v4] arm64: Enable vmalloc-huge with ptdump


On 14/07/25 2:56 pm, Dev Jain wrote:
>
> On 07/07/25 6:14 am, Catalin Marinas wrote:
>> On Fri, Jul 04, 2025 at 05:12:13PM +0530, Dev Jain wrote:
>>> On 04/07/25 4:52 pm, Catalin Marinas wrote:
>>>> On Thu, Jun 26, 2025 at 10:55:24AM +0530, Dev Jain wrote:
>>>>> @@ -1301,16 +1314,39 @@ int pud_free_pmd_page(pud_t *pudp, 
>>>>> unsigned long addr)
>>>>>        }
>>>>>        table = pmd_offset(pudp, addr);
>>>>> +    /*
>>>>> +     * Isolate the PMD table; in case of race with ptdump, this 
>>>>> helps
>>>>> +     * us to avoid taking the lock in __pmd_free_pte_page().
>>>>> +     *
>>>>> +     * Static key logic:
>>>>> +     *
>>>>> +     * Case 1: If ptdump does static_branch_enable(), and after 
>>>>> that we
>>>>> +     * execute the if block, then this patches in the read lock, 
>>>>> ptdump has
>>>>> +     * the write lock patched in, therefore ptdump will never 
>>>>> read from
>>>>> +     * a potentially freed PMD table.
>>>>> +     *
>>>>> +     * Case 2: If the if block starts executing before ptdump's
>>>>> +     * static_branch_enable(), then no locking synchronization
>>>>> +     * will be done. However, pud_clear() + the dsb() in
>>>>> +     * __flush_tlb_kernel_pgtable will ensure that ptdump 
>>>>> observes an
>> [...]
>>>> I don't get case 2. You want to ensure pud_clear() is observed by the
>>>> ptdump code before the pmd_free(). The DSB in the TLB flushing code
>>>> ensures some ordering between the pud_clear() and presumably something
>>>> that the ptdump code can observe as well. Would that be the mmap
>>>> semaphore? However, the read_lock would only be attempted if this code
>>>> is seeing the static branch update, which is not guaranteed. I don't
>>>> think it even matters since the lock may be released anyway before the
>>>> write_lock in ptdump.
>>>>
>>>> For example, you do a pud_clear() above, skip the whole static branch.
>>>> The ptdump comes along on another CPU but does not observe the
>>>> pud_clear() since there's no synchronisation. It goes ahead and
>>>> dereferences it while this CPU does a pmd_free().
>>> The objective is: ptdump must not dereference a freed pagetable.
>>> So for your example, if the static branch is not observed by
>>> pud_free_pmd_page, this means that ptdump will take the write lock
>>> after the execution of flush_tlb_kernel_pagetable completes (for if 
>>> ptdump takes
>>> the write lock before execution of flush_tlb_kernel_pagetable 
>>> completes, we have
>>> executed static_branch_enable(), contradiction).
>> I don't see why the write lock matters since pud_free_pmd_page() doesn't
>
> True.
>
>> take the read lock in the second scenario. What we need is acquire
>> semantics after the static branch update on the ptdump path but we get
>> it before we even attempt the write lock.
>>
>> For simplicity, ignoring the observability of instruction writes and
>> considering the static branch a variable, if pud_free_pmd_page() did not
>> observe the static branch update, is the ptdump guaranteed to see the
>> cleared pud subsequently?
>>
>> With initial state pud=1 (non-zero), stb=0 (static branch):
>>
>> P0 (pud_free_pmd_page)        P1 (ptdump)
>>
>>      W_pud=0               W_stb=1
>>      DSB                   barrier/acq
>>      R_stb=0               R_pud=?
>>
>> The write to the static branch on P1 will be ordered after the read of
>> the branch on P0, so the pud will be seen as 0. It's not even worth
>> mentioning the semaphore here as the static branch update has enough
>> barriers for cache flushing and kick_all_cpus_sync().
>>
>>
>> The other scenario is P0 (pud_free_pmd_page) observing the write to the
>> static branch (that's case 1 in your comment). This doesn't say anything
>> about whether P1 (ptdump) sees a clear or valid pud. What we do know is
>> that P0 will try to acquire (and release) the lock. If P1 already
>> acquired the write lock, P0 will wait and the state of the pud is
>> irrelevant (no freeing). Similarly if P1 already completed by the time
>> P0 takes the lock.
>>
>> If P0 takes the lock first, the lock release guarantees that the
>> pud_clear() is seen by the ptdump code _after_ it acquired the lock.
>>
>>
>> W.r.t. the visibility of the branch update vs pud access, the
>> combinations of DSB+ISB (part of the TLB flushing) on P0 and cache
>> maintenance to PoU together with kick_all_cpus_sync() on P1 should
>> suffice.
>>
>> I think the logic works (though I'm heavily jetlagged and writing from a
>> plane) but the comments need to be improved. As described above, case 1
>> has two sub-cases depending on when P0 runs in relation to the write
>> lock (before or during/after). And the write lock doesn't matter for
>> case 2.
>>
>>>> And I can't get my head around memory ordering, it doesn't look sound.
>>>> static_branch_enable() I don't think has acquire semantics, at 
>>>> least not
>>>> in relation to the actual branch update. The static_branch_unlikely()
>>>> test, again, not sure what guarantees it has (I don't think it has any
>>>> in relation to memory loads). Maybe you have worked it all out and is
>>>> fine but it needs a better explanation and ideally some simple formal
>>>> model. Show it's correct with a global variable first and then we can
>>>> optimise with static branches.
>>> What do you suggest? As in what do you mean by showing its correct with
>>> a global variable first...and, for the formal model thingy, do you
>>> want mathematical rigor similar to what you explain in [1] :P, 
>>> because unfortunately
>>> (and quite embarrassingly) I didn't have formal methods in college : )
>> Neither did I ;) (mostly analogue electronics). I was thinking something
>> like our cat/herd tools where you can write some simple assembly. It's a
>> good investment if you want to give it a try.
>
>
> Will the following proof work -
>
> Proof of correctness: The below diagram represents pud_free_pmd_page
> executing on P0 and ptdump executing on P1. Note that, we can ignore
> the situation when processes migrate to another CPU, since we will
> have extra barriers because of switch_to(), and all of the embedded
> barriers that are used in the reasoning of the proof below apply
> to the inner shareable domain, and therefore will be observed by
> all CPUs, therefore it suffices to prove for the case where
> pud_free_pmd_page executes completely on P0 and ptdump
> executes completely on P1.
>
> Let t_i, 0 <= i <= 8 denote the *global* timestamp taken for the 
> corresponding
> instruction to complete. Therefore from here on we do not need to use 
> the term
> "observe" in a relative context. Let t_i' (t_i dash) denote the global 
> timestamp
> for the corresponding instruction to start. That is, an instruction 
> labelled
> with t_i implies that it started at t_i' and finished at t_i.
>
>
> P0                P1:
>
> W_PUD = 0: t0            x = 1: t2
>
> if (x == 1) {: t7            write lock: t3
>     read lock: t6        R_PUD = 1: t4
>     read unlock: t8        write unlock: t5
> }
>
> Free PUD: t1
>
> We need to prove that ptdump completely finishes before
> we free the PUD. Since write unlock has release semantics,
> if the write unlock finishes, it is guaranteed that ptdump
> has finished => it suffices to prove that t5 < t1'.
>
>
> R_PUD = 1 => t4 < t0 .... (i)
>
> Because of acquire semantics of down_write(rw_semaphore lock),
> t3 < t4' .... (ii)
>
> (i) and (ii) (and t4' < t4) => t3 < t0 ... (iii)
>
> ptdump is executed on a single kernel thread, which implies
> that the transition x = 1 -> x = 1 will never happen; that is,
> when static_branch_enable() is executed, then x was 0, which
> means that the call chain static_key_enable -> 
> static_key_enable_cpuslocked
> -> jump_label_update -> jump_label_can_update/ 
> arch_jump_label_transform_apply
> -> kick_all_cpus_sync -> smp_mb -> __smp_mb -> dmb(ish) will always be 
> followed.
> The emission of dmb(ish) => t2 < t3 ... (iv)
>
> (iii) and (iv) => t2 < t0, also, flush_tlb_kernel_pgtable -> dsb(ish) 
> ensures that t0 < t7'
> => t2 < t7' => the static branch is observed by P0 always => t6 and t8 
> exist.
>
> Now, t0 < t6' because of flush_tlb_kernel_pgtable; combining with 
> (iii), this gives
> us t3 < t6' => the write lock is observed first => t5 < t6 (the read 
> lock cannot
> be taken before the write unlock finishes) ...(v)
>
> Release semantics of read unlock => t8 < t1' ...(vi)
> Also, trivially t6 < t8...(vii)
>
> Combining v, vi and vii, t5 < t1'. Q.E.D
>
>
For the sake of rigor, t_i should be defined w.r.t loads and stores 
rather than

start and end of instruction, since theoretically, for example, t5 > t6 
is possible,

if t5 has a lot of, let's say, statistics updation operations. So, the 
relevant store

of write_unlock will finish, and the read lock will completely execute 
before

we actually exit the write_unlock due to the instructions embedded in 
write_unlock

which are not relevant to us.


>>
>