Message-ID: <e1e87f16-1c48-481b-8f7c-9333ac5d13e7@arm.com>
Date: Mon, 14 Jul 2025 14:56:23 +0530
From: Dev Jain <dev.jain@....com>
To: Catalin Marinas <catalin.marinas@....com>
Cc: will@...nel.org, anshuman.khandual@....com, quic_zhenhuah@...cinc.com,
 ryan.roberts@....com, kevin.brodsky@....com, yangyicong@...ilicon.com,
 joey.gouly@....com, linux-arm-kernel@...ts.infradead.org,
 linux-kernel@...r.kernel.org, david@...hat.com
Subject: Re: [PATCH v4] arm64: Enable vmalloc-huge with ptdump


On 07/07/25 6:14 am, Catalin Marinas wrote:
> On Fri, Jul 04, 2025 at 05:12:13PM +0530, Dev Jain wrote:
>> On 04/07/25 4:52 pm, Catalin Marinas wrote:
>>> On Thu, Jun 26, 2025 at 10:55:24AM +0530, Dev Jain wrote:
>>>> @@ -1301,16 +1314,39 @@ int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
>>>>    	}
>>>>    	table = pmd_offset(pudp, addr);
>>>> +	/*
>>>> +	 * Isolate the PMD table; in case of race with ptdump, this helps
>>>> +	 * us to avoid taking the lock in __pmd_free_pte_page().
>>>> +	 *
>>>> +	 * Static key logic:
>>>> +	 *
>>>> +	 * Case 1: If ptdump does static_branch_enable(), and after that we
>>>> +	 * execute the if block, then this patches in the read lock, ptdump has
>>>> +	 * the write lock patched in, therefore ptdump will never read from
>>>> +	 * a potentially freed PMD table.
>>>> +	 *
>>>> +	 * Case 2: If the if block starts executing before ptdump's
>>>> +	 * static_branch_enable(), then no locking synchronization
>>>> +	 * will be done. However, pud_clear() + the dsb() in
>>>> +	 * __flush_tlb_kernel_pgtable will ensure that ptdump observes an
> [...]
>>> I don't get case 2. You want to ensure pud_clear() is observed by the
>>> ptdump code before the pmd_free(). The DSB in the TLB flushing code
>>> ensures some ordering between the pud_clear() and presumably something
>>> that the ptdump code can observe as well. Would that be the mmap
>>> semaphore? However, the read_lock would only be attempted if this code
>>> is seeing the static branch update, which is not guaranteed. I don't
>>> think it even matters since the lock may be released anyway before the
>>> write_lock in ptdump.
>>>
>>> For example, you do a pud_clear() above, skip the whole static branch.
>>> The ptdump comes along on another CPU but does not observe the
>>> pud_clear() since there's no synchronisation. It goes ahead and
>>> dereferences it while this CPU does a pmd_free().
>> The objective is: ptdump must not dereference a freed pagetable.
>> So for your example, if the static branch is not observed by
>> pud_free_pmd_page, then ptdump will take the write lock only after
>> __flush_tlb_kernel_pgtable completes (for if ptdump took the write lock
>> before __flush_tlb_kernel_pgtable completed, we would already have
>> executed static_branch_enable(), contradiction).
> I don't see why the write lock matters since pud_free_pmd_page() doesn't

True.

> take the read lock in the second scenario. What we need is acquire
> semantics after the static branch update on the ptdump path but we get
> it before we even attempt the write lock.
>
> For simplicity, ignoring the observability of instruction writes and
> considering the static branch a variable, if pud_free_pmd_page() did not
> observe the static branch update, is the ptdump guaranteed to see the
> cleared pud subsequently?
>
> With initial state pud=1 (non-zero), stb=0 (static branch):
>
> P0 (pud_free_pmd_page)		P1 (ptdump)
>
>      W_pud=0			   W_stb=1
>      DSB				   barrier/acq
>      R_stb=0			   R_pud=?
>
> The write to the static branch on P1 will be ordered after the read of
> the branch on P0, so the pud will be seen as 0. It's not even worth
> mentioning the semaphore here as the static branch update has enough
> barriers for cache flushing and kick_all_cpus_sync().
>
>
> The other scenario is P0 (pud_free_pmd_page) observing the write to the
> static branch (that's case 1 in your comment). This doesn't say anything
> about whether P1 (ptdump) sees a clear or valid pud. What we do know is
> that P0 will try to acquire (and release) the lock. If P1 already
> acquired the write lock, P0 will wait and the state of the pud is
> irrelevant (no freeing). Similarly if P1 already completed by the time
> P0 takes the lock.
>
> If P0 takes the lock first, the lock release guarantees that the
> pud_clear() is seen by the ptdump code _after_ it acquired the lock.
>
>
> W.r.t. the visibility of the branch update vs pud access, the
> combinations of DSB+ISB (part of the TLB flushing) on P0 and cache
> maintenance to PoU together with kick_all_cpus_sync() on P1 should
> suffice.
>
> I think the logic works (though I'm heavily jetlagged and writing from a
> plane) but the comments need to be improved. As described above, case 1
> has two sub-cases depending on when P0 runs in relation to the write
> lock (before or during/after). And the write lock doesn't matter for
> case 2.
>
>>> And I can't get my head around memory ordering, it doesn't look sound.
>>> static_branch_enable() I don't think has acquire semantics, at least not
>>> in relation to the actual branch update. The static_branch_unlikely()
>>> test, again, not sure what guarantees it has (I don't think it has any
>>> in relation to memory loads). Maybe you have worked it all out and it's
>>> fine, but it needs a better explanation and ideally some simple formal
>>> model. Show it's correct with a global variable first and then we can
>>> optimise with static branches.
>> What do you suggest? As in, what do you mean by showing it's correct with
>> a global variable first... and, for the formal model thingy, do you
>> want mathematical rigor similar to what you explain in [1] :P, because unfortunately
>> (and quite embarrassingly) I didn't have formal methods in college : )
> Neither did I ;) (mostly analogue electronics). I was thinking something
> like our cat/herd tools where you can write some simple assembly. It's a
> good investment if you want to give it a try.


Will the following proof work?

Proof of correctness: the diagram below represents pud_free_pmd_page()
executing on P0 and ptdump executing on P1. Note that we can ignore
the situation where a task migrates to another CPU: migration adds
extra barriers via switch_to(), and all of the embedded barriers used
in the reasoning below apply to the inner-shareable domain and are
therefore observed by all CPUs. It thus suffices to prove the case
where pud_free_pmd_page() executes completely on P0 and ptdump
executes completely on P1.
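
For reference, here is a minimal C sketch of the synchronisation shape
the proof is about (key/lock names are placeholders, not the exact
symbols from the patch):

/* Sketch only; key/lock names are placeholders. */
static DEFINE_STATIC_KEY_FALSE(stb);		/* "x" in the diagram below */
static DECLARE_RWSEM(ptdump_lock);

/* P0: tail of pud_free_pmd_page() */
	pud_clear(pudp);			/* W_PUD = 0 */
	__flush_tlb_kernel_pgtable(addr);	/* dsb(ish) */
	if (static_branch_unlikely(&stb)) {	/* "if (x == 1)" */
		down_read(&ptdump_lock);	/* read lock */
		up_read(&ptdump_lock);		/* read unlock */
	}
	pmd_free(NULL, table);			/* Free PUD */

/* P1: ptdump */
	static_branch_enable(&stb);		/* x = 1, incl. kick_all_cpus_sync() */
	down_write(&ptdump_lock);		/* write lock */
	/* walk the page tables: R_PUD */
	up_write(&ptdump_lock);			/* write unlock */
	static_branch_disable(&stb);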

Let t_i, 0 <= i <= 8, denote the *global* timestamp at which the
corresponding instruction completes, so from here on we do not need to
use the term "observe" in a relative sense. Let t_i' (t_i prime) denote
the global timestamp at which the corresponding instruction starts; that
is, an instruction labelled t_i starts at t_i' and finishes at t_i. In
the diagram, x denotes the static branch.


P0 (pud_free_pmd_page)			P1 (ptdump)

W_PUD = 0: t0				x = 1: t2

if (x == 1) {: t7			write lock: t3
	read lock: t6			R_PUD = 1: t4
	read unlock: t8			write unlock: t5
}

Free PUD: t1

We consider the case R_PUD = 1, i.e. ptdump reads a valid pud and may
go on to dereference the table; otherwise ptdump never touches the
table and there is nothing to prove. We need to prove that ptdump
completely finishes before we free the PUD. Since the write unlock has
release semantics, if the write unlock has finished, ptdump is
guaranteed to have finished => it suffices to prove that t5 < t1'.


R_PUD = 1 means the load returned the old value, so it completed
before the store W_PUD = 0 did => t4 < t0 .... (i)

Because of the acquire semantics of down_write() on the rw_semaphore,
t3 < t4' .... (ii)

(i) and (ii) (and t4' < t4) => t3 < t0 ... (iii)

ptdump executes on a single kernel thread, which implies that the
transition x = 1 -> x = 1 can never happen; that is, when
static_branch_enable() is executed, x was 0, which means that the call
chain static_key_enable -> static_key_enable_cpuslocked ->
jump_label_update -> jump_label_can_update / arch_jump_label_transform_apply
-> kick_all_cpus_sync -> smp_mb -> __smp_mb -> dmb(ish) will always be
followed. The emission of dmb(ish) => t2 < t3 ... (iv)

(iii) and (iv) => t2 < t0. Also, the dsb(ish) in __flush_tlb_kernel_pgtable ensures
that t0 < t7' => t2 < t7' => the static branch is always observed by P0 => t6 and
t8 exist.

Now, t0 < t6' because of __flush_tlb_kernel_pgtable; combining with (iii), this gives
us t3 < t6' => the write lock is taken first => t5 < t6 (the read lock cannot
be acquired before the write unlock finishes) ... (v)

The PUD is freed only after the read unlock returns (program order)
=> t8 < t1' ... (vi)
Also, trivially, t6 < t8 ... (vii)

Combining (v), (vi) and (vii): t5 < t6 < t8 < t1'. Q.E.D.
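
Also, taking up your cat/herd suggestion: with the static branch treated
as a plain variable, case 2 (P0 missing the branch update) reduces to the
classic SB shape, so a herd7 litmus test along the following lines (the
DMB ISH on P1 standing in for the smp_mb() in kick_all_cpus_sync())
should show the bad outcome is forbidden:

AArch64 SB+dsb.ish+dmb.ish
"pud_free_pmd_page (P0) vs ptdump (P1), static branch as a variable"
{
pud=1;
0:X1=pud; 0:X3=stb;
1:X1=stb; 1:X3=pud;
}
 P0           | P1           ;
 MOV W0, #0   | MOV W0, #1   ;
 STR W0, [X1] | STR W0, [X3] ;
 DSB ISH      | DMB ISH      ;
 LDR W2, [X3] | LDR W2, [X1] ;
exists (0:X2=0 /\ 1:X2=1)

If my reading of the model is right, herd7 should never validate the
exists clause: whenever P0 reads stb == 0, P1 must read pud == 0.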

