Message-ID: <aIpPsg1Kij8Knnjl@arm.com>
Date: Wed, 30 Jul 2025 18:00:34 +0100
From: Catalin Marinas <catalin.marinas@....com>
To: Dev Jain <dev.jain@....com>
Cc: will@...nel.org, anshuman.khandual@....com, quic_zhenhuah@...cinc.com,
	ryan.roberts@....com, kevin.brodsky@....com,
	yangyicong@...ilicon.com, joey.gouly@....com,
	linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
	david@...hat.com, mark.rutland@....com, urezki@...il.com
Subject: Re: [RESEND PATCH v5] arm64: Enable vmalloc-huge with ptdump

On Wed, Jul 23, 2025 at 09:48:27PM +0530, Dev Jain wrote:
> diff --git a/arch/arm64/include/asm/vmalloc.h b/arch/arm64/include/asm/vmalloc.h
> index 12f534e8f3ed..e835fd437ae0 100644
> --- a/arch/arm64/include/asm/vmalloc.h
> +++ b/arch/arm64/include/asm/vmalloc.h
> @@ -12,15 +12,13 @@ static inline bool arch_vmap_pud_supported(pgprot_t prot)
>  	/*
>  	 * SW table walks can't handle removal of intermediate entries.
>  	 */
> -	return pud_sect_supported() &&
> -	       !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
> +	return pud_sect_supported();

Does the "SW table walks..." comment still make sense?

> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 00ab1d648db6..49932c1dd34e 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
[...]
> @@ -1301,16 +1319,76 @@ int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
>  	}
>  
>  	table = pmd_offset(pudp, addr);
> +
> +	/*
> +	 * Our objective is to prevent ptdump from reading a PMD table which has
> +	 * been freed.  Assume that ptdump_walk_pgd() (call this thread T1)
> +	 * executes completely on CPU1 and pud_free_pmd_page() (call this thread
> +	 * T2) executes completely on CPU2. Let the region sandwiched by the

I think "executes completely on CPU*" is confusing. Just talk about
threads as they can be migrated between CPUs and the memory model is
preserved.

> +	 * mmap_write_lock/unlock in T1 be called CS (the critical section).
> +	 *
> +	 * Claim: The CS of T1 will never operate on a freed PMD table.
> +	 *
> +	 * Proof:
> +	 *
> +	 * Case 1: The static branch is visible to T2.
> +	 *
> +	 * Case 1 (a): T1 acquires the lock before T2 can.
> +	 * T2 will block until T1 drops the lock, so pmd_free() will only be
> +	 * executed after T1 exits CS.

This assumes that there is some ordering between the unlock and
pmd_free() (e.g. some poisoning of the old page). The unlock only gives
us release semantics, not acquire. It just happens that we have an
atomic dec-and-test down the __free_pages() path, but I'm not convinced
we should rely on that unless free_pages() has clearly defined ordering
semantics with respect to prior memory writes.
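Something along these lines (rough and untested; key name borrowed from
the litmus test further down, the rest of pud_free_pmd_page() elided)
would make the ordering explicit instead of relying on the atomics
buried in the __free_pages() path:

	pud_clear(pudp);
	__flush_tlb_kernel_pgtable(addr);

	if (static_branch_unlikely(&ptdump_lock_key)) {	/* made-up name */
		mmap_read_lock(&init_mm);
		mmap_read_unlock(&init_mm);
	}

	/* order the PUD clear and the unlock before freeing the table */
	smp_mb();
	pmd_free(&init_mm, table);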

> +	 *
> +	 * Case 1 (b): T2 acquires the lock before T1 can.
> +	 * The sequence of barriers issued in __flush_tlb_kernel_pgtable()
> +	 * ensures that an empty PUD (via pud_clear()) is visible to T1 before
> +	 * T1 can enter CS, therefore it is impossible for the CS to get hold
> +	 * of the address of the isolated PMD table.

Is this the effect of the barriers in the TLB flushing or the release
semantics of the unlock?

> +	 *
> +	 * Case 2: The static branch is not visible to T2.
> +	 *
> +	 * Since static_branch_enable() (via dmb(ish)) and mmap_write_lock()
> +	 * have acquire semantics, it is guaranteed that the static branch

Does static_branch_disable() have release semantics? I think it does via
some atomic_cmpxchg() but it's worth spelling it out.
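FWIW, from memory (so take the exact code with a grain of salt),
static_key_disable_cpuslocked() does something like:

	jump_label_lock();
	if (atomic_cmpxchg(&key->enabled, 1, 0) == 1)
		jump_label_update(key);
	jump_label_unlock();

and a fully ordered atomic_cmpxchg() would indeed give us the release
semantics, but the comment should spell that out rather than leave it
implied.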

> +	 * will be visible to all CPUs before T1 can enter CS. The static

As I mentioned on a previous version, this visibility is not absolute.
You could say that it will be observed by other CPUs before they observe
the write_lock.

> +	 * branch not being visible to T2 therefore guarantees that T1 has
> +	 * not yet entered CS .... (i)
> +	 * The sequence of barriers via __flush_tlb_kernel_pgtable() in T2
> +	 * implies that if the invisibility of the static branch has been
> +	 * observed by T2 (i.e static_branch_unlikely() is observed as false),
> +	 * then all CPUs will have observed an empty PUD ... (ii)
> +	 * Combining (i) and (ii), we conclude that T1 observes an empty PUD
> +	 * before entering CS => it is impossible for the CS to get hold of
> +	 * the address of the isolated PMD table. Q.E.D

OK, I nearly got lost. That's not straightforward memory ordering, as
we have instruction updates with ISB as part of the TLB flushing. I
concluded last time I looked that the branch patching part works
because we also have kick_all_cpus_sync() as part of the static branch
update.

Stepping back to a simpler model, let's say that the static key is a
variable. I wrote this quick test for herd7 and the Linux kernel memory
model (fairly easy to play with):

-------------------------------------8<---------------------------------------
C pte_free_ptdump

(*
 * $ cd tools/memory-model
 * $ herd7 -conf linux-kernel.cfg path/to/pte_free_ptdump.litmus
 *)

{
	pmd = 1;				(* allocated pmd *)
	pte_page = 1;				(* valid pte page pointed at by the pmd *)
}

// pmd_free_pte_page()
P0(spinlock_t *init_mm, int *ptdump_lock_key, int *pmd, int *pte_page)
{
	WRITE_ONCE(*pmd, 0);			// pmd_clear()
	smp_mb();
	if (READ_ONCE(*ptdump_lock_key)) {	// static_branch() approximation
		spin_lock(init_mm);		// mmap_read_lock()
		spin_unlock(init_mm);		// mmap_read_unlock()
	}
	smp_mb();
	WRITE_ONCE(*pte_page, 0);		// pte_free_kernel()
}

// ptdump_walk_pgd()
P1(spinlock_t *init_mm, int *ptdump_lock_key, int *pmd, int *pte_page)
{
	int val;
	int page;

	WRITE_ONCE(*ptdump_lock_key, 1);	// static_branch_enable()
	smp_mb();
	spin_lock(init_mm);			// mmap_write_lock()
	val = READ_ONCE(*pmd);
	page = READ_ONCE(*pte_page);		// freed pte page?
	spin_unlock(init_mm);			// mmap_write_unlock()
	smp_mb();
	WRITE_ONCE(*ptdump_lock_key, 0);	// static_branch_disable()
}

exists(1:val=1 /\ 1:page=0)
-------------------------------------8<---------------------------------------

I sprinkled in some necessary smp_mb() calls, though in most cases we
already have release/acquire semantics. The test does show that we need
a barrier before the page freeing. We also need acquire semantics for
enabling the static key and release semantics for disabling it.
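As a quick sanity check on that last point, here's the same test with
the post-unlock smp_mb() in P1 replaced by a release store (the
pre-lock smp_mb() stands in for the full barrier of the
atomic_cmpxchg() in static_key_enable()). Untested sketch, but I'd
expect the exists clause to remain forbidden:

-------------------------------------8<---------------------------------------
C pte_free_ptdump_relacq

{
	pmd = 1;				(* allocated pmd *)
	pte_page = 1;				(* valid pte page pointed at by the pmd *)
}

// pmd_free_pte_page()
P0(spinlock_t *init_mm, int *ptdump_lock_key, int *pmd, int *pte_page)
{
	WRITE_ONCE(*pmd, 0);			// pmd_clear()
	smp_mb();				// __flush_tlb_kernel_pgtable()
	if (READ_ONCE(*ptdump_lock_key)) {	// static_branch() approximation
		spin_lock(init_mm);		// mmap_read_lock()
		spin_unlock(init_mm);		// mmap_read_unlock()
	}
	smp_mb();				// barrier before the page freeing
	WRITE_ONCE(*pte_page, 0);		// pte_free_kernel()
}

// ptdump_walk_pgd()
P1(spinlock_t *init_mm, int *ptdump_lock_key, int *pmd, int *pte_page)
{
	int val;
	int page;

	WRITE_ONCE(*ptdump_lock_key, 1);	// static_branch_enable()
	smp_mb();				// atomic_cmpxchg() in static_key_enable()
	spin_lock(init_mm);			// mmap_write_lock()
	val = READ_ONCE(*pmd);
	page = READ_ONCE(*pte_page);		// freed pte page?
	spin_unlock(init_mm);			// mmap_write_unlock()
	smp_store_release(ptdump_lock_key, 0);	// static_branch_disable() w/ release
}

exists(1:val=1 /\ 1:page=0)
-------------------------------------8<---------------------------------------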

The next step is to assess whether replacing the static key variable
read/write with code patching preserves the model.

-- 
Catalin
