Message-ID: <aMk8QhkumtEoPVTh@willie-the-truck>
Date: Tue, 16 Sep 2025 11:30:26 +0100
From: Will Deacon <will@...nel.org>
To: Dev Jain <dev.jain@....com>
Cc: catalin.marinas@....com, anshuman.khandual@....com,
quic_zhenhuah@...cinc.com, ryan.roberts@....com,
kevin.brodsky@....com, yangyicong@...ilicon.com, joey.gouly@....com,
linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
david@...hat.com, mark.rutland@....com, urezki@...il.com,
jthoughton@...gle.com
Subject: Re: [RESEND PATCH v5] arm64: Enable vmalloc-huge with ptdump

Hi Dev,

On Wed, Jul 23, 2025 at 09:48:27PM +0530, Dev Jain wrote:
> @@ -1301,16 +1319,76 @@ int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
> }
>
> table = pmd_offset(pudp, addr);
> +
> + /*
> + * Our objective is to prevent ptdump from reading a PMD table which has
> + * been freed. Assume that ptdump_walk_pgd() (call this thread T1)
> + * executes completely on CPU1 and pud_free_pmd_page() (call this thread
> + * T2) executes completely on CPU2. Let the region sandwiched by the
> + * mmap_write_lock/unlock in T1 be called CS (the critical section).
> + *
> + * Claim: The CS of T1 will never operate on a freed PMD table.
> + *
> + * Proof:
> + *
> + * Case 1: The static branch is visible to T2.
> + *
> + * Case 1 (a): T1 acquires the lock before T2 can.
> + * T2 will block until T1 drops the lock, so pmd_free() will only be
> + * executed after T1 exits CS.
> + *
> + * Case 1 (b): T2 acquires the lock before T1 can.
> + * The sequence of barriers issued in __flush_tlb_kernel_pgtable()
> + * ensures that an empty PUD (via pud_clear()) is visible to T1 before
> + * T1 can enter CS, therefore it is impossible for the CS to get hold
> + * of the address of the isolated PMD table.
> + *
> + * Case 2: The static branch is not visible to T2.
> + *
> + * Since static_branch_enable() (via dmb(ish)) and mmap_write_lock()
> + * have acquire semantics, it is guaranteed that the static branch
> + * will be visible to all CPUs before T1 can enter CS. The static
> + * branch not being visible to T2 therefore guarantees that T1 has
> + * not yet entered CS .... (i)
> + * The sequence of barriers via __flush_tlb_kernel_pgtable() in T2
> + * implies that if the invisibility of the static branch has been
> + * observed by T2 (i.e. static_branch_unlikely() is observed as false),
> + * then all CPUs will have observed an empty PUD ... (ii)
> + * Combining (i) and (ii), we conclude that T1 observes an empty PUD
> + * before entering CS => it is impossible for the CS to get hold of
> + * the address of the isolated PMD table. Q.E.D.
> + *
> + * We have proven that the claim is true on the assumption that
> + * there is no context switch for T1 and T2. Note that the reasoning
> + * of the proof uses barriers operating on the inner shareable domain,
> + * which means that they will affect all CPUs, and also a context switch
> + * will insert extra barriers into the code paths => the claim will
> + * stand true even if we drop the assumption.
> + *
> + * It is also worth reasoning whether something can go wrong via
> + * pud_free_pmd_page() -> __pmd_free_pte_page(), since the latter
> + * will be called locklessly on this code path.
> + *
> + * For Case 1 (a), T2 will block until CS is finished, so we are safe.
> + * For Case 1 (b) and Case 2, the PMD table will be isolated before
> + * T1 can enter CS, therefore it is safe for T2 to operate on the
> + * PMD table locklessly.
> + */

Although I can see that you put a lot of effort into this comment, I
think we should just remove it. Instead, we should have a litmus test
in the commit message and probably just some small comments here to
explain e.g. why the mmap_read_lock() critical section is empty.

I'm currently trying to put together a litmus test with James (cc'd) so
maybe we can help you out with that part.
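
For the sake of discussion, I'd expect the interesting case to boil down
to a store-buffering shape along these lines. This is very much a sketch:
the variable names are made up, and smp_mb() is only standing in for the
dmb(ish) in static_branch_enable() and the barriers issued by
__flush_tlb_kernel_pgtable():

C arm64-ptdump-pud-free

(*
 * P0 models the ptdump walker enabling the static key and then walking;
 * P1 models pud_free_pmd_page() clearing the PUD and then checking the
 * key. pud==1 means the PUD still points at the PMD table.
 *)

{
        pud=1;
}

P0(int *key, int *pud)
{
        int r0;

        WRITE_ONCE(*key, 1);    /* static_branch_enable() */
        smp_mb();               /* stand-in for the key/lock ordering */
        r0 = READ_ONCE(*pud);   /* walker loads the PUD inside the CS */
}

P1(int *key, int *pud)
{
        int r1;

        WRITE_ONCE(*pud, 0);    /* pud_clear() */
        smp_mb();               /* stand-in for __flush_tlb_kernel_pgtable() */
        r1 = READ_ONCE(*key);   /* static_branch_unlikely() */
}

exists (0:r0=1 /\ 1:r1=0)

The "exists" clause is the bad outcome (the walker still sees the old PMD
table while the freeing path has skipped the lock), which should be
unreachable if the ordering on both sides really is strong enough -- that
is exactly the sort of thing the test would let herd7 check.
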
In the meantime...
> diff --git a/arch/arm64/mm/ptdump.c b/arch/arm64/mm/ptdump.c
> index 421a5de806c6..65335c7ba482 100644
> --- a/arch/arm64/mm/ptdump.c
> +++ b/arch/arm64/mm/ptdump.c
> @@ -283,6 +283,13 @@ void note_page_flush(struct ptdump_state *pt_st)
> note_page(pt_st, 0, -1, pte_val(pte_zero));
> }
>
> +static void arm64_ptdump_walk_pgd(struct ptdump_state *st, struct mm_struct *mm)
> +{
> + static_branch_enable(&arm64_ptdump_lock_key);
> + ptdump_walk_pgd(st, mm, NULL);
> + static_branch_disable(&arm64_ptdump_lock_key);
> +}

What serialises the toggling of the static key here? For example, I can't
see what prevents a kernel page-table dump running concurrently with a
check_wx() and the key ending up in the wrong state.
Either we need an additional lock or perhaps using
static_branch_{inc,dec}() would work instead? I haven't thought too hard
about that but it looks like we need _something_.
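
FWIW, the inc/dec version would just be something like this (untested,
purely to illustrate the shape):

static void arm64_ptdump_walk_pgd(struct ptdump_state *st, struct mm_struct *mm)
{
        /*
         * The refcounted inc/dec keeps the key enabled for as long as
         * at least one walker is active, so a concurrent check_wx()
         * and ptdump run can't flip the key off under each other.
         */
        static_branch_inc(&arm64_ptdump_lock_key);
        ptdump_walk_pgd(st, mm, NULL);
        static_branch_dec(&arm64_ptdump_lock_key);
}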

Will