Message-ID: <aFRlDSZ2PPnHixjc@arm.com>
Date: Thu, 19 Jun 2025 20:29:17 +0100
From: Catalin Marinas <catalin.marinas@....com>
To: Mikołaj Lenczewski <miko.lenczewski@....com>
Cc: ryan.roberts@....com, yang@...amperecomputing.com, will@...nel.org,
jean-philippe@...aro.org, robin.murphy@....com, joro@...tes.org,
maz@...nel.org, oliver.upton@...ux.dev, joey.gouly@....com,
james.morse@....com, broonie@...nel.org, ardb@...nel.org,
baohua@...nel.org, suzuki.poulose@....com, david@...hat.com,
jgg@...pe.ca, nicolinc@...dia.com, jsnitsel@...hat.com,
mshavit@...gle.com, kevin.tian@...el.com,
linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
iommu@...ts.linux.dev
Subject: Re: [PATCH v7 4/4] arm64/mm: Elide tlbi in contpte_convert() under
BBML2
On Tue, Jun 17, 2025 at 09:51:04AM +0000, Mikołaj Lenczewski wrote:
> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> index bcac4f55f9c1..203357061d0a 100644
> --- a/arch/arm64/mm/contpte.c
> +++ b/arch/arm64/mm/contpte.c
> @@ -68,7 +68,144 @@ static void contpte_convert(struct mm_struct *mm, unsigned long addr,
> pte = pte_mkyoung(pte);
> }
>
> - __flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
> + /*
> + * On eliding the __flush_tlb_range() under BBML2+noabort:
> + *
> + * NOTE: Instead of using N=16 as the contiguous block length, we use
> + * N=4 for clarity.
> + *
> + * NOTE: 'n' and 'c' are used to denote the "contiguous bit" being
> + * unset and set, respectively.
> + *
> + * We worry about two cases where the contiguous bit is used:
> + * - When folding N smaller non-contiguous ptes as 1 contiguous block.
> + * - When unfolding a contiguous block into N smaller non-contiguous ptes.
> + *
> + * Currently, the BBML0 folding case looks as follows:
> + *
> + * 0) Initial page-table layout:
> + *
> + * +----+----+----+----+
> + * |RO,n|RO,n|RO,n|RW,n| <--- last page being set as RO
> + * +----+----+----+----+
> + *
> + * 1) Aggregate AF + dirty flags using __ptep_get_and_clear():
> + *
> + * +----+----+----+----+
> + * | 0 | 0 | 0 | 0 |
> + * +----+----+----+----+
> + *
> + * 2) __flush_tlb_range():
> + *
> + * |____ tlbi + dsb ____|
> + *
> + * 3) __set_ptes() to repaint contiguous block:
> + *
> + * +----+----+----+----+
> + * |RO,c|RO,c|RO,c|RO,c|
> + * +----+----+----+----+
From the initial layout to point (3), we are also changing the
permission. Given the rules you mentioned in the Arm ARM, I think that's
safe (the hardware sees either the old or the new attributes). The
FEAT_BBM description, however, only talks about changes between larger
and smaller blocks and makes no mention of also changing the attributes
at the same time. Hopefully the microarchitects claiming that certain
CPUs don't generate conflict aborts understood what Linux does.
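
For reference, steps (1)-(3) of the comment correspond to code along
these lines. This is a condensed sketch of contpte_convert() from
memory, not the verbatim source: it assumes the caller has already
aligned addr/ptep down to the contpte block and has already set (fold)
or cleared (unfold) PTE_CONT in 'pte':

static void contpte_convert(struct mm_struct *mm, unsigned long addr,
			    pte_t *ptep, pte_t pte)
{
	struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
	unsigned long start_addr = addr;
	pte_t *start_ptep = ptep;
	int i;

	/* Step 1: clear the N ptes, aggregating AF/dirty into 'pte'. */
	for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE) {
		pte_t ptent = __ptep_get_and_clear(mm, addr, ptep);

		if (pte_dirty(ptent))
			pte = pte_mkdirty(pte);
		if (pte_young(ptent))
			pte = pte_mkyoung(pte);
	}

	/* Step 2: the tlbi+dsb that this patch elides under BBML2+noabort. */
	__flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);

	/* Step 3: repaint the whole range in one go. */
	__set_ptes(mm, start_addr, start_ptep, pte, CONT_PTES);
}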
> + *
> + * 4) The kernel will eventually __flush_tlb() for the changed page:
> + *
> + * |____| <--- tlbi + dsb
[...]
> + * It is also important to note that at the end of the BBML2 folding
> + * case, we are potentially left with all N TLB entries still cached
> + * (the N-1 non-contiguous ptes, and the single contiguous block).
> + * However, over time, natural TLB pressure will cause the
> + * non-contiguous pte TLB entries to be evicted, leaving only the
> + * contiguous block TLB entry. This means that omitting the tlbi+dsb is
> + * not only correct, but also preserves our eventual performance benefits.
Step 4 above implies that the core code eventually does some TLB
flushing. Which situation does the paragraph above describe? Does it
only apply in the window before the core code issues that flush?
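
To make the question concrete: the step-(4) flush I read this as
relying on is the one a core-mm caller issues after the pte updates,
roughly as below. This is a simplified, hypothetical sketch of the
mprotect() path, with names only indicative:

	struct mmu_gather tlb;

	tlb_gather_mmu(&tlb, mm);
	/* Updates the ptes; may fold/unfold contpte blocks on the way. */
	change_protection(&tlb, vma, start, end, cp_flags);
	/* Step (4): the deferred tlbi+dsb for the changed range. */
	tlb_finish_mmu(&tlb);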
[...]
> + if (!system_supports_bbml2_noabort())
> + __flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
>
> __set_ptes(mm, start_addr, start_ptep, pte, CONT_PTES);
Eliding the TLBI here is all good but, looking at the overall
set_ptes(), why do we bother with unfold+fold for BBML2 at all? Can we
not just change the ptes in place without going through
__ptep_get_and_clear()?
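
I.e., something along these lines. A hypothetical, untested sketch: it
keeps the AF/dirty aggregation, assumes helpers like __ptep_get(),
__set_pte() and pte_advance_pfn(), and relies on BBML2+noabort
tolerating the intermediate mix of contiguous and non-contiguous
entries:

	int i;

	/* Gather AF/dirty from the live ptes without tearing them down. */
	for (i = 0; i < CONT_PTES; i++) {
		pte_t old = __ptep_get(start_ptep + i);

		if (pte_dirty(old))
			pte = pte_mkdirty(pte);
		if (pte_young(old))
			pte = pte_mkyoung(pte);
	}

	/*
	 * Repaint in place: each write is a valid->valid change, which
	 * BBML2+noabort should tolerate. NB: HW DBM could still dirty a
	 * pte between the two loops, so the aggregation would need care.
	 */
	for (i = 0; i < CONT_PTES; i++)
		__set_pte(start_ptep + i, pte_advance_pfn(pte, i));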
--
Catalin