Message-ID: <fb0d2181-4e91-4c1a-9410-8729cbba0dfe@arm.com>
Date: Mon, 28 Apr 2025 17:17:20 +0100
From: Ryan Roberts <ryan.roberts@....com>
To: Mikołaj Lenczewski <miko.lenczewski@....com>,
suzuki.poulose@....com, yang@...amperecomputing.com, corbet@....net,
catalin.marinas@....com, will@...nel.org, jean-philippe@...aro.org,
robin.murphy@....com, joro@...tes.org, akpm@...ux-foundation.org,
paulmck@...nel.org, mark.rutland@....com, joey.gouly@....com,
maz@...nel.org, james.morse@....com, broonie@...nel.org,
oliver.upton@...ux.dev, baohua@...nel.org, david@...hat.com,
ioworker0@...il.com, jgg@...pe.ca, nicolinc@...dia.com, mshavit@...gle.com,
jsnitsel@...hat.com, smostafa@...gle.com, kevin.tian@...el.com,
linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-arm-kernel@...ts.infradead.org, iommu@...ts.linux.dev
Subject: Re: [RESEND PATCH v6 3/3] arm64/mm: Elide tlbi in contpte_convert()
under BBML2
On 28/04/2025 16:35, Mikołaj Lenczewski wrote:
> When converting a region via contpte_convert() to use mTHP, we have two
> different goals. We have to mark each entry as contiguous, and we would
> like to smear the dirty and young (access) bits across all entries in
> the contiguous block. Currently, we do this by first accumulating the
> dirty and young bits in the block, using an atomic
> __ptep_get_and_clear() and the relevant pte_{dirty,young}() calls,
> performing a tlbi, and finally smearing the correct bits across the
> block using __set_ptes().
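For anyone less familiar with the contpte code, the sequence described
above is roughly the following (a simplified sketch of the pre-patch flow,
not the literal kernel source; the alignment of addr/ptep and the pfn to
the start of the contiguous block is omitted):

	/* 1) Clear each entry, accumulating dirty/young into the template pte */
	for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE) {
		pte_t ptent = __ptep_get_and_clear(mm, addr, ptep);

		if (pte_dirty(ptent))
			pte = pte_mkdirty(pte);
		if (pte_young(ptent))
			pte = pte_mkyoung(pte);
	}

	/* 2) Invalidate any translations cached while the old entries were live */
	__flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);

	/* 3) Repaint the whole block with the new (folded or unfolded) encoding */
	__set_ptes(mm, start_addr, start_ptep, pte, CONT_PTES);

Step 2) is the intermediate tlbi+dsb that this patch makes conditional.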
>
> This approach works fine for BBM level 0, but with support for BBM level
> 2 we are allowed to reorder the tlbi to after setting the page table
> entries. We expect the time cost of a tlbi to be much greater than the
> cost of clearing and resetting the PTEs. As such, reordering the tlbi to
> outside the window where our PTEs are invalid greatly reduces the
> duration for which the PTEs are visibly invalid to other threads. This
> reduces the likelihood of a concurrent page walk finding an invalid PTE,
> reducing the likelihood of a fault in other threads, and improving
> performance (more so when there are more threads).
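Concretely, the reordering described here amounts to swapping steps 2)
and 3) of the sketch above, so that the window in which the PTEs are
invalid no longer includes the flush (still only a conceptual sketch; as
the next paragraph and the hunk below explain, the patch goes further and
drops the flush entirely on BBML2+noabort systems):

	/* ... clear and accumulate dirty/young as before ... */

	/* Repaint the block immediately, shrinking the invalid window */
	__set_ptes(mm, start_addr, start_ptep, pte, CONT_PTES);

	/* Only now invalidate the stale translations */
	__flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);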
>
> Because we support, via an allowlist, only BBML2 implementations that
> never raise conflict aborts and instead invalidate the TLB entries
> automatically in hardware, we can avoid the final flush altogether.
>
> However, avoiding the intermediate tlbi+dsb must be carefully considered
> to ensure that we remain both correct and performant. We document our
> reasoning and the expected interactions further in the contpte_convert()
> source. There we rely on the AArch64 spec (DDI 0487L.a D8.7.1.1)
> requirements RNGLXZ and RJQQTC, which guarantee that the elision is
> correct.
>
> Signed-off-by: Mikołaj Lenczewski <miko.lenczewski@....com>
Reviewed-by: Ryan Roberts <ryan.roberts@....com>
It would be great to see this series merged this cycle!
Thanks,
Ryan
> ---
> arch/arm64/mm/contpte.c | 139 +++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 138 insertions(+), 1 deletion(-)
>
> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> index bcac4f55f9c1..203357061d0a 100644
> --- a/arch/arm64/mm/contpte.c
> +++ b/arch/arm64/mm/contpte.c
> @@ -68,7 +68,144 @@ static void contpte_convert(struct mm_struct *mm, unsigned long addr,
> pte = pte_mkyoung(pte);
> }
>
> - __flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
> + /*
> + * On eliding the __tlb_flush_range() under BBML2+noabort:
> + *
> + * NOTE: Instead of using N=16 as the contiguous block length, we use
> + * N=4 for clarity.
> + *
> + * NOTE: 'n' and 'c' are used to denote the "contiguous bit" being
> + * unset and set, respectively.
> + *
> + * We worry about two cases where the contiguous bit is used:
> + * - When folding N smaller non-contiguous ptes as 1 contiguous block.
> + * - When unfolding a contiguous block into N smaller non-contiguous ptes.
> + *
> + * Currently, the BBML0 folding case looks as follows:
> + *
> + * 0) Initial page-table layout:
> + *
> + * +----+----+----+----+
> + * |RO,n|RO,n|RO,n|RW,n| <--- last page being set as RO
> + * +----+----+----+----+
> + *
> + * 1) Aggregate AF + dirty flags using __ptep_get_and_clear():
> + *
> + * +----+----+----+----+
> + * | 0 | 0 | 0 | 0 |
> + * +----+----+----+----+
> + *
> + * 2) __flush_tlb_range():
> + *
> + * |____ tlbi + dsb ____|
> + *
> + * 3) __set_ptes() to repaint contiguous block:
> + *
> + * +----+----+----+----+
> + * |RO,c|RO,c|RO,c|RO,c|
> + * +----+----+----+----+
> + *
> + * 4) The kernel will eventually __flush_tlb() for the changed page:
> + *
> + * |____| <--- tlbi + dsb
> + *
> + * As expected, the intermediate tlbi+dsb ensures that other PEs
> + * only ever see an invalid (0) entry, or the new contiguous TLB entry.
> + * The final tlbi+dsb will always throw away the newly installed
> + * contiguous TLB entry, which is a micro-optimisation opportunity,
> + * but does not affect correctness.
> + *
> + * In the BBML2 case, the change is to avoid the intermediate tlbi+dsb.
> + * This means a few things, but notably other PEs will still "see" any
> + * stale cached TLB entries. This could lead to a "contiguous bit
> + * misprogramming" issue until the final tlbi+dsb of the changed page,
> + * which would clear out both the stale (RW,n) entry and the new (RO,c)
> + * contiguous entry installed in its place.
> + *
> + * What this is saying is the following:
> + *
> + * +----+----+----+----+
> + * |RO,n|RO,n|RO,n|RW,n| <--- old page tables, all non-contiguous
> + * +----+----+----+----+
> + *
> + * +----+----+----+----+
> + * |RO,c|RO,c|RO,c|RO,c| <--- new page tables, all contiguous
> + * +----+----+----+----+
> + * /\
> + * ||
> + *
> + * If both the old single (RW,n) and new contiguous (RO,c) TLB entries
> + * are present, and a write is made to this address, do we fault or
> + * is the write permitted (via amalgamation)?
> + *
> + * The relevant Arm ARM DDI 0487L.a requirements are RNGLXZ and RJQQTC,
> + * which together state that when BBML1 or BBML2 is implemented, either
> + * a TLB conflict abort is raised (which we expressly forbid), or the
> + * hardware will "produce an OA, access permissions, and memory attributes
> + * that are consistent with any of the programmed translation table values".
> + *
> + * That is to say, the hardware will either raise a TLB conflict abort,
> + * or produce one of the cached TLB entries, but never amalgamate them.
> + *
> + * Thus, as the page tables are only considered "consistent" after
> + * the final tlbi+dsb (which evicts both the single stale (RW,n) TLB
> + * entry as well as the new contiguous (RO,c) TLB entry), omitting the
> + * initial tlbi+dsb is correct.
> + *
> + * It is also important to note that at the end of the BBML2 folding
> + * case, we are still left with potentially all N TLB entries still
> + * cached (the N-1 non-contiguous ptes, and the single contiguous
> + * block). However, over time, natural TLB pressure will cause the
> + * non-contiguous pte TLB entries to be flushed, leaving only the
> + * contiguous block TLB entry. This means that omitting the tlbi+dsb is
> + * not only correct, but also keeps our eventual performance benefits.
> + *
> + * For the unfolding case, BBML0 looks as follows:
> + *
> + * 0) Initial page-table layout:
> + *
> + * +----+----+----+----+
> + * |RW,c|RW,c|RW,c|RW,c| <--- last page being set as RO
> + * +----+----+----+----+
> + *
> + * 1) Aggregate AF + dirty flags using __ptep_get_and_clear():
> + *
> + * +----+----+----+----+
> + * | 0 | 0 | 0 | 0 |
> + * +----+----+----+----+
> + *
> + * 2) __flush_tlb_range():
> + *
> + * |____ tlbi + dsb ____|
> + *
> + * 3) __set_ptes() to repaint as non-contiguous:
> + *
> + * +----+----+----+----+
> + * |RW,n|RW,n|RW,n|RW,n|
> + * +----+----+----+----+
> + *
> + * 4) Update changed page permissions:
> + *
> + * +----+----+----+----+
> + * |RW,n|RW,n|RW,n|RO,n| <--- last page permissions set
> + * +----+----+----+----+
> + *
> + * 5) The kernel will eventually __flush_tlb() for the changed page:
> + *
> + * |____| <--- tlbi + dsb
> + *
> + * For BBML2, we again remove the intermediate tlbi+dsb. Here, there
> + * are no issues, as the final tlbi+dsb covering the changed page is
> + * guaranteed to remove the original large contiguous (RW,c) TLB entry,
> + * as well as the intermediate (RW,n) TLB entry; the next access will
> + * install the new (RO,n) TLB entry and the page tables are only
> + * considered "consistent" after the final tlbi+dsb, so software must
> + * be prepared for this inconsistency prior to finishing the mm dance
> + * regardless.
> + */
> +
> + if (!system_supports_bbml2_noabort())
> + __flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
>
> __set_ptes(mm, start_addr, start_ptep, pte, CONT_PTES);
> }