Message-ID: <e722e49a-d982-4b58-98f7-6fef3d0a4468@arm.com>
Date: Mon, 1 Sep 2025 10:34:25 +0530
From: Dev Jain <dev.jain@....com>
To: Ryan Roberts <ryan.roberts@....com>,
Catalin Marinas <catalin.marinas@....com>, Will Deacon <will@...nel.org>,
Andrew Morton <akpm@...ux-foundation.org>,
David Hildenbrand <david@...hat.com>,
Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
Yang Shi <yang@...amperecomputing.com>, Ard Biesheuvel <ardb@...nel.org>,
scott@...amperecomputing.com, cl@...two.org
Cc: linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
linux-mm@...ck.org
Subject: Re: [PATCH v7 0/6] arm64: support FEAT_BBM level 2 and large block
mapping when rodata=full
On 29/08/25 5:22 pm, Ryan Roberts wrote:
> Hi All,
>
> This is a new version following on from the v6 RFC at [1] which itself is based
> on Yang Shi's work. On systems with BBML2_NOABORT support, it causes the linear
> map to be mapped with large blocks, even when rodata=full, and leads to some
> nice performance improvements.
>
> I've tested this on an AmpereOne system (a VM with 12G RAM) in all 3 possible
> modes by hacking the BBML2 feature detection code:
>
> - mode 1: All CPUs support BBML2 so the linear map uses large mappings
> - mode 2: Boot CPU does not support BBML2 so linear map uses pte mappings
> - mode 3: Boot CPU supports BBML2 but secondaries do not so linear map
> initially uses large mappings but is then repainted to use pte mappings
>
> In all cases, mm selftests run and no regressions are observed. In all cases,
> ptdump of linear map is as expected:
>
> Mode 1:
> =======
> ---[ Linear Mapping start ]---
> 0xffff000000000000-0xffff000000200000 2M PMD RW NX SHD AF BLK UXN MEM/NORMAL-TAGGED
> 0xffff000000200000-0xffff000000210000 64K PTE RW NX SHD AF CON UXN MEM/NORMAL-TAGGED
> 0xffff000000210000-0xffff000000400000 1984K PTE ro NX SHD AF UXN MEM/NORMAL
> 0xffff000000400000-0xffff000002400000 32M PMD ro NX SHD AF BLK UXN MEM/NORMAL
> 0xffff000002400000-0xffff000002550000 1344K PTE ro NX SHD AF UXN MEM/NORMAL
> 0xffff000002550000-0xffff000002600000 704K PTE RW NX SHD AF CON UXN MEM/NORMAL-TAGGED
> 0xffff000002600000-0xffff000004000000 26M PMD RW NX SHD AF BLK UXN MEM/NORMAL-TAGGED
> 0xffff000004000000-0xffff000040000000 960M PMD RW NX SHD AF CON BLK UXN MEM/NORMAL-TAGGED
> 0xffff000040000000-0xffff000140000000 4G PUD RW NX SHD AF BLK UXN MEM/NORMAL-TAGGED
> 0xffff000140000000-0xffff000142000000 32M PMD RW NX SHD AF CON BLK UXN MEM/NORMAL-TAGGED
> 0xffff000142000000-0xffff000142120000 1152K PTE RW NX SHD AF CON UXN MEM/NORMAL-TAGGED
> 0xffff000142120000-0xffff000142128000 32K PTE RW NX SHD AF UXN MEM/NORMAL-TAGGED
> 0xffff000142128000-0xffff000142159000 196K PTE ro NX SHD AF UXN MEM/NORMAL-TAGGED
> 0xffff000142159000-0xffff000142160000 28K PTE RW NX SHD AF UXN MEM/NORMAL-TAGGED
> 0xffff000142160000-0xffff000142240000 896K PTE RW NX SHD AF CON UXN MEM/NORMAL-TAGGED
> 0xffff000142240000-0xffff00014224e000 56K PTE RW NX SHD AF UXN MEM/NORMAL-TAGGED
> 0xffff00014224e000-0xffff000142250000 8K PTE ro NX SHD AF UXN MEM/NORMAL-TAGGED
> 0xffff000142250000-0xffff000142260000 64K PTE RW NX SHD AF UXN MEM/NORMAL-TAGGED
> 0xffff000142260000-0xffff000142280000 128K PTE RW NX SHD AF CON UXN MEM/NORMAL-TAGGED
> 0xffff000142280000-0xffff000142288000 32K PTE RW NX SHD AF UXN MEM/NORMAL-TAGGED
> 0xffff000142288000-0xffff000142290000 32K PTE ro NX SHD AF UXN MEM/NORMAL-TAGGED
> 0xffff000142290000-0xffff0001422a0000 64K PTE RW NX SHD AF UXN MEM/NORMAL-TAGGED
> 0xffff0001422a0000-0xffff000142465000 1812K PTE ro NX SHD AF UXN MEM/NORMAL-TAGGED
> 0xffff000142465000-0xffff000142470000 44K PTE RW NX SHD AF UXN MEM/NORMAL-TAGGED
> 0xffff000142470000-0xffff000142600000 1600K PTE RW NX SHD AF CON UXN MEM/NORMAL-TAGGED
> 0xffff000142600000-0xffff000144000000 26M PMD RW NX SHD AF BLK UXN MEM/NORMAL-TAGGED
> 0xffff000144000000-0xffff000180000000 960M PMD RW NX SHD AF CON BLK UXN MEM/NORMAL-TAGGED
> 0xffff000180000000-0xffff000181a00000 26M PMD RW NX SHD AF BLK UXN MEM/NORMAL-TAGGED
> 0xffff000181a00000-0xffff000181b90000 1600K PTE RW NX SHD AF CON UXN MEM/NORMAL-TAGGED
> 0xffff000181b90000-0xffff000181b9d000 52K PTE RW NX SHD AF UXN MEM/NORMAL-TAGGED
> 0xffff000181b9d000-0xffff000181c80000 908K PTE ro NX SHD AF UXN MEM/NORMAL-TAGGED
> 0xffff000181c80000-0xffff000181c90000 64K PTE RW NX SHD AF UXN MEM/NORMAL-TAGGED
> 0xffff000181c90000-0xffff000181ca0000 64K PTE RW NX SHD AF CON UXN MEM/NORMAL-TAGGED
> 0xffff000181ca0000-0xffff000181dbd000 1140K PTE ro NX SHD AF UXN MEM/NORMAL-TAGGED
> 0xffff000181dbd000-0xffff000181dc0000 12K PTE RW NX SHD AF UXN MEM/NORMAL-TAGGED
> 0xffff000181dc0000-0xffff000181e00000 256K PTE RW NX SHD AF CON UXN MEM/NORMAL-TAGGED
> 0xffff000181e00000-0xffff000182000000 2M PMD RW NX SHD AF BLK UXN MEM/NORMAL-TAGGED
> 0xffff000182000000-0xffff0001c0000000 992M PMD RW NX SHD AF CON BLK UXN MEM/NORMAL-TAGGED
> 0xffff0001c0000000-0xffff000300000000 5G PUD RW NX SHD AF BLK UXN MEM/NORMAL-TAGGED
> 0xffff000300000000-0xffff008000000000 500G PUD
> 0xffff008000000000-0xffff800000000000 130560G PGD
> ---[ Linear Mapping end ]---
>
> Mode 3:
> =======
> ---[ Linear Mapping start ]---
> 0xffff000000000000-0xffff000000210000 2112K PTE RW NX SHD AF UXN MEM/NORMAL-TAGGED
> 0xffff000000210000-0xffff000000400000 1984K PTE ro NX SHD AF UXN MEM/NORMAL
> 0xffff000000400000-0xffff000002400000 32M PMD ro NX SHD AF BLK UXN MEM/NORMAL
> 0xffff000002400000-0xffff000002550000 1344K PTE ro NX SHD AF UXN MEM/NORMAL
> 0xffff000002550000-0xffff000143a61000 5264452K PTE RW NX SHD AF UXN MEM/NORMAL-TAGGED
> 0xffff000143a61000-0xffff000143c61000 2M PTE ro NX SHD AF UXN MEM/NORMAL-TAGGED
> 0xffff000143c61000-0xffff000181b9a000 1015012K PTE RW NX SHD AF UXN MEM/NORMAL-TAGGED
> 0xffff000181b9a000-0xffff000181d9a000 2M PTE ro NX SHD AF UXN MEM/NORMAL-TAGGED
> 0xffff000181d9a000-0xffff000300000000 6261144K PTE RW NX SHD AF UXN MEM/NORMAL-TAGGED
> 0xffff000300000000-0xffff008000000000 500G PUD
> 0xffff008000000000-0xffff800000000000 130560G PGD
> ---[ Linear Mapping end ]---
>
>
> Performance Testing
> ===================
>
> Yang Shi has gathered some compelling results which are detailed in the commit
> log for patch #3. Additionally I have run this through a random selection of
> benchmarks on AmpereOne. None show any regressions, and various benchmarks show
> statistically significant improvement. I'm just showing those improvements here:
>
> +----------------------+----------------------------------------------------------+-------------------------+
> | Benchmark            | Result Class                                             | Improvement vs 6.17-rc1 |
> +======================+==========================================================+=========================+
> | micromm/vmalloc      | full_fit_alloc_test: p:1, h:0, l:500000 (usec)           |              (I) -9.00% |
> |                      | kvfree_rcu_1_arg_vmalloc_test: p:1, h:0, l:500000 (usec) |              (I) -6.93% |
> |                      | kvfree_rcu_2_arg_vmalloc_test: p:1, h:0, l:500000 (usec) |              (I) -6.77% |
> |                      | pcpu_alloc_test: p:1, h:0, l:500000 (usec)               |              (I) -4.63% |
> +----------------------+----------------------------------------------------------+-------------------------+
> | mmtests/hackbench    | process-sockets-30 (seconds)                             |              (I) -2.96% |
> +----------------------+----------------------------------------------------------+-------------------------+
> | mmtests/kernbench    | syst-192 (seconds)                                       |             (I) -12.77% |
> +----------------------+----------------------------------------------------------+-------------------------+
> | pts/perl-benchmark   | Test: Interpreter (Seconds)                              |              (I) -4.86% |
> +----------------------+----------------------------------------------------------+-------------------------+
> | pts/pgbench          | Scale: 1 Clients: 1 Read Write (TPS)                     |               (I) 5.07% |
> |                      | Scale: 1 Clients: 1 Read Write - Latency (ms)            |              (I) -4.72% |
> |                      | Scale: 100 Clients: 1000 Read Write (TPS)                |               (I) 2.58% |
> |                      | Scale: 100 Clients: 1000 Read Write - Latency (ms)       |              (I) -2.52% |
> +----------------------+----------------------------------------------------------+-------------------------+
> | pts/sqlite-speedtest | Timed Time - Size 1,000 (Seconds)                        |              (I) -2.68% |
> +----------------------+----------------------------------------------------------+-------------------------+
>
>
> Changes since v6 [1]
> ====================
>
> - Patch 1: Minor refactor to implement walk_kernel_page_table_range() in terms
> of walk_kernel_page_table_range_lockless(). Also led to adding *pmd argument
> to the lockless variant for consistency (per Catalin).
> - Misc function/variable renames to improve clarity and consistency.
> - Share same synchronization flag between idmap_kpti_install_ng_mappings and
> wait_linear_map_split_to_ptes, which allows removal of bbml2_ptes[] to save
> ~20K from kernel image.
> - Only take pgtable_split_lock and enter lazy mmu mode once for both splits.
> - Only walk the pgtable once for the common "split single page" case.
> - Bypass split to contpmd and contpte when splitting linear map to ptes.
>
>
> Applies on v6.17-rc3.
>
>
> [1] https://lore.kernel.org/linux-arm-kernel/20250805081350.3854670-1-ryan.roberts@arm.com/
>
> Thanks,
> Ryan
>
> Dev Jain (1):
> arm64: Enable permission change on arm64 kernel block mappings
>
> Ryan Roberts (3):
> arm64: mm: Optimize split_kernel_leaf_mapping()
> arm64: mm: split linear mapping if BBML2 unsupported on secondary CPUs
> arm64: mm: Optimize linear_map_split_to_ptes()
>
> Yang Shi (2):
> arm64: cpufeature: add AmpereOne to BBML2 allow list
> arm64: mm: support large block mapping when rodata=full
>
> arch/arm64/include/asm/cpufeature.h | 2 +
> arch/arm64/include/asm/mmu.h | 3 +
> arch/arm64/include/asm/pgtable.h | 5 +
> arch/arm64/kernel/cpufeature.c | 12 +-
> arch/arm64/mm/mmu.c | 418 +++++++++++++++++++++++++++-
> arch/arm64/mm/pageattr.c | 157 ++++++++---
> arch/arm64/mm/proc.S | 27 +-
> include/linux/pagewalk.h | 3 +
> mm/pagewalk.c | 36 ++-
> 9 files changed, 599 insertions(+), 64 deletions(-)
>
> --
> 2.43.0
>
Hi Yang and Ryan,
I see several call sites that ultimately reach update_range_prot() (from
patch 1) but do not check its return value. The ones I could find:

- set_memory_ro() in bpf_jit_comp.c
- set_memory_valid() in kernel_map_pages() in pageattr.c
- set_direct_map_invalid_noflush() in vm_reset_perms() in vmalloc.c
- set_direct_map_default_noflush() in vm_reset_perms() in vmalloc.c, and in
  secretmem.c (the secretmem.c ones should be safe, as explained in the
  comments there)

I think the first one can be handled easily by returning -EFAULT.
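
To illustrate the first case, here is a rough userspace sketch of propagating the failure as -EFAULT. Both mock_set_memory_ro() and mock_protect_image() are hypothetical stand-ins made up for this example; they are not the kernel's set_memory_ro() or the actual BPF JIT code, and the "simulate_split_failure" flag exists only to force the error path.

```c
#include <errno.h>

/*
 * Hypothetical stand-in for set_memory_ro(); NOT the kernel function.
 * Returns -ENOMEM when asked to simulate a failed page-table split.
 */
static int mock_set_memory_ro(unsigned long addr, int numpages,
			      int simulate_split_failure)
{
	(void)addr;
	(void)numpages;
	return simulate_split_failure ? -ENOMEM : 0;
}

/*
 * Hypothetical caller: instead of ignoring the return value, report
 * any failure to change permissions as -EFAULT.
 */
static int mock_protect_image(unsigned long addr, int numpages,
			      int simulate_split_failure)
{
	if (mock_set_memory_ro(addr, numpages, simulate_split_failure))
		return -EFAULT;
	return 0;
}
```

With these mocks, mock_protect_image(addr, 1, 0) returns 0 and mock_protect_image(addr, 1, 1) returns -EFAULT; the real call site would of course need its own error unwinding.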

For the second, we already return early in case of !can_set_direct_map,
which renders DEBUG_PAGEALLOC useless in that case. So maybe it is safe to
ignore the return value of set_memory_valid()?

For the third, the call chain is a sequence of must-succeed void functions.
Notably, when using vfree(), we may have to allocate a single page-table
page for splitting.

I am wondering whether we can just have a WARN_ON_ONCE() or similar for the
case where we fail to allocate a page-table page. Alternatively, Ryan
suggested in an off-list conversation that we could maintain a cache of PTE
tables, one for every PMD block mapping, which would give us the same memory
consumption as we have today, but I am not sure this is worth it. x86 can
already handle splitting, but because of the call chains described above it
has the same problem, and the code has been working for years :)
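
To make the PTE-table cache idea concrete, here is a minimal userspace sketch, assuming one table is reserved per PMD block at the time the block mapping is created, so that a later split never has to allocate. All names (mock_pmd_block, mock_map_block, mock_split_block) and the 512-entry table size are assumptions for illustration only; this is not code from the series.

```c
#include <errno.h>
#include <stdlib.h>

#define MOCK_PTRS_PER_TABLE 512	/* assumed table size, illustration only */

struct mock_pte_table {
	unsigned long entries[MOCK_PTRS_PER_TABLE];
};

/* Hypothetical model of one PMD entry in the linear map. */
struct mock_pmd_block {
	int is_block;			 /* 1 while mapped as a block */
	struct mock_pte_table *reserved; /* table set aside at map time */
	struct mock_pte_table *table;	 /* table in use after a split */
};

/* Creating the block mapping is the only step that can fail. */
static int mock_map_block(struct mock_pmd_block *pmd)
{
	pmd->reserved = calloc(1, sizeof(*pmd->reserved));
	if (!pmd->reserved)
		return -ENOMEM;
	pmd->table = NULL;
	pmd->is_block = 1;
	return 0;
}

/* Splitting consumes the reserved table, so it cannot fail. */
static void mock_split_block(struct mock_pmd_block *pmd)
{
	if (!pmd->is_block)
		return;
	pmd->table = pmd->reserved;
	pmd->reserved = NULL;
	pmd->is_block = 0;
}
```

The point of the sketch is only that the allocation moves from the must-succeed split path to the mapping path, at the cost of holding one table per block, which is the memory trade-off mentioned above.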