Message-ID: <aa932ea3-d408-40a1-a734-60ac8fc73a50@arm.com>
Date: Mon, 16 Jun 2025 13:33:29 +0100
From: Ryan Roberts <ryan.roberts@....com>
To: Yang Shi <yang@...amperecomputing.com>, will@...nel.org,
catalin.marinas@....com, Miko.Lenczewski@....com, dev.jain@....com,
scott@...amperecomputing.com, cl@...two.org
Cc: linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 3/4] arm64: mm: support large block mapping when
rodata=full
On 16/06/2025 12:58, Ryan Roberts wrote:
> On 31/05/2025 03:41, Yang Shi wrote:
>> When rodata=full is specified, the kernel linear mapping has to be mapped
>> at PTE level since large block mappings can't be split due to the
>> break-before-make rule on arm64.
>>
>> This results in a few problems:
>> - performance degradation
>> - more TLB pressure
>> - memory waste for kernel page table
>>
>> With FEAT_BBM level 2 support, splitting a large block mapping into
>> smaller ones no longer requires making the page table entry invalid.
>> This allows the kernel to split large block mappings on the fly.
>>
>> Add kernel page table split support and use large block mappings by
>> default for rodata=full when FEAT_BBM level 2 is supported. When
>> changing permissions on the kernel linear mapping, the page table will
>> be split to a smaller size as needed.
>>
>> Machines without FEAT_BBM level 2 will fall back to a PTE-mapped kernel
>> linear mapping when rodata=full.
>>
>> With this we saw a significant performance boost in some benchmarks and
>> much lower memory consumption on my AmpereOne machine (192 cores, 1P)
>> with 256GB memory.
>>
>> * Memory use after boot
>> Before:
>> MemTotal: 258988984 kB
>> MemFree: 254821700 kB
>>
>> After:
>> MemTotal: 259505132 kB
>> MemFree: 255410264 kB
>>
>> Around 500MB more memory is free to use. The larger the machine, the
>> more memory is saved.
>>
>> * Memcached
>> We saw performance degradation when running the Memcached benchmark with
>> rodata=full vs rodata=on. Our profiling pointed to kernel TLB pressure.
>> With this patchset, ops/sec increased by around 3.5% and P99 latency
>> was reduced by around 9.6%.
>> The gain mainly came from reduced kernel TLB misses; kernel TLB MPKI
>> was reduced by 28.5%.
>>
>> The benchmark data is now on par with rodata=on too.
>>
>> * Disk encryption (dm-crypt) benchmark
>> Ran the fio benchmark with the command below on a 128G ramdisk (ext4)
>> with disk encryption (dm-crypt).
>> fio --directory=/data --random_generator=lfsr --norandommap --randrepeat 1 \
>> --status-interval=999 --rw=write --bs=4k --loops=1 --ioengine=sync \
>> --iodepth=1 --numjobs=1 --fsync_on_close=1 --group_reporting --thread \
>> --name=iops-test-job --eta-newline=1 --size 100G
>>
>> IOPS increased by 90% - 150% (the variance is high, but the worst result
>> with the patchset is still around 90% better than the best result
>> without it). Bandwidth increased and average completion latency dropped
>> proportionally.
>>
>> * Sequential file read
>> Read a 100G file sequentially on XFS (xfs_io read with the page cache
>> populated). Bandwidth increased by 150%.
>>
>> Signed-off-by: Yang Shi <yang@...amperecomputing.com>
>> ---
>> arch/arm64/include/asm/cpufeature.h | 26 +++
>> arch/arm64/include/asm/mmu.h | 1 +
>> arch/arm64/include/asm/pgtable.h | 12 +-
>> arch/arm64/kernel/cpufeature.c | 2 +-
>> arch/arm64/mm/mmu.c | 269 +++++++++++++++++++++++++---
>> arch/arm64/mm/pageattr.c | 37 +++-
>> 6 files changed, 319 insertions(+), 28 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
>> index 8f36ffa16b73..a95806980298 100644
>> --- a/arch/arm64/include/asm/cpufeature.h
>> +++ b/arch/arm64/include/asm/cpufeature.h
>> @@ -1053,6 +1053,32 @@ static inline bool cpu_has_lpa2(void)
>> #endif
>> }
>>
>> +bool cpu_has_bbml2_noabort(unsigned int cpu_midr);
>> +
>> +static inline bool has_nobbml2_override(void)
>> +{
>> + u64 mmfr2;
>> + unsigned int bbm;
>> +
>> + mmfr2 = read_sysreg_s(SYS_ID_AA64MMFR2_EL1);
>> + mmfr2 &= ~id_aa64mmfr2_override.mask;
>> + mmfr2 |= id_aa64mmfr2_override.val;
>> + bbm = cpuid_feature_extract_unsigned_field(mmfr2,
>> + ID_AA64MMFR2_EL1_BBM_SHIFT);
>> + return bbm == 0;
>> +}
>> +
>> +/*
>> + * Called at early boot stage on boot CPU before cpu info and cpu feature
>> + * are ready.
>> + */
>> +static inline bool bbml2_noabort_available(void)
>> +{
>> + return IS_ENABLED(CONFIG_ARM64_BBML2_NOABORT) &&
>> + cpu_has_bbml2_noabort(read_cpuid_id()) &&
>> + !has_nobbml2_override();
>
> Based on Will's feedback, the Kconfig and the cmdline override will both
> disappear in Miko's next version and we will only use the MIDR list to decide
> BBML2_NOABORT status, so this will simplify significantly. Sorry about the
> churn here.
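
For illustration, once the Kconfig and the override go away, the early check
could collapse to something like the sketch below (purely illustrative of the
simplification, not Miko's actual patch):

static inline bool bbml2_noabort_available(void)
{
	/* Only the MIDR allow-list decides BBML2_NOABORT status. */
	return cpu_has_bbml2_noabort(read_cpuid_id());
}
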
>
>> +}
>> +
>> #endif /* __ASSEMBLY__ */
>>
>> #endif
>> diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
>> index 6e8aa8e72601..2693d63bf837 100644
>> --- a/arch/arm64/include/asm/mmu.h
>> +++ b/arch/arm64/include/asm/mmu.h
>> @@ -71,6 +71,7 @@ extern void create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys,
>> pgprot_t prot, bool page_mappings_only);
>> extern void *fixmap_remap_fdt(phys_addr_t dt_phys, int *size, pgprot_t prot);
>> extern void mark_linear_text_alias_ro(void);
>> +extern int split_linear_mapping(unsigned long start, unsigned long end);
>
> nit: Perhaps split_leaf_mapping() or split_kernel_pgtable_mapping() or
> something similar would be more generic, which will benefit us in future when
> using this for vmalloc too?
>
>>
>> /*
>> * This check is triggered during the early boot before the cpufeature
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index d3b538be1500..bf3cef31d243 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -293,6 +293,11 @@ static inline pmd_t pmd_mkcont(pmd_t pmd)
>> return __pmd(pmd_val(pmd) | PMD_SECT_CONT);
>> }
>>
>> +static inline pmd_t pmd_mknoncont(pmd_t pmd)
>> +{
>> + return __pmd(pmd_val(pmd) & ~PMD_SECT_CONT);
>> +}
>> +
>> static inline pte_t pte_mkdevmap(pte_t pte)
>> {
>> return set_pte_bit(pte, __pgprot(PTE_DEVMAP | PTE_SPECIAL));
>> @@ -769,7 +774,7 @@ static inline bool in_swapper_pgdir(void *addr)
>> ((unsigned long)swapper_pg_dir & PAGE_MASK);
>> }
>>
>> -static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
>> +static inline void __set_pmd_nosync(pmd_t *pmdp, pmd_t pmd)
>> {
>> #ifdef __PAGETABLE_PMD_FOLDED
>> if (in_swapper_pgdir(pmdp)) {
>> @@ -779,6 +784,11 @@ static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
>> #endif /* __PAGETABLE_PMD_FOLDED */
>>
>> WRITE_ONCE(*pmdp, pmd);
>> +}
>> +
>> +static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
>> +{
>> + __set_pmd_nosync(pmdp, pmd);
>>
>> if (pmd_valid(pmd)) {
>> dsb(ishst);
>> diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
>> index e879bfcf853b..5fc2a4a804de 100644
>> --- a/arch/arm64/kernel/cpufeature.c
>> +++ b/arch/arm64/kernel/cpufeature.c
>> @@ -2209,7 +2209,7 @@ static bool hvhe_possible(const struct arm64_cpu_capabilities *entry,
>> return arm64_test_sw_feature_override(ARM64_SW_FEATURE_OVERRIDE_HVHE);
>> }
>>
>> -static bool cpu_has_bbml2_noabort(unsigned int cpu_midr)
>> +bool cpu_has_bbml2_noabort(unsigned int cpu_midr)
>> {
>> /*
>> * We want to allow usage of bbml2 in as wide a range of kernel contexts
>
>
> [...] I'll send a separate response for the mmu.c table walker changes.
>
>>
>> +int split_linear_mapping(unsigned long start, unsigned long end)
>> +{
>> + int ret = 0;
>> +
>> + if (!system_supports_bbml2_noabort())
>> + return 0;
>
> Hmm... I guess the thinking here is that for !BBML2_NOABORT you are expecting
> this function to only be called in the first place if we know we are
> pte-mapped. It just means that if we are not pte-mapped, warnings will be
> emitted while walking the pgtables (as is the case today). So I think this
> approach is ok.
>
>> +
>> + mmap_write_lock(&init_mm);
>
> What is the lock protecting? I was originally thinking no locking should be
> needed because it's not needed for permission changes today; but I think you
> are right here and we do need locking; multiple owners could share a large
> leaf mapping, I guess? And in that case you could get concurrent attempts to
> split from both owners.
>
> I'm not really a fan of adding the extra locking though; it might introduce a
> new bottleneck. I wonder if there is a way we could do this locklessly? i.e.
> allocate the new table, then cmpxchg to insert, and the loser has to free?
> That doesn't work for contiguous mappings though...
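
To sketch the lockless idea for a single PMD block (illustrative only;
split_block_lockless(), pte_from_block() and free_unused_table() are made-up
names, and as noted this doesn't cover contiguous mappings):

static int split_block_lockless(pmd_t *pmdp, unsigned long addr)
{
	pmd_t old = READ_ONCE(*pmdp);
	phys_addr_t pte_phys = __pgd_pgtable_alloc(PAGE_SHIFT);
	pte_t *ptep = (pte_t *)__va(pte_phys);
	pmd_t table = __pmd(__phys_to_pmd_val(pte_phys) | PMD_TYPE_TABLE);
	int i;

	/* Build the replacement table before it becomes reachable. */
	for (i = 0; i < PTRS_PER_PTE; i++)
		__set_pte_nosync(ptep + i, pte_from_block(old, addr, i));
	dsb(ishst);

	/* Winner installs the table; the loser frees its unused copy. */
	if (cmpxchg_relaxed(&pmd_val(*pmdp), pmd_val(old), pmd_val(table)) !=
	    pmd_val(old))
		free_unused_table(pte_phys);	/* hypothetical free helper */

	return 0;
}
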
>
>> + /* NO_EXEC_MAPPINGS is needed when splitting linear map */
>> + ret = __create_pgd_mapping_locked(init_mm.pgd, virt_to_phys((void *)start),
>> + start, (end - start), __pgprot(0),
>> + __pgd_pgtable_alloc,
>> + NO_EXEC_MAPPINGS | SPLIT_MAPPINGS);
>> + mmap_write_unlock(&init_mm);
>> + flush_tlb_kernel_range(start, end);
>
> I don't believe we should need to flush the TLB when only changing entry sizes
> if BBML2 is supported. Miko's series has a massive comment explaining the
> reasoning. That applies to user space though; we should consider whether this
> all works safely for kernel space too, and hopefully remove the flush.
>
>> +
>> + return ret;
>> +}
>> +
>> /*
>> * This function can only be used to modify existing table entries,
>> * without allocating new levels of table. Note that this permits the
>> @@ -676,6 +887,24 @@ static inline void arm64_kfence_map_pool(phys_addr_t kfence_pool, pgd_t *pgdp) {
>>
>> #endif /* CONFIG_KFENCE */
>>
>> +static inline bool force_pte_mapping(void)
>> +{
>> + /*
>> + * Can't use cpufeature API to determine whether BBML2 supported
>> + * or not since cpufeature have not been finalized yet.
>> + *
>> + * Checking the boot CPU only for now. If the boot CPU has
>> + * BBML2, paint linear mapping with block mapping. If it turns
>> + * out the secondary CPUs don't support BBML2 once cpufeature is
>> + * fininalized, the linear mapping will be repainted with PTE
>> + * mapping.
>> + */
>> + return (rodata_full && !bbml2_noabort_available()) ||
>
> So this is the case where we don't have BBML2 and need to modify protections at
> page granularity - I agree we need to force pte mappings here.
>
>> + debug_pagealloc_enabled() ||
>
> This is the case where every page is made invalid on free and valid on
> allocation, so there is no point in having block mappings; they would soon
> degenerate into page mappings because we would have to split on every
> allocation. Agree here too.
>
>> + arm64_kfence_can_set_direct_map() ||
>
> After looking into how kfence works, I don't agree with this one. It has a
> dedicated pool from which it allocates. That pool may be allocated early by
> the arch or may be allocated late by the core code. Either way, kfence will
> only modify protections within that pool. Your current approach is forcing pte
> mappings if the pool allocation is late (i.e. not performed by the arch code
> during boot). But I think "late" is the most common case; kfence is compiled
> into the kernel but not active at boot. Certainly that's how my Ubuntu kernel
> is configured. So I think we should just ignore kfence here. If it's "early"
> then we map the pool with page granularity (as an optimization). If it's
> "late", your splitter will degenerate the whole kfence pool to page mappings
> over time as kfence_protect_page() -> set_memory_valid() is called. But the
> bulk of the linear map will remain mapped with large blocks.
>
>> + is_realm_world();
>
> I think the only reason this requires pte mappings is for
> __set_memory_enc_dec(). But that can now deal with block mappings given the
> ability to split the mappings as needed. So I think this condition can be
> removed too.
To clarify: the latter two would still be needed for the !BBML2_NOABORT case.
So I think the expression becomes:

	return (!bbml2_noabort_available() && (rodata_full ||
		arm64_kfence_can_set_direct_map() || is_realm_world())) ||
	       debug_pagealloc_enabled();

Thanks,
Ryan
>
>> +}
>
> Additionally, for can_set_direct_map(): at minimum its comment should be tidied
> up, but really I think it should return true if "BBML2_NOABORT ||
> force_pte_mapping()", because those are the conditions under which we can now
> safely modify the linear map.
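
Something like the following (a sketch of that suggestion only; it assumes
force_pte_mapping() is made visible to pageattr.c):

bool can_set_direct_map(void)
{
	/*
	 * The linear map can be safely modified either because block
	 * mappings can be split live (BBML2_NOABORT) or because it is
	 * already pte-mapped.
	 */
	return system_supports_bbml2_noabort() || force_pte_mapping();
}
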
>
>> +
>> static void __init map_mem(pgd_t *pgdp)
>> {
>> static const u64 direct_map_end = _PAGE_END(VA_BITS_MIN);
>> @@ -701,7 +930,7 @@ static void __init map_mem(pgd_t *pgdp)
>>
>> early_kfence_pool = arm64_kfence_alloc_pool();
>>
>> - if (can_set_direct_map())
>> + if (force_pte_mapping())
>> flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
>>
>> /*
>> @@ -1402,7 +1631,7 @@ int arch_add_memory(int nid, u64 start, u64 size,
>>
>> VM_BUG_ON(!mhp_range_allowed(start, size, true));
>>
>> - if (can_set_direct_map())
>> + if (force_pte_mapping())
>> flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
>>
>> __create_pgd_mapping(swapper_pg_dir, start, __phys_to_virt(start),
>> diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
>> index 39fd1f7ff02a..25c068712cb5 100644
>> --- a/arch/arm64/mm/pageattr.c
>> +++ b/arch/arm64/mm/pageattr.c
>> @@ -10,6 +10,7 @@
>> #include <linux/vmalloc.h>
>>
>> #include <asm/cacheflush.h>
>> +#include <asm/mmu.h>
>> #include <asm/pgtable-prot.h>
>> #include <asm/set_memory.h>
>> #include <asm/tlbflush.h>
>> @@ -42,6 +43,8 @@ static int change_page_range(pte_t *ptep, unsigned long addr, void *data)
>> struct page_change_data *cdata = data;
>> pte_t pte = __ptep_get(ptep);
>>
>> + BUG_ON(pte_cont(pte));
>
> I don't think this is required; we want to enable using contiguous mappings
> where it makes sense. As long as we have BBML2, we can update contiguous pte
> mappings in place, provided we update all of the ptes in the contiguous block.
> split_linear_mapping() should either have converted to non-cont mappings if
> the contiguous block straddled the split point, or have left it as is (or
> downgraded a PMD-block to a contpte block) if it is fully contained within
> the split range.
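
For illustration, repainting a whole contpte block in place (with
BBML2_NOABORT) is just a loop over all CONT_PTES entries; update_contpte_block()
below is a made-up name, not code from the patch:

static void update_contpte_block(pte_t *ptep, unsigned long addr,
				 pgprot_t set_mask, pgprot_t clear_mask)
{
	unsigned long start = ALIGN_DOWN(addr, CONT_PTE_SIZE);
	pte_t *first = ptep - (addr - start) / PAGE_SIZE;
	int i;

	/* Update every pte in the contiguous block, not just the one hit. */
	for (i = 0; i < CONT_PTES; i++) {
		pte_t pte = __ptep_get(first + i);

		pte = clear_pte_bit(pte, clear_mask);
		pte = set_pte_bit(pte, set_mask);
		__set_pte(first + i, pte);
	}
}
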
>
>> +
>> pte = clear_pte_bit(pte, cdata->clear_mask);
>> pte = set_pte_bit(pte, cdata->set_mask);
>>
>> @@ -80,8 +83,9 @@ static int change_memory_common(unsigned long addr, int numpages,
>> unsigned long start = addr;
>> unsigned long size = PAGE_SIZE * numpages;
>> unsigned long end = start + size;
>> + unsigned long l_start;
>> struct vm_struct *area;
>> - int i;
>> + int i, ret;
>>
>> if (!PAGE_ALIGNED(addr)) {
>> start &= PAGE_MASK;
>> @@ -118,7 +122,12 @@ static int change_memory_common(unsigned long addr, int numpages,
>> if (rodata_full && (pgprot_val(set_mask) == PTE_RDONLY ||
>> pgprot_val(clear_mask) == PTE_RDONLY)) {
>> for (i = 0; i < area->nr_pages; i++) {
>> - __change_memory_common((u64)page_address(area->pages[i]),
>> + l_start = (u64)page_address(area->pages[i]);
>> + ret = split_linear_mapping(l_start, l_start + PAGE_SIZE);
>> + if (WARN_ON_ONCE(ret))
>> + return ret;
>
> I don't think this is the right place to integrate; I think the split should be
> done inside __change_memory_common(). Then it caters to all possibilities (i.e.
> set_memory_valid() and __set_memory_enc_dec()). This means it will run for
> vmalloc too, but for now, that will be a nop because everything should already
> be split as required on entry, and in future we will get that for free.
>
> Once you have integrated Dev's series, the hook becomes
> ___change_memory_common() (3 underscores)...
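
For reference, the hook could slot in roughly like this (a sketch against the
current 2-underscore helper, before Dev's series; the trailing TLB maintenance
is elided):

static int __change_memory_common(unsigned long start, unsigned long size,
				  pgprot_t set_mask, pgprot_t clear_mask)
{
	struct page_change_data data = {
		.set_mask = set_mask,
		.clear_mask = clear_mask,
	};
	int ret;

	/* Split any block mappings overlapping [start, start + size) first. */
	ret = split_linear_mapping(start, start + size);
	if (WARN_ON_ONCE(ret))
		return ret;

	return apply_to_page_range(&init_mm, start, size, change_page_range,
				   &data);
}
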
>
>> +
>> + __change_memory_common(l_start,
>> PAGE_SIZE, set_mask, clear_mask);
>> }
>> }
>> @@ -174,6 +183,9 @@ int set_memory_valid(unsigned long addr, int numpages, int enable)
>>
>> int set_direct_map_invalid_noflush(struct page *page)
>> {
>> + unsigned long l_start;
>> + int ret;
>> +
>> struct page_change_data data = {
>> .set_mask = __pgprot(0),
>> .clear_mask = __pgprot(PTE_VALID),
>> @@ -182,13 +194,21 @@ int set_direct_map_invalid_noflush(struct page *page)
>> if (!can_set_direct_map())
>> return 0;
>>
>> + l_start = (unsigned long)page_address(page);
>> + ret = split_linear_mapping(l_start, l_start + PAGE_SIZE);
>> + if (WARN_ON_ONCE(ret))
>> + return ret;
>> +
>> return apply_to_page_range(&init_mm,
>> - (unsigned long)page_address(page),
>> - PAGE_SIZE, change_page_range, &data);
>> + l_start, PAGE_SIZE, change_page_range,
>> + &data);
>
> ...and once integrated with Dev's series you don't need any changes here...
>
>> }
>>
>> int set_direct_map_default_noflush(struct page *page)
>> {
>> + unsigned long l_start;
>> + int ret;
>> +
>> struct page_change_data data = {
>> .set_mask = __pgprot(PTE_VALID | PTE_WRITE),
>> .clear_mask = __pgprot(PTE_RDONLY),
>> @@ -197,9 +217,14 @@ int set_direct_map_default_noflush(struct page *page)
>> if (!can_set_direct_map())
>> return 0;
>>
>> + l_start = (unsigned long)page_address(page);
>> + ret = split_linear_mapping(l_start, l_start + PAGE_SIZE);
>> + if (WARN_ON_ONCE(ret))
>> + return ret;
>> +
>> return apply_to_page_range(&init_mm,
>> - (unsigned long)page_address(page),
>> - PAGE_SIZE, change_page_range, &data);
>> + l_start, PAGE_SIZE, change_page_range,
>> + &data);
>
> ...or here.
>
> Thanks,
> Ryan
>
>> }
>>
>> static int __set_memory_enc_dec(unsigned long addr,
>