Message-ID: <aa932ea3-d408-40a1-a734-60ac8fc73a50@arm.com>
Date: Mon, 16 Jun 2025 13:33:29 +0100
From: Ryan Roberts <ryan.roberts@....com>
To: Yang Shi <yang@...amperecomputing.com>, will@...nel.org,
catalin.marinas@....com, Miko.Lenczewski@....com, dev.jain@....com,
scott@...amperecomputing.com, cl@...two.org
Cc: linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 3/4] arm64: mm: support large block mapping when
rodata=full
On 16/06/2025 12:58, Ryan Roberts wrote:
> On 31/05/2025 03:41, Yang Shi wrote:
>> When rodata=full is specified, the kernel linear mapping has to be mapped
>> at PTE level since large block mappings can't be split due to the
>> break-before-make rule on arm64.
>>
>> This results in a few problems:
>> - performance degradation
>> - more TLB pressure
>> - memory waste for kernel page table
>>
>> With FEAT_BBM level 2 support, splitting a large block mapping into
>> smaller ones no longer requires making the page table entry invalid.
>> This allows the kernel to split large block mappings on the fly.
>>
>> Add kernel page table split support and use large block mappings by
>> default for rodata=full when FEAT_BBM level 2 is supported. When
>> changing permissions on the kernel linear mapping, the page table will
>> be split to a smaller size as needed.
>>
>> Machines without FEAT_BBM level 2 will fall back to a PTE-mapped kernel
>> linear mapping when rodata=full.
>>
>> With this we saw a significant performance boost in some benchmarks and
>> much lower memory consumption on my AmpereOne machine (192 cores, 1P)
>> with 256GB memory.
>>
>> * Memory use after boot
>> Before:
>> MemTotal: 258988984 kB
>> MemFree: 254821700 kB
>>
>> After:
>> MemTotal: 259505132 kB
>> MemFree: 255410264 kB
>>
>> Around 500MB more memory is free to use. The larger the machine, the
>> more memory is saved.
>>
>> * Memcached
>> We saw performance degradation when running the Memcached benchmark with
>> rodata=full vs rodata=on. Our profiling pointed to kernel TLB pressure.
>> With this patchset, ops/sec increased by around 3.5% and P99 latency
>> was reduced by around 9.6%.
>> The gain mainly came from reduced kernel TLB misses; kernel TLB MPKI
>> was reduced by 28.5%.
>>
>> The benchmark data is now on par with rodata=on too.
>>
>> * Disk encryption (dm-crypt) benchmark
>> Ran the fio benchmark with the command below on a 128G ramdisk (ext4)
>> with disk encryption (dm-crypt).
>> fio --directory=/data --random_generator=lfsr --norandommap --randrepeat 1 \
>> --status-interval=999 --rw=write --bs=4k --loops=1 --ioengine=sync \
>> --iodepth=1 --numjobs=1 --fsync_on_close=1 --group_reporting --thread \
>> --name=iops-test-job --eta-newline=1 --size 100G
>>
>> IOPS increased by 90% - 150% (the variance is high, but the worst result
>> with the patchset is still around 90% better than the best result
>> without it). Bandwidth increased and average completion latency dropped
>> proportionally.
>>
>> * Sequential file read
>> Read a 100G file sequentially on XFS (xfs_io read with the page cache
>> populated). Bandwidth increased by 150%.
>>
>> Signed-off-by: Yang Shi <yang@...amperecomputing.com>
>> ---
>> arch/arm64/include/asm/cpufeature.h | 26 +++
>> arch/arm64/include/asm/mmu.h | 1 +
>> arch/arm64/include/asm/pgtable.h | 12 +-
>> arch/arm64/kernel/cpufeature.c | 2 +-
>> arch/arm64/mm/mmu.c | 269 +++++++++++++++++++++++++---
>> arch/arm64/mm/pageattr.c | 37 +++-
>> 6 files changed, 319 insertions(+), 28 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
>> index 8f36ffa16b73..a95806980298 100644
>> --- a/arch/arm64/include/asm/cpufeature.h
>> +++ b/arch/arm64/include/asm/cpufeature.h
>> @@ -1053,6 +1053,32 @@ static inline bool cpu_has_lpa2(void)
>> #endif
>> }
>>
>> +bool cpu_has_bbml2_noabort(unsigned int cpu_midr);
>> +
>> +static inline bool has_nobbml2_override(void)
>> +{
>> + u64 mmfr2;
>> + unsigned int bbm;
>> +
>> + mmfr2 = read_sysreg_s(SYS_ID_AA64MMFR2_EL1);
>> + mmfr2 &= ~id_aa64mmfr2_override.mask;
>> + mmfr2 |= id_aa64mmfr2_override.val;
>> + bbm = cpuid_feature_extract_unsigned_field(mmfr2,
>> + ID_AA64MMFR2_EL1_BBM_SHIFT);
>> + return bbm == 0;
>> +}
>> +
>> +/*
>> + * Called at early boot stage on boot CPU before cpu info and cpu feature
>> + * are ready.
>> + */
>> +static inline bool bbml2_noabort_available(void)
>> +{
>> + return IS_ENABLED(CONFIG_ARM64_BBML2_NOABORT) &&
>> + cpu_has_bbml2_noabort(read_cpuid_id()) &&
>> + !has_nobbml2_override();
>
> Based on Will's feedback, the Kconfig and the cmdline override will both
> disappear in Miko's next version and we will only use the MIDR list to decide
> BBML2_NOABORT status, so this will simplify significantly. Sorry about the
> churn here.
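
For illustration, once the Kconfig and the override go away, the early check
could collapse to something like the sketch below (purely illustrative of the
simplification, not Miko's actual patch):

static inline bool bbml2_noabort_available(void)
{
	/* Only the MIDR allow-list decides BBML2_NOABORT status. */
	return cpu_has_bbml2_noabort(read_cpuid_id());
}
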
>
>> +}
>> +
>> #endif /* __ASSEMBLY__ */
>>
>> #endif
>> diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
>> index 6e8aa8e72601..2693d63bf837 100644
>> --- a/arch/arm64/include/asm/mmu.h
>> +++ b/arch/arm64/include/asm/mmu.h
>> @@ -71,6 +71,7 @@ extern void create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys,
>> pgprot_t prot, bool page_mappings_only);
>> extern void *fixmap_remap_fdt(phys_addr_t dt_phys, int *size, pgprot_t prot);
>> extern void mark_linear_text_alias_ro(void);
>> +extern int split_linear_mapping(unsigned long start, unsigned long end);
>
> nit: Perhaps split_leaf_mapping() or split_kernel_pgtable_mapping() or
> something similar would be more generic, which will benefit us in future when
> using this for vmalloc too?
>
>>
>> /*
>> * This check is triggered during the early boot before the cpufeature
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index d3b538be1500..bf3cef31d243 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -293,6 +293,11 @@ static inline pmd_t pmd_mkcont(pmd_t pmd)
>> return __pmd(pmd_val(pmd) | PMD_SECT_CONT);
>> }
>>
>> +static inline pmd_t pmd_mknoncont(pmd_t pmd)
>> +{
>> + return __pmd(pmd_val(pmd) & ~PMD_SECT_CONT);
>> +}
>> +
>> static inline pte_t pte_mkdevmap(pte_t pte)
>> {
>> return set_pte_bit(pte, __pgprot(PTE_DEVMAP | PTE_SPECIAL));
>> @@ -769,7 +774,7 @@ static inline bool in_swapper_pgdir(void *addr)
>> ((unsigned long)swapper_pg_dir & PAGE_MASK);
>> }
>>
>> -static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
>> +static inline void __set_pmd_nosync(pmd_t *pmdp, pmd_t pmd)
>> {
>> #ifdef __PAGETABLE_PMD_FOLDED
>> if (in_swapper_pgdir(pmdp)) {
>> @@ -779,6 +784,11 @@ static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
>> #endif /* __PAGETABLE_PMD_FOLDED */
>>
>> WRITE_ONCE(*pmdp, pmd);
>> +}
>> +
>> +static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
>> +{
>> + __set_pmd_nosync(pmdp, pmd);
>>
>> if (pmd_valid(pmd)) {
>> dsb(ishst);
>> diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
>> index e879bfcf853b..5fc2a4a804de 100644
>> --- a/arch/arm64/kernel/cpufeature.c
>> +++ b/arch/arm64/kernel/cpufeature.c
>> @@ -2209,7 +2209,7 @@ static bool hvhe_possible(const struct arm64_cpu_capabilities *entry,
>> return arm64_test_sw_feature_override(ARM64_SW_FEATURE_OVERRIDE_HVHE);
>> }
>>
>> -static bool cpu_has_bbml2_noabort(unsigned int cpu_midr)
>> +bool cpu_has_bbml2_noabort(unsigned int cpu_midr)
>> {
>> /*
>> * We want to allow usage of bbml2 in as wide a range of kernel contexts
>
>
> [...] I'll send a separate response for the mmu.c table walker changes.
>
>>
>> +int split_linear_mapping(unsigned long start, unsigned long end)
>> +{
>> + int ret = 0;
>> +
>> + if (!system_supports_bbml2_noabort())
>> + return 0;
>
> Hmm... I guess the thinking here is that for !BBML2_NOABORT you are expecting
> this function to only be called in the first place if we know we are
> pte-mapped. It just means that if we are not pte-mapped, warnings will be
> emitted while walking the pgtables (as is the case today). So I think this
> approach is ok.
>
>> +
>> + mmap_write_lock(&init_mm);
>
> What is the lock protecting? I was originally thinking no locking should be
> needed because it's not needed for permission changes today; but I think you
> are right here and we do need locking; multiple owners could share a large
> leaf mapping, I guess? And in that case you could get concurrent attempts to
> split from both owners.
>
> I'm not really a fan of adding the extra locking though; it might introduce a
> new bottleneck. I wonder if there is a way we could do this locklessly? i.e.
> allocate the new table, then cmpxchg to insert, and the loser has to free?
> That doesn't work for contiguous mappings though...
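
To sketch the lockless idea for a single PMD block (illustrative only;
split_block_lockless(), pte_from_block() and free_unused_table() are made-up
names, and as noted this doesn't cover contiguous mappings):

static int split_block_lockless(pmd_t *pmdp, unsigned long addr)
{
	pmd_t old = READ_ONCE(*pmdp);
	phys_addr_t pte_phys = __pgd_pgtable_alloc(PAGE_SHIFT);
	pte_t *ptep = (pte_t *)__va(pte_phys);
	pmd_t table = __pmd(__phys_to_pmd_val(pte_phys) | PMD_TYPE_TABLE);
	int i;

	/* Build the replacement table before it becomes reachable. */
	for (i = 0; i < PTRS_PER_PTE; i++)
		__set_pte_nosync(ptep + i, pte_from_block(old, addr, i));
	dsb(ishst);

	/* Winner installs the table; the loser frees its unused copy. */
	if (cmpxchg_relaxed(&pmd_val(*pmdp), pmd_val(old), pmd_val(table)) !=
	    pmd_val(old))
		free_unused_table(pte_phys);	/* hypothetical free helper */

	return 0;
}
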
>
>> + /* NO_EXEC_MAPPINGS is needed when splitting linear map */
>> + ret = __create_pgd_mapping_locked(init_mm.pgd, virt_to_phys((void *)start),
>> + start, (end - start), __pgprot(0),
>> + __pgd_pgtable_alloc,
>> + NO_EXEC_MAPPINGS | SPLIT_MAPPINGS);
>> + mmap_write_unlock(&init_mm);
>> + flush_tlb_kernel_range(start, end);
>
> I don't believe we should need to flush the TLB when only changing entry sizes
> if BBML2 is supported. Miko's series has a massive comment explaining the
> reasoning. That applies to user space though; we should consider whether this
> all works safely for kernel space too, and hopefully remove the flush.
>
>> +
>> + return ret;
>> +}
>> +
>> /*
>> * This function can only be used to modify existing table entries,
>> * without allocating new levels of table. Note that this permits the
>> @@ -676,6 +887,24 @@ static inline void arm64_kfence_map_pool(phys_addr_t kfence_pool, pgd_t *pgdp) {
>>
>> #endif /* CONFIG_KFENCE */
>>
>> +static inline bool force_pte_mapping(void)
>> +{
>> + /*
>> + * Can't use cpufeature API to determine whether BBML2 supported
>> + * or not since cpufeature have not been finalized yet.
>> + *
>> + * Checking the boot CPU only for now. If the boot CPU has
>> + * BBML2, paint linear mapping with block mapping. If it turns
>> + * out the secondary CPUs don't support BBML2 once cpufeature is
>> + * fininalized, the linear mapping will be repainted with PTE
>> + * mapping.
>> + */
>> + return (rodata_full && !bbml2_noabort_available()) ||
>
> So this is the case where we don't have BBML2 and need to modify protections at
> page granularity - I agree we need to force pte mappings here.
>
>> + debug_pagealloc_enabled() ||
>
> This is the case where every page is made invalid on free and valid on
> allocation, so there is no point in having block mappings; they would soon
> degenerate into page mappings because we would have to split on every
> allocation. Agree here too.
>
>> + arm64_kfence_can_set_direct_map() ||
>
> After looking into how kfence works, I don't agree with this one. It has a
> dedicated pool from which it allocates. That pool may be allocated early by
> the arch or may be allocated late by the core code. Either way, kfence will
> only modify protections within that pool. Your current approach is forcing pte
> mappings if the pool allocation is late (i.e. not performed by the arch code
> during boot). But I think "late" is the most common case; kfence is compiled
> into the kernel but not active at boot. Certainly that's how my Ubuntu kernel
> is configured. So I think we should just ignore kfence here. If it's "early"
> then we map the pool with page granularity (as an optimization). If it's
> "late", your splitter will degenerate the whole kfence pool to page mappings
> over time as kfence_protect_page() -> set_memory_valid() is called. But the
> bulk of the linear map will remain mapped with large blocks.
>
>> + is_realm_world();
>
> I think the only reason this requires pte mappings is for
> __set_memory_enc_dec(). But that can now deal with block mappings given the
> ability to split the mappings as needed. So I think this condition can be
> removed too.
To clarify: the latter two would still be needed for the !BBML2_NOABORT case.
So I think the expression becomes:

	return (!bbml2_noabort_available() && (rodata_full ||
		arm64_kfence_can_set_direct_map() || is_realm_world())) ||
	       debug_pagealloc_enabled();

Thanks,
Ryan
>
>> +}
>
> Additionally, for can_set_direct_map(): at minimum its comment should be tidied
> up, but really I think it should return true if "BBML2_NOABORT ||
> force_pte_mapping()", because those are the conditions under which we can now
> safely modify the linear map.
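
Something like the following (a sketch of that suggestion only; it assumes
force_pte_mapping() is made visible to pageattr.c):

bool can_set_direct_map(void)
{
	/*
	 * The linear map can be safely modified either because block
	 * mappings can be split live (BBML2_NOABORT) or because it is
	 * already pte-mapped.
	 */
	return system_supports_bbml2_noabort() || force_pte_mapping();
}
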
>
>> +
>> static void __init map_mem(pgd_t *pgdp)
>> {
>> static const u64 direct_map_end = _PAGE_END(VA_BITS_MIN);
>> @@ -701,7 +930,7 @@ static void __init map_mem(pgd_t *pgdp)
>>
>> early_kfence_pool = arm64_kfence_alloc_pool();
>>
>> - if (can_set_direct_map())
>> + if (force_pte_mapping())
>> flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
>>
>> /*
>> @@ -1402,7 +1631,7 @@ int arch_add_memory(int nid, u64 start, u64 size,
>>
>> VM_BUG_ON(!mhp_range_allowed(start, size, true));
>>
>> - if (can_set_direct_map())
>> + if (force_pte_mapping())
>> flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
>>
>> __create_pgd_mapping(swapper_pg_dir, start, __phys_to_virt(start),
>> diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
>> index 39fd1f7ff02a..25c068712cb5 100644
>> --- a/arch/arm64/mm/pageattr.c
>> +++ b/arch/arm64/mm/pageattr.c
>> @@ -10,6 +10,7 @@
>> #include <linux/vmalloc.h>
>>
>> #include <asm/cacheflush.h>
>> +#include <asm/mmu.h>
>> #include <asm/pgtable-prot.h>
>> #include <asm/set_memory.h>
>> #include <asm/tlbflush.h>
>> @@ -42,6 +43,8 @@ static int change_page_range(pte_t *ptep, unsigned long addr, void *data)
>> struct page_change_data *cdata = data;
>> pte_t pte = __ptep_get(ptep);
>>
>> + BUG_ON(pte_cont(pte));
>
> I don't think this is required; we want to enable using contiguous mappings
> where it makes sense. As long as we have BBML2, we can update contiguous pte
> mappings in place, provided we update all of the ptes in the contiguous block.
> split_linear_mapping() should either have converted to non-cont mappings if
> the contiguous block straddled the split point, or have left it as is (or
> downgraded a PMD-block to a contpte block) if it is fully contained within
> the split range.
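
For illustration, repainting a whole contpte block in place (with
BBML2_NOABORT) is just a loop over all CONT_PTES entries; update_contpte_block()
below is a made-up name, not code from the patch:

static void update_contpte_block(pte_t *ptep, unsigned long addr,
				 pgprot_t set_mask, pgprot_t clear_mask)
{
	unsigned long start = ALIGN_DOWN(addr, CONT_PTE_SIZE);
	pte_t *first = ptep - (addr - start) / PAGE_SIZE;
	int i;

	/* Update every pte in the contiguous block, not just the one hit. */
	for (i = 0; i < CONT_PTES; i++) {
		pte_t pte = __ptep_get(first + i);

		pte = clear_pte_bit(pte, clear_mask);
		pte = set_pte_bit(pte, set_mask);
		__set_pte(first + i, pte);
	}
}
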
>
>> +
>> pte = clear_pte_bit(pte, cdata->clear_mask);
>> pte = set_pte_bit(pte, cdata->set_mask);
>>
>> @@ -80,8 +83,9 @@ static int change_memory_common(unsigned long addr, int numpages,
>> unsigned long start = addr;
>> unsigned long size = PAGE_SIZE * numpages;
>> unsigned long end = start + size;
>> + unsigned long l_start;
>> struct vm_struct *area;
>> - int i;
>> + int i, ret;
>>
>> if (!PAGE_ALIGNED(addr)) {
>> start &= PAGE_MASK;
>> @@ -118,7 +122,12 @@ static int change_memory_common(unsigned long addr, int numpages,
>> if (rodata_full && (pgprot_val(set_mask) == PTE_RDONLY ||
>> pgprot_val(clear_mask) == PTE_RDONLY)) {
>> for (i = 0; i < area->nr_pages; i++) {
>> - __change_memory_common((u64)page_address(area->pages[i]),
>> + l_start = (u64)page_address(area->pages[i]);
>> + ret = split_linear_mapping(l_start, l_start + PAGE_SIZE);
>> + if (WARN_ON_ONCE(ret))
>> + return ret;
>
> I don't think this is the right place to integrate; I think the split should be
> done inside __change_memory_common(). Then it caters to all possibilities (i.e.
> set_memory_valid() and __set_memory_enc_dec()). This means it will run for
> vmalloc too, but for now, that will be a nop because everything should already
> be split as required on entry, and in future we will get that for free.
>
> Once you have integrated Dev's series, the hook becomes
> ___change_memory_common() (3 underscores)...
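
For reference, the hook could slot in roughly like this (a sketch against the
current 2-underscore helper, before Dev's series; the trailing TLB maintenance
is elided):

static int __change_memory_common(unsigned long start, unsigned long size,
				  pgprot_t set_mask, pgprot_t clear_mask)
{
	struct page_change_data data = {
		.set_mask = set_mask,
		.clear_mask = clear_mask,
	};
	int ret;

	/* Split any block mappings overlapping [start, start + size) first. */
	ret = split_linear_mapping(start, start + size);
	if (WARN_ON_ONCE(ret))
		return ret;

	return apply_to_page_range(&init_mm, start, size, change_page_range,
				   &data);
}
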
>
>> +
>> + __change_memory_common(l_start,
>> PAGE_SIZE, set_mask, clear_mask);
>> }
>> }
>> @@ -174,6 +183,9 @@ int set_memory_valid(unsigned long addr, int numpages, int enable)
>>
>> int set_direct_map_invalid_noflush(struct page *page)
>> {
>> + unsigned long l_start;
>> + int ret;
>> +
>> struct page_change_data data = {
>> .set_mask = __pgprot(0),
>> .clear_mask = __pgprot(PTE_VALID),
>> @@ -182,13 +194,21 @@ int set_direct_map_invalid_noflush(struct page *page)
>> if (!can_set_direct_map())
>> return 0;
>>
>> + l_start = (unsigned long)page_address(page);
>> + ret = split_linear_mapping(l_start, l_start + PAGE_SIZE);
>> + if (WARN_ON_ONCE(ret))
>> + return ret;
>> +
>> return apply_to_page_range(&init_mm,
>> - (unsigned long)page_address(page),
>> - PAGE_SIZE, change_page_range, &data);
>> + l_start, PAGE_SIZE, change_page_range,
>> + &data);
>
> ...and once integrated with Dev's series you don't need any changes here...
>
>> }
>>
>> int set_direct_map_default_noflush(struct page *page)
>> {
>> + unsigned long l_start;
>> + int ret;
>> +
>> struct page_change_data data = {
>> .set_mask = __pgprot(PTE_VALID | PTE_WRITE),
>> .clear_mask = __pgprot(PTE_RDONLY),
>> @@ -197,9 +217,14 @@ int set_direct_map_default_noflush(struct page *page)
>> if (!can_set_direct_map())
>> return 0;
>>
>> + l_start = (unsigned long)page_address(page);
>> + ret = split_linear_mapping(l_start, l_start + PAGE_SIZE);
>> + if (WARN_ON_ONCE(ret))
>> + return ret;
>> +
>> return apply_to_page_range(&init_mm,
>> - (unsigned long)page_address(page),
>> - PAGE_SIZE, change_page_range, &data);
>> + l_start, PAGE_SIZE, change_page_range,
>> + &data);
>
> ...or here.
>
> Thanks,
> Ryan
>
>> }
>>
>> static int __set_memory_enc_dec(unsigned long addr,
>