linux-kernel - Re: [PATCH v1] arm64: mm: Don't sleep in split_kernel_leaf

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <3611cfeb-53d5-4db5-95a1-1d095edfc3c9@arm.com>
Date: Mon, 3 Nov 2025 16:28:44 +0000
From: Ryan Roberts <ryan.roberts@....com>
To: Will Deacon <will@...nel.org>
Cc: catalin.marinas@....com, yang@...amperecomputing.com, david@...hat.com,
 ardb@...nel.org, dev.jain@....com, scott@...amperecomputing.com,
 cl@...two.org, linux-arm-kernel@...ts.infradead.org,
 linux-kernel@...r.kernel.org, Guenter Roeck <groeck@...gle.com>
Subject: Re: [PATCH v1] arm64: mm: Don't sleep in split_kernel_leaf_mapping()
 when in atomic context

Thanks for the fast review!


On 03/11/2025 15:37, Will Deacon wrote:
> On Mon, Nov 03, 2025 at 12:57:37PM +0000, Ryan Roberts wrote:
>> It has been reported that split_kernel_leaf_mapping() is trying to sleep
>> in non-sleepable context. It does this when acquiring the
>> pgtable_split_lock mutex, when either CONFIG_DEBUG_PAGEALLOC or
>> CONFIG_KFENCE are enabled, which change linear map permissions within
>> softirq context during memory allocation and/or freeing. All other paths
>> into this function are called from sleepable context and so are safe.
>>
>> But it turns out that the memory for which these 2 features may attempt
>> to modify the permissions is always mapped by pte, so there is no need
>> to attempt to split the mapping. So let's exit early in these cases and
>> avoid attempting to take the mutex.
>>
>> There is one wrinkle to this approach; late-initialized kfence allocates
>> it's pool from the buddy which may be block mapped. So we must hook that
>> allocation and convert it to pte-mappings up front. Previously this was
>> done as a side-effect of kfence protecting all the individual pages in
>> its pool at init-time, but this no longer works due to the added early
>> exit path in split_kernel_leaf_mapping().
>>
>> So instead, do this via the existing arch_kfence_init_pool() arch hook,
>> and reuse the existing linear_map_split_to_ptes() infrastructure. This
>> will now also be more efficient as a result.
>>
>> Closes: https://lore.kernel.org/all/f24b9032-0ec9-47b1-8b95-c0eeac7a31c5@roeck-us.net/
>> Fixes: a166563e7ec3 ("arm64: mm: support large block mapping when rodata=full")
>> Tested-by: Guenter Roeck <groeck@...gle.com>
>> Signed-off-by: Ryan Roberts <ryan.roberts@....com>
>> ---
>>
>> Hi All,
>>
>> This is a fuller fix than the suggestion I sent yesterday, and works correctly
>> with late-init kfence (thanks to Yang Shi for pointing that out).
>>
>> I've verified this on AmpereOne with CONFIG_DEBUG_PAGEALLOC and CONFIG_KFENCE
>> individually, and I've also forced it to take the linear_map_split_to_ptes() to
>> verify that I haven't broken it during the refactoring.
>>
>> I've kept Guenter's T-b since the early-init kfence path that he was testing is
>> unchanged.
>>
>> Assuming nobody spots any issues, I'fd like to get it into the next round of
>> arm64 bug fixes for this cycle.
>>
>> Thanks,
>> Ryan
>>
>>
>>  arch/arm64/include/asm/kfence.h |  4 +-
>>  arch/arm64/mm/mmu.c             | 92 +++++++++++++++++++++++----------
>>  2 files changed, 68 insertions(+), 28 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/kfence.h b/arch/arm64/include/asm/kfence.h
>> index a81937fae9f6..4a921e06d750 100644
>> --- a/arch/arm64/include/asm/kfence.h
>> +++ b/arch/arm64/include/asm/kfence.h
>> @@ -10,8 +10,6 @@
>>
>>  #include <asm/set_memory.h>
>>
>> -static inline bool arch_kfence_init_pool(void) { return true; }
>> -
>>  static inline bool kfence_protect_page(unsigned long addr, bool protect)
>>  {
>>  	set_memory_valid(addr, 1, !protect);
>> @@ -25,8 +23,10 @@ static inline bool arm64_kfence_can_set_direct_map(void)
>>  {
>>  	return !kfence_early_init;
>>  }
>> +bool arch_kfence_init_pool(void);
>>  #else /* CONFIG_KFENCE */
>>  static inline bool arm64_kfence_can_set_direct_map(void) { return false; }
>> +static inline bool arch_kfence_init_pool(void) { return false; }
> 
> Why do we need this for the !KFENCE case?

We don't, my bad. I'll remove.

> 
>>  #endif /* CONFIG_KFENCE */
>>
>>  #endif /* __ASM_KFENCE_H */
>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>> index b8d37eb037fc..0385e9b17ab0 100644
>> --- a/arch/arm64/mm/mmu.c
>> +++ b/arch/arm64/mm/mmu.c
>> @@ -708,6 +708,16 @@ static int split_kernel_leaf_mapping_locked(unsigned long addr)
>>  	return ret;
>>  }
>>
>> +static inline bool force_pte_mapping(void)
>> +{
>> +	bool bbml2 = system_capabilities_finalized() ?
>> +		system_supports_bbml2_noabort() : cpu_supports_bbml2_noabort();
>> +
>> +	return (!bbml2 && (rodata_full || arm64_kfence_can_set_direct_map() ||
>> +			   is_realm_world())) ||
>> +		debug_pagealloc_enabled();
>> +}
>> +
>>  static DEFINE_MUTEX(pgtable_split_lock);
>>
>>  int split_kernel_leaf_mapping(unsigned long start, unsigned long end)
>> @@ -723,6 +733,16 @@ int split_kernel_leaf_mapping(unsigned long start, unsigned long end)
>>  	if (!system_supports_bbml2_noabort())
>>  		return 0;
>>
>> +	/*
>> +	 * If the region is within a pte-mapped area, there is no need to try to
>> +	 * split. Additionally, CONFIG_DEBUG_PAGEALLOC and CONFIG_KFENCE may
>> +	 * change permissions from softirq context so for those cases (which are
>> +	 * always pte-mapped), we must not go any further because taking the
>> +	 * mutex below may sleep.
>> +	 */
>> +	if (force_pte_mapping() || is_kfence_address((void *)start))
>> +		return 0;
>> +
>>  	/*
>>  	 * Ensure start and end are at least page-aligned since this is the
>>  	 * finest granularity we can split to.
>> @@ -758,30 +778,30 @@ int split_kernel_leaf_mapping(unsigned long start, unsigned long end)
>>  	return ret;
>>  }
>>
>> -static int __init split_to_ptes_pud_entry(pud_t *pudp, unsigned long addr,
>> -					  unsigned long next,
>> -					  struct mm_walk *walk)
>> +static int split_to_ptes_pud_entry(pud_t *pudp, unsigned long addr,
>> +				   unsigned long next, struct mm_walk *walk)
>>  {
>> +	gfp_t gfp = *(gfp_t *)walk->private;
>>  	pud_t pud = pudp_get(pudp);
>>  	int ret = 0;
>>
>>  	if (pud_leaf(pud))
>> -		ret = split_pud(pudp, pud, GFP_ATOMIC, false);
>> +		ret = split_pud(pudp, pud, gfp, false);
>>
>>  	return ret;
>>  }
>>
>> -static int __init split_to_ptes_pmd_entry(pmd_t *pmdp, unsigned long addr,
>> -					  unsigned long next,
>> -					  struct mm_walk *walk)
>> +static int split_to_ptes_pmd_entry(pmd_t *pmdp, unsigned long addr,
>> +				   unsigned long next, struct mm_walk *walk)
>>  {
>> +	gfp_t gfp = *(gfp_t *)walk->private;
>>  	pmd_t pmd = pmdp_get(pmdp);
>>  	int ret = 0;
>>
>>  	if (pmd_leaf(pmd)) {
>>  		if (pmd_cont(pmd))
>>  			split_contpmd(pmdp);
>> -		ret = split_pmd(pmdp, pmd, GFP_ATOMIC, false);
>> +		ret = split_pmd(pmdp, pmd, gfp, false);
>>
>>  		/*
>>  		 * We have split the pmd directly to ptes so there is no need to
>> @@ -793,9 +813,8 @@ static int __init split_to_ptes_pmd_entry(pmd_t *pmdp, unsigned long addr,
>>  	return ret;
>>  }
>>
>> -static int __init split_to_ptes_pte_entry(pte_t *ptep, unsigned long addr,
>> -					  unsigned long next,
>> -					  struct mm_walk *walk)
>> +static int split_to_ptes_pte_entry(pte_t *ptep, unsigned long addr,
>> +				   unsigned long next, struct mm_walk *walk)
>>  {
>>  	pte_t pte = __ptep_get(ptep);
>>
>> @@ -805,12 +824,24 @@ static int __init split_to_ptes_pte_entry(pte_t *ptep, unsigned long addr,
>>  	return 0;
>>  }
>>
>> -static const struct mm_walk_ops split_to_ptes_ops __initconst = {
>> +static const struct mm_walk_ops split_to_ptes_ops = {
>>  	.pud_entry	= split_to_ptes_pud_entry,
>>  	.pmd_entry	= split_to_ptes_pmd_entry,
>>  	.pte_entry	= split_to_ptes_pte_entry,
>>  };
>>
>> +static int range_split_to_ptes(unsigned long start, unsigned long end, gfp_t gfp)
>> +{
>> +	int ret;
>> +
>> +	arch_enter_lazy_mmu_mode();
>> +	ret = walk_kernel_page_table_range_lockless(start, end,
>> +					&split_to_ptes_ops, NULL, &gfp);
>> +	arch_leave_lazy_mmu_mode();
> 
> Why are you entering/leaving lazy mode now? linear_map_split_to_ptes()
> calls flush_tlb_kernel_range() right after this so now it looks like
> we have more barriers than we need there.

Without the lazy mmu block, every write to every pte (or pmd/pud) will cause a
dsb and isb to be emitted. With the lazy mmu block, we only emit a single
dsb/isb at the end of the block.

linear_map_split_to_ptes() didn't previously have a lazy mmu block; that was an
oversight, I believe. So when refactoring I thought it made sense to make it
common for both cases.

Yes, the flush_tlb_kernel_range() also has the barriers, so the lazy mmu mode is
reducing from a gazillion barriers to 2. We could further optimize from 2 to 1,
but I doubt the performance improvement will be measurable.

Perhaps I've misunderstood your point...?

> 
>> +
>> +	return ret;
>> +}
>> +
>>  static bool linear_map_requires_bbml2 __initdata;
>>
>>  u32 idmap_kpti_bbml2_flag;
>> @@ -847,11 +878,9 @@ static int __init linear_map_split_to_ptes(void *__unused)
>>  		 * PTE. The kernel alias remains static throughout runtime so
>>  		 * can continue to be safely mapped with large mappings.
>>  		 */
>> -		ret = walk_kernel_page_table_range_lockless(lstart, kstart,
>> -						&split_to_ptes_ops, NULL, NULL);
>> +		ret = range_split_to_ptes(lstart, kstart, GFP_ATOMIC);
>>  		if (!ret)
>> -			ret = walk_kernel_page_table_range_lockless(kend, lend,
>> -						&split_to_ptes_ops, NULL, NULL);
>> +			ret = range_split_to_ptes(kend, lend, GFP_ATOMIC);
>>  		if (ret)
>>  			panic("Failed to split linear map\n");
>>  		flush_tlb_kernel_range(lstart, lend);
>> @@ -1002,6 +1031,27 @@ static void __init arm64_kfence_map_pool(phys_addr_t kfence_pool, pgd_t *pgdp)
>>  	memblock_clear_nomap(kfence_pool, KFENCE_POOL_SIZE);
>>  	__kfence_pool = phys_to_virt(kfence_pool);
>>  }
>> +
>> +bool arch_kfence_init_pool(void)
>> +{
>> +	unsigned long start = (unsigned long)__kfence_pool;
>> +	unsigned long end = start + KFENCE_POOL_SIZE;
>> +	int ret;
>> +
>> +	/* Exit early if we know the linear map is already pte-mapped. */
>> +	if (!system_supports_bbml2_noabort() || force_pte_mapping())
>> +		return true;
>> +
>> +	/* Kfence pool is already pte-mapped for the early init case. */
>> +	if (kfence_early_init)
>> +		return true;
>> +
>> +	mutex_lock(&pgtable_split_lock);
>> +	ret = range_split_to_ptes(start, end, GFP_PGTABLE_KERNEL);
>> +	mutex_unlock(&pgtable_split_lock);
>> +
>> +	return ret ? false : true;
> 
> return !ret

Yes good point - will fix in the next spin.

Thanks,
Ryan

> 
> Will