Message-ID: <42b3b394-a6ec-4d70-b5a1-941a0628cd25@arm.com>
Date: Thu, 4 Sep 2025 12:06:29 +0100
From: Ryan Roberts <ryan.roberts@....com>
To: Jinjiang Tu <tujinjiang@...wei.com>,
Catalin Marinas <catalin.marinas@....com>, Will Deacon <will@...nel.org>,
Andrew Morton <akpm@...ux-foundation.org>,
David Hildenbrand <david@...hat.com>,
Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
Yang Shi <yang@...amperecomputing.com>, Ard Biesheuvel <ardb@...nel.org>,
Dev Jain <dev.jain@....com>, scott@...amperecomputing.com, cl@...two.org
Cc: linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
linux-mm@...ck.org
Subject: Re: [PATCH v7 1/6] arm64: Enable permission change on arm64 kernel
block mappings
On 04/09/2025 04:40, Jinjiang Tu wrote:
>
> On 2025/8/29 19:52, Ryan Roberts wrote:
>> From: Dev Jain <dev.jain@....com>
>>
>> This patch paves the path to enable huge mappings in vmalloc space and
>> linear map space by default on arm64. For this we must ensure that we
>> can handle any permission games on the kernel (init_mm) pagetable.
>> Previously, __change_memory_common() used apply_to_page_range() which
>> does not support changing permissions for block mappings. We move away
>> from this by using the pagewalk API, similar to what riscv does right
>> now. It is the responsibility of the caller to ensure that the range
>> over which permissions are being changed falls on leaf mapping
>> boundaries. For systems with BBML2, this will be handled in future
>> patches by dynamically splitting the mappings when required.
>>
>> Unlike apply_to_page_range(), the pagewalk API currently enforces the
>> init_mm.mmap_lock to be held. To avoid the unnecessary bottleneck of the
>> mmap_lock for our usecase, this patch extends this generic API to be
>> used locklessly, so as to retain the existing behaviour for changing
>> permissions. Apart from this reason, it is noted at [1] that KFENCE can
>> manipulate kernel pgtable entries during softirqs. It does this by
>> calling set_memory_valid() -> __change_memory_common(). This being a
>> non-sleepable context, we cannot take the init_mm mmap lock.
>>
>> Add comments to highlight the conditions under which we can use the
>> lockless variant - no underlying VMA, and the user having exclusive
>> control over the range, thus guaranteeing no concurrent access.
>>
>> We require that the start and end of a given range do not partially
>> overlap block mappings, or cont mappings. Return -EINVAL in case a
>> partial block mapping is detected in any of the PGD/P4D/PUD/PMD levels;
>> add a corresponding comment in update_range_prot() to warn that
>> eliminating such a condition is the responsibility of the caller.
>>
>> Note that the pte level callback may change permissions for a whole
>> contpte block, and that will be done one pte at a time, as opposed to an
>> atomic operation for the block mappings. This is fine as any access will
>> decode either the old or the new permission until the TLBI.
>>
>> apply_to_page_range() currently performs all pte level callbacks while
>> in lazy mmu mode. Since arm64 can optimize performance by batching
>> barriers when modifying kernel pgtables in lazy mmu mode, we would like
>> to continue to benefit from this optimisation. Unfortunately
>> walk_kernel_page_table_range() does not use lazy mmu mode. However,
>> since the pagewalk framework is not allocating any memory, we can safely
>> bracket the whole operation inside lazy mmu mode ourselves. Therefore,
>> wrap the call to walk_kernel_page_table_range() with the lazy MMU
>> helpers.
>>
>> Link: https://lore.kernel.org/linux-arm-kernel/89d0ad18-4772-4d8f-ae8a-7c48d26a927e@....com/ [1]
>> Signed-off-by: Dev Jain <dev.jain@....com>
>> Reviewed-by: Ryan Roberts <ryan.roberts@....com>
>> Reviewed-by: Catalin Marinas <catalin.marinas@....com>
>> ---
>> arch/arm64/mm/pageattr.c | 153 +++++++++++++++++++++++++++++++--------
>> include/linux/pagewalk.h | 3 +
>> mm/pagewalk.c | 36 ++++++---
>> 3 files changed, 149 insertions(+), 43 deletions(-)
>>
>> diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
>> index 04d4a8f676db..6da8cbc32f46 100644
>> --- a/arch/arm64/mm/pageattr.c
>> +++ b/arch/arm64/mm/pageattr.c
>> @@ -8,6 +8,7 @@
>> #include <linux/mem_encrypt.h>
>> #include <linux/sched.h>
>> #include <linux/vmalloc.h>
>> +#include <linux/pagewalk.h>
>>
>> #include <asm/cacheflush.h>
>> #include <asm/pgtable-prot.h>
>> @@ -20,6 +21,99 @@ struct page_change_data {
>> pgprot_t clear_mask;
>> };
>>
>> +static ptdesc_t set_pageattr_masks(ptdesc_t val, struct mm_walk *walk)
>> +{
>> + struct page_change_data *masks = walk->private;
>> +
>> + val &= ~(pgprot_val(masks->clear_mask));
>> + val |= (pgprot_val(masks->set_mask));
>> +
>> + return val;
>> +}
>> +
>> +static int pageattr_pgd_entry(pgd_t *pgd, unsigned long addr,
>> + unsigned long next, struct mm_walk *walk)
>> +{
>> + pgd_t val = pgdp_get(pgd);
>> +
>> + if (pgd_leaf(val)) {
>> + if (WARN_ON_ONCE((next - addr) != PGDIR_SIZE))
>> + return -EINVAL;
>> + val = __pgd(set_pageattr_masks(pgd_val(val), walk));
>> + set_pgd(pgd, val);
>> + walk->action = ACTION_CONTINUE;
>> + }
>> +
>> + return 0;
>> +}
>> +
>> +static int pageattr_p4d_entry(p4d_t *p4d, unsigned long addr,
>> + unsigned long next, struct mm_walk *walk)
>> +{
>> + p4d_t val = p4dp_get(p4d);
>> +
>> + if (p4d_leaf(val)) {
>> + if (WARN_ON_ONCE((next - addr) != P4D_SIZE))
>> + return -EINVAL;
>> + val = __p4d(set_pageattr_masks(p4d_val(val), walk));
>> + set_p4d(p4d, val);
>> + walk->action = ACTION_CONTINUE;
>> + }
>> +
>> + return 0;
>> +}
>> +
>> +static int pageattr_pud_entry(pud_t *pud, unsigned long addr,
>> + unsigned long next, struct mm_walk *walk)
>> +{
>> + pud_t val = pudp_get(pud);
>> +
>> + if (pud_leaf(val)) {
>> + if (WARN_ON_ONCE((next - addr) != PUD_SIZE))
>> + return -EINVAL;
>> + val = __pud(set_pageattr_masks(pud_val(val), walk));
>> + set_pud(pud, val);
>> + walk->action = ACTION_CONTINUE;
>> + }
>> +
>> + return 0;
>> +}
>> +
>> +static int pageattr_pmd_entry(pmd_t *pmd, unsigned long addr,
>> + unsigned long next, struct mm_walk *walk)
>> +{
>> + pmd_t val = pmdp_get(pmd);
>> +
>> + if (pmd_leaf(val)) {
>> + if (WARN_ON_ONCE((next - addr) != PMD_SIZE))
>> + return -EINVAL;
>> + val = __pmd(set_pageattr_masks(pmd_val(val), walk));
>> + set_pmd(pmd, val);
>> + walk->action = ACTION_CONTINUE;
>> + }
>> +
>> + return 0;
>> +}
>
> #define pmd_leaf(pmd) (pmd_present(pmd) && !pmd_table(pmd))
> #define pmd_present(pmd) pte_present(pmd_pte(pmd))
> #define pte_present(pte) (pte_valid(pte) || pte_present_invalid(pte))
>
> When PTE_VALID is cleared, pmd_leaf() will return false. As a result, the
> caller won't be able to set PTE_VALID for the pmd.
Ouch - good spot!
So this will break set_direct_map_default_noflush(), set_memory_valid() and
__set_memory_enc_dec(). I wonder why it doesn't cause problems with mm selftests
- I'll investigate.
Anyway, I think the simplest solution is to replace the pmd_leaf()/pud_leaf()
checks with !pmd_table()/!pud_table() checks. (And I think we can entirely
remove the pgd and p4d callbacks, because pgd_leaf() and p4d_leaf() always
return false on arm64.) As long as no code ever clears PTE_VALID from a table
entry, I think this works.
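Roughly like this at the pmd level (just an untested sketch on top of this
patch, not a tested implementation; the pud callback would change in the same
way and the pgd/p4d callbacks would simply be dropped):

static int pageattr_pmd_entry(pmd_t *pmd, unsigned long addr,
			      unsigned long next, struct mm_walk *walk)
{
	pmd_t val = pmdp_get(pmd);

	/*
	 * Match any non-table entry, so that leaf entries whose PTE_VALID
	 * bit has been cleared (e.g. via set_direct_map_invalid_noflush())
	 * are still updated; pmd_leaf() would skip them. pmd_none() entries
	 * never reach this callback since the walker treats them as holes.
	 * Relies on PTE_VALID never being cleared from a table entry.
	 */
	if (!pmd_table(val)) {
		if (WARN_ON_ONCE((next - addr) != PMD_SIZE))
			return -EINVAL;
		val = __pmd(set_pageattr_masks(pmd_val(val), walk));
		set_pmd(pmd, val);
		walk->action = ACTION_CONTINUE;
	}

	return 0;
}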
Thanks,
Ryan
>
>> +
>> +static int pageattr_pte_entry(pte_t *pte, unsigned long addr,
>> + unsigned long next, struct mm_walk *walk)
>> +{
>> + pte_t val = __ptep_get(pte);
>> +
>> + val = __pte(set_pageattr_masks(pte_val(val), walk));
>> + __set_pte(pte, val);
>> +
>> + return 0;
>> +}
>> +
>> +static const struct mm_walk_ops pageattr_ops = {
>> + .pgd_entry = pageattr_pgd_entry,
>> + .p4d_entry = pageattr_p4d_entry,
>> + .pud_entry = pageattr_pud_entry,
>> + .pmd_entry = pageattr_pmd_entry,
>> + .pte_entry = pageattr_pte_entry,
>> +};
>> +
>> bool rodata_full __ro_after_init = IS_ENABLED(CONFIG_RODATA_FULL_DEFAULT_ENABLED);
>>
>> bool can_set_direct_map(void)
>> @@ -37,32 +131,35 @@ bool can_set_direct_map(void)
>> arm64_kfence_can_set_direct_map() || is_realm_world();
>> }
>>
>> -static int change_page_range(pte_t *ptep, unsigned long addr, void *data)
>> +static int update_range_prot(unsigned long start, unsigned long size,
>> + pgprot_t set_mask, pgprot_t clear_mask)
>> {
>> - struct page_change_data *cdata = data;
>> - pte_t pte = __ptep_get(ptep);
>> + struct page_change_data data;
>> + int ret;
>>
>> - pte = clear_pte_bit(pte, cdata->clear_mask);
>> - pte = set_pte_bit(pte, cdata->set_mask);
>> + data.set_mask = set_mask;
>> + data.clear_mask = clear_mask;
>>
>> - __set_pte(ptep, pte);
>> - return 0;
>> + arch_enter_lazy_mmu_mode();
>> +
>> + /*
>> + * The caller must ensure that the range we are operating on does not
>> + * partially overlap a block mapping, or a cont mapping. Any such case
>> + * must be eliminated by splitting the mapping.
>> + */
>> + ret = walk_kernel_page_table_range_lockless(start, start + size,
>> + &pageattr_ops, NULL, &data);
>> + arch_leave_lazy_mmu_mode();
>> +
>> + return ret;
>> }
>>
>> -/*
>> - * This function assumes that the range is mapped with PAGE_SIZE pages.
>> - */
>> static int __change_memory_common(unsigned long start, unsigned long size,
>> - pgprot_t set_mask, pgprot_t clear_mask)
>> + pgprot_t set_mask, pgprot_t clear_mask)
>> {
>> - struct page_change_data data;
>> int ret;
>>
>> - data.set_mask = set_mask;
>> - data.clear_mask = clear_mask;
>> -
>> - ret = apply_to_page_range(&init_mm, start, size, change_page_range,
>> - &data);
>> + ret = update_range_prot(start, size, set_mask, clear_mask);
>>
>> /*
>> * If the memory is being made valid without changing any other bits
>> @@ -174,32 +271,26 @@ int set_memory_valid(unsigned long addr, int numpages, int enable)
>>
>> int set_direct_map_invalid_noflush(struct page *page)
>> {
>> - struct page_change_data data = {
>> - .set_mask = __pgprot(0),
>> - .clear_mask = __pgprot(PTE_VALID),
>> - };
>> + pgprot_t clear_mask = __pgprot(PTE_VALID);
>> + pgprot_t set_mask = __pgprot(0);
>>
>> if (!can_set_direct_map())
>> return 0;
>>
>> - return apply_to_page_range(&init_mm,
>> - (unsigned long)page_address(page),
>> - PAGE_SIZE, change_page_range, &data);
>> + return update_range_prot((unsigned long)page_address(page),
>> + PAGE_SIZE, set_mask, clear_mask);
>> }
>>
>> int set_direct_map_default_noflush(struct page *page)
>> {
>> - struct page_change_data data = {
>> - .set_mask = __pgprot(PTE_VALID | PTE_WRITE),
>> - .clear_mask = __pgprot(PTE_RDONLY),
>> - };
>> + pgprot_t set_mask = __pgprot(PTE_VALID | PTE_WRITE);
>> + pgprot_t clear_mask = __pgprot(PTE_RDONLY);
>>
>> if (!can_set_direct_map())
>> return 0;
>>
>> - return apply_to_page_range(&init_mm,
>> - (unsigned long)page_address(page),
>> - PAGE_SIZE, change_page_range, &data);
>> + return update_range_prot((unsigned long)page_address(page),
>> + PAGE_SIZE, set_mask, clear_mask);
>> }
>>
>> static int __set_memory_enc_dec(unsigned long addr,
>> diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
>> index 682472c15495..88e18615dd72 100644
>> --- a/include/linux/pagewalk.h
>> +++ b/include/linux/pagewalk.h
>> @@ -134,6 +134,9 @@ int walk_page_range(struct mm_struct *mm, unsigned long start,
>> int walk_kernel_page_table_range(unsigned long start,
>> unsigned long end, const struct mm_walk_ops *ops,
>> pgd_t *pgd, void *private);
>> +int walk_kernel_page_table_range_lockless(unsigned long start,
>> + unsigned long end, const struct mm_walk_ops *ops,
>> + pgd_t *pgd, void *private);
>> int walk_page_range_vma(struct vm_area_struct *vma, unsigned long start,
>> unsigned long end, const struct mm_walk_ops *ops,
>> void *private);
>> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
>> index 648038247a8d..936689d8bcac 100644
>> --- a/mm/pagewalk.c
>> +++ b/mm/pagewalk.c
>> @@ -606,10 +606,32 @@ int walk_page_range(struct mm_struct *mm, unsigned long start,
>> int walk_kernel_page_table_range(unsigned long start, unsigned long end,
>> const struct mm_walk_ops *ops, pgd_t *pgd, void *private)
>> {
>> - struct mm_struct *mm = &init_mm;
>> + /*
>> + * Kernel intermediate page tables are usually not freed, so the mmap
>> + * read lock is sufficient. But there are some exceptions.
>> + * E.g. memory hot-remove. In which case, the mmap lock is insufficient
>> + * to prevent the intermediate kernel pages tables belonging to the
>> + * specified address range from being freed. The caller should take
>> + * other actions to prevent this race.
>> + */
>> + mmap_assert_locked(&init_mm);
>> +
>> + return walk_kernel_page_table_range_lockless(start, end, ops, pgd,
>> + private);
>> +}
>> +
>> +/*
>> + * Use this function to walk the kernel page tables locklessly. It should be
>> + * guaranteed that the caller has exclusive access over the range they are
>> + * operating on - that there should be no concurrent access, for example,
>> + * changing permissions for vmalloc objects.
>> + */
>> +int walk_kernel_page_table_range_lockless(unsigned long start, unsigned long end,
>> + const struct mm_walk_ops *ops, pgd_t *pgd, void *private)
>> +{
>> struct mm_walk walk = {
>> .ops = ops,
>> - .mm = mm,
>> + .mm = &init_mm,
>> .pgd = pgd,
>> .private = private,
>> .no_vma = true
>> @@ -620,16 +642,6 @@ int walk_kernel_page_table_range(unsigned long start, unsigned long end,
>> if (!check_ops_valid(ops))
>> return -EINVAL;
>>
>> - /*
>> - * Kernel intermediate page tables are usually not freed, so the mmap
>> - * read lock is sufficient. But there are some exceptions.
>> - * E.g. memory hot-remove. In which case, the mmap lock is insufficient
>> - * to prevent the intermediate kernel pages tables belonging to the
>> - * specified address range from being freed. The caller should take
>> - * other actions to prevent this race.
>> - */
>> - mmap_assert_locked(mm);
>> -
>> return walk_pgd_range(start, end, &walk);
>> }
>>