Message-ID: <1e89f07c-7620-4f9b-b647-c585824ce015@arm.com>
Date: Tue, 10 Jun 2025 18:10:03 +0530
From: Dev Jain <dev.jain@....com>
To: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
Cc: akpm@...ux-foundation.org, david@...hat.com, catalin.marinas@....com,
 will@...nel.org, Liam.Howlett@...cle.com, vbabka@...e.cz, rppt@...nel.org,
 surenb@...gle.com, mhocko@...e.com, linux-mm@...ck.org,
 linux-kernel@...r.kernel.org, suzuki.poulose@....com, steven.price@....com,
 gshan@...hat.com, linux-arm-kernel@...ts.infradead.org,
 yang@...amperecomputing.com, ryan.roberts@....com, anshuman.khandual@....com
Subject: Re: [PATCH v2 1/2] mm: Allow lockless kernel pagetable walking


On 10/06/25 5:37 pm, Lorenzo Stoakes wrote:
> OK so I think the best solution here is to just update check_ops_valid(), which
> was kind of sucky anyway (we check everywhere but walk_page_range_mm() to
> enforce the install pte thing).
>
> Let's do something like:
>
> #define OPS_MAY_INSTALL_PTE	(1<<0)
> #define OPS_MAY_AVOID_LOCK	(1<<1)
>
> and update check_ops_valid() to take a flags or maybe 'capabilities' field.
>
> Then check based on this e.g.:
>
> if (ops->install_pte && !(capabilities & OPS_MAY_INSTALL_PTE))
> 	return false;
>
> if (ops->walk_lock == PGWALK_NOLOCK && !(capabilities & OPS_MAY_AVOID_LOCK))
> 	return false;
>
> return true;
>
> Then update callers, most of whom can have '0' passed for this field, with
> walk_page_range_mm() setting OPS_MAY_INSTALL_PTE and
> walk_kernel_page_table_range() setting OPS_MAY_AVOID_LOCK.
>
> That way we check it all in one place, it's consistent, we no longer 'implicitly'
> skip the check for walk_page_range_mm(), and all is neater.
>
> We do end up calling this predicate twice for walk_page_range(), so we could
> (should?) also make walk_page_range_mm() into a static __walk_page_range_mm()
> and have a separate walk_page_range_mm() that does this check.
>
> I think this will solve the interface issues I've raised below.

Makes sense, thank you for your suggestions.

>
> Thanks!
>
> On Tue, Jun 10, 2025 at 05:14:00PM +0530, Dev Jain wrote:
>> arm64 currently changes permissions on vmalloc objects locklessly, via
>> apply_to_page_range. Patch 2 moves away from this to use the pagewalk API,
>> since a limitation of the former is that it cannot change permissions for
>> block mappings. However, the API currently requires the init_mm.mmap_lock to
>> be held. To avoid the unnecessary bottleneck of the mmap_lock for our usecase,
>> this patch extends this generic API to be used locklessly, so as to retain
>> the existing behaviour for changing permissions. Apart from this reason,
>> it is noted at [1] that KFENCE can manipulate kernel pgtable entries during
>> softirqs. It does this by calling set_memory_valid() -> __change_memory_common().
>> This being a non-sleepable context, we cannot take the init_mm mmap lock.
>>
>> Since such an extension can potentially be dangerous for other callers
>> consuming the pagewalk API, explicitly disallow lockless traversal for
>> userspace pagetables by returning -EINVAL. Add comments to highlight the
>> conditions under which we can use the API locklessly - no underlying VMA,
>> and the user having exclusive control over the range, thus guaranteeing no
>> concurrent access.
>>
>> Signed-off-by: Dev Jain <dev.jain@....com>
>> ---
>>   include/linux/pagewalk.h |  7 +++++++
>>   mm/pagewalk.c            | 23 ++++++++++++++++++-----
>>   2 files changed, 25 insertions(+), 5 deletions(-)
>>
>> diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
>> index 8ac2f6d6d2a3..5efd6541239b 100644
>> --- a/include/linux/pagewalk.h
>> +++ b/include/linux/pagewalk.h
>> @@ -14,6 +14,13 @@ enum page_walk_lock {
>>   	PGWALK_WRLOCK = 1,
>>   	/* vma is expected to be already write-locked during the walk */
>>   	PGWALK_WRLOCK_VERIFY = 2,
>> +	/*
>> +	 * Walk without any lock. Use of this is only meant for the
>> +	 * case where there is no underlying VMA, and the user has
>> +	 * exclusive control over the range, guaranteeing no concurrent
>> +	 * access. For example, changing permissions of vmalloc objects.
>> +	 */
>> +	PGWALK_NOLOCK = 3,
> Thanks for the comment! This is good.
>
>>   };
>>
>>   /**
>> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
>> index ff5299eca687..d55d933f84ec 100644
>> --- a/mm/pagewalk.c
>> +++ b/mm/pagewalk.c
>> @@ -417,13 +417,17 @@ static int __walk_page_range(unsigned long start, unsigned long end,
>>   	return err;
>>   }
>>
>> -static inline void process_mm_walk_lock(struct mm_struct *mm,
>> +static inline bool process_mm_walk_lock(struct mm_struct *mm,
>>   					enum page_walk_lock walk_lock)
> I don't like this signature at all, you don't describe what it does, and now it
> returns... whether it was not locked? I think this might lead to confusion :)
>
>
>>   {
>> +	if (walk_lock == PGWALK_NOLOCK)
>> +		return 1;
> It's 2025, return true please :)
>
>> +
>>   	if (walk_lock == PGWALK_RDLOCK)
>>   		mmap_assert_locked(mm);
>>   	else
>>   		mmap_assert_write_locked(mm);
>> +	return 0;
> It's 2025, return false please :)
>
>>   }
>>
>>   static inline void process_vma_walk_lock(struct vm_area_struct *vma,
>> @@ -440,6 +444,8 @@ static inline void process_vma_walk_lock(struct vm_area_struct *vma,
>>   	case PGWALK_RDLOCK:
>>   		/* PGWALK_RDLOCK is handled by process_mm_walk_lock */
>>   		break;
>> +	case PGWALK_NOLOCK:
>> +		break;
> Under what circumstances would we be fine with this function being invoked with
> no lock being specified?
>
> Isn't it the case that the one situation in which we can specify PGWALK_NOLOCK
> won't ever invoke this? Or did I miss a call of this function?
>
> If not, we should make this a VM_WARN_ON_ONCE(1);

I was made aware that there is a quest to remove BUG_ON()'s in the kernel, see [1].
Is there a similar problem with VM_WARN_ON_ONCE()?

[1]: https://lore.kernel.org/all/053ae9ec-1113-4ed8-9625-adf382070bc5@redhat.com/

>
>>   	}
>>   #endif
>>   }
>> @@ -470,7 +476,8 @@ int walk_page_range_mm(struct mm_struct *mm, unsigned long start,
>>   	if (!walk.mm)
>>   		return -EINVAL;
>>
>> -	process_mm_walk_lock(walk.mm, ops->walk_lock);
>> +	if (process_mm_walk_lock(walk.mm, ops->walk_lock))
>> +		return -EINVAL;
> This is just weird, you're treating the return as if it were an error value (no,
> it's a boolean), the name of the function doesn't explain the 'verb' of what's
> happening, it's just confusing.
>
> Obviously I'm belabouring the point a bit, see suggestion at top :)
>
>>   	vma = find_vma(walk.mm, start);
>>   	do {
>> @@ -626,8 +633,12 @@ int walk_kernel_page_table_range(unsigned long start, unsigned long end,
>>   	 * to prevent the intermediate kernel pages tables belonging to the
>>   	 * specified address range from being freed. The caller should take
>>   	 * other actions to prevent this race.
>> +	 *
>> +	 * If the caller can guarantee that it has exclusive access to the
>> +	 * specified address range, only then it can use PGWALK_NOLOCK.
>>   	 */
>> -	mmap_assert_locked(mm);
>> +	if (ops->walk_lock != PGWALK_NOLOCK)
>> +		mmap_assert_locked(mm);
>>
>>   	return walk_pgd_range(start, end, &walk);
>>   }
>> @@ -699,7 +710,8 @@ int walk_page_range_vma(struct vm_area_struct *vma, unsigned long start,
>>   	if (!check_ops_valid(ops))
>>   		return -EINVAL;
>>
>> -	process_mm_walk_lock(walk.mm, ops->walk_lock);
>> +	if (process_mm_walk_lock(walk.mm, ops->walk_lock))
>> +		return -EINVAL;
>>   	process_vma_walk_lock(vma, ops->walk_lock);
>>   	return __walk_page_range(start, end, &walk);
>>   }
>> @@ -719,7 +731,8 @@ int walk_page_vma(struct vm_area_struct *vma, const struct mm_walk_ops *ops,
>>   	if (!check_ops_valid(ops))
>>   		return -EINVAL;
>>
>> -	process_mm_walk_lock(walk.mm, ops->walk_lock);
>> +	if (process_mm_walk_lock(walk.mm, ops->walk_lock))
>> +		return -EINVAL;
>>   	process_vma_walk_lock(vma, ops->walk_lock);
>>   	return __walk_page_range(vma->vm_start, vma->vm_end, &walk);
>>   }
>> --
>> 2.30.2
>>
> Thanks!
