Message-ID: <db04bd02-0090-4aff-bb2e-0d1e023757a5@lucifer.local>
Date: Tue, 10 Jun 2025 13:07:00 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Dev Jain <dev.jain@....com>
Cc: akpm@...ux-foundation.org, david@...hat.com, catalin.marinas@....com,
will@...nel.org, Liam.Howlett@...cle.com, vbabka@...e.cz,
rppt@...nel.org, surenb@...gle.com, mhocko@...e.com,
linux-mm@...ck.org, linux-kernel@...r.kernel.org,
suzuki.poulose@....com, steven.price@....com, gshan@...hat.com,
linux-arm-kernel@...ts.infradead.org, yang@...amperecomputing.com,
ryan.roberts@....com, anshuman.khandual@....com
Subject: Re: [PATCH v2 1/2] mm: Allow lockless kernel pagetable walking
OK so I think the best solution here is to just update check_ops_valid(), which
was kind of sucky anyway (we check it everywhere except walk_page_range_mm() to
enforce the install_pte restriction).
Let's do something like:
#define OPS_MAY_INSTALL_PTE	(1 << 0)
#define OPS_MAY_AVOID_LOCK	(1 << 1)
and update check_ops_valid() to take a flags (or maybe 'capabilities') parameter.
Then check based on this e.g.:
	if (ops->install_pte && !(capabilities & OPS_MAY_INSTALL_PTE))
		return false;
	if (ops->walk_lock == PGWALK_NOLOCK && !(capabilities & OPS_MAY_AVOID_LOCK))
		return false;
	return true;
Then update the callers, most of which can simply pass 0 for this parameter,
with walk_page_range_mm() setting OPS_MAY_INSTALL_PTE and
walk_kernel_page_table_range() setting OPS_MAY_AVOID_LOCK.
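Putting it together, check_ops_valid() might look something like this (untested
sketch, and 'capabilities' is just a suggested name):

	static bool check_ops_valid(const struct mm_walk_ops *ops,
				    unsigned long capabilities)
	{
		/*
		 * install_pte is only supported by internal callers that can
		 * guarantee page table integrity, i.e. walk_page_range_mm().
		 */
		if (ops->install_pte && !(capabilities & OPS_MAY_INSTALL_PTE))
			return false;

		/*
		 * Lockless walks are only permitted for kernel page tables,
		 * via walk_kernel_page_table_range().
		 */
		if (ops->walk_lock == PGWALK_NOLOCK &&
		    !(capabilities & OPS_MAY_AVOID_LOCK))
			return false;

		return true;
	}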
That way we check it all in one place, it's consistent, we no longer
'implicitly' skip the check for walk_page_range_mm(), and all is neater.
We do end up calling this predicate twice for walk_page_range(), so we could
(should?) also make walk_page_range_mm() into a static __walk_page_range_mm()
and have a separate walk_page_range_mm() that does this check.
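Something like this (again an untested sketch, keeping the existing signatures):

	static int __walk_page_range_mm(struct mm_struct *mm, unsigned long start,
					unsigned long end,
					const struct mm_walk_ops *ops,
					void *private)
	{
		/* ... existing walk_page_range_mm() body, sans ops check ... */
	}

	int walk_page_range_mm(struct mm_struct *mm, unsigned long start,
			       unsigned long end, const struct mm_walk_ops *ops,
			       void *private)
	{
		if (!check_ops_valid(ops, OPS_MAY_INSTALL_PTE))
			return -EINVAL;

		return __walk_page_range_mm(mm, start, end, ops, private);
	}

Then walk_page_range() can do its own check_ops_valid(ops, 0) and call
__walk_page_range_mm() directly, so the predicate only runs once per walk.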
I think this will solve the interface issues I've raised below.
Thanks!
On Tue, Jun 10, 2025 at 05:14:00PM +0530, Dev Jain wrote:
> arm64 currently changes permissions on vmalloc objects locklessly, via
> apply_to_page_range. Patch 2 moves away from this to use the pagewalk API,
> since a limitation of the former is that it cannot change permissions for
> block mappings. However, the API currently enforces that init_mm.mmap_lock
> be held. To avoid the unnecessary bottleneck of the mmap_lock for our use
> case, this patch extends this generic API to be usable locklessly, so as to
> retain the existing behaviour for changing permissions. Apart from this
> reason, it is noted at [1] that KFENCE can manipulate kernel pgtable entries
> during softirqs, by calling set_memory_valid() -> __change_memory_common().
> This being a non-sleepable context, we cannot take the init_mm mmap lock.
>
> Since such an extension can potentially be dangerous for other callers
> consuming the pagewalk API, explicitly disallow lockless traversal for
> userspace pagetables by returning -EINVAL. Add comments to highlight the
> conditions under which we can use the API locklessly - no underlying VMA,
> and the user having exclusive control over the range, thus guaranteeing no
> concurrent access.
>
> Signed-off-by: Dev Jain <dev.jain@....com>
> ---
> include/linux/pagewalk.h | 7 +++++++
> mm/pagewalk.c | 23 ++++++++++++++++++-----
> 2 files changed, 25 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
> index 8ac2f6d6d2a3..5efd6541239b 100644
> --- a/include/linux/pagewalk.h
> +++ b/include/linux/pagewalk.h
> @@ -14,6 +14,13 @@ enum page_walk_lock {
> PGWALK_WRLOCK = 1,
> /* vma is expected to be already write-locked during the walk */
> PGWALK_WRLOCK_VERIFY = 2,
> + /*
> + * Walk without any lock. Use of this is only meant for the
> + * case where there is no underlying VMA, and the user has
> + * exclusive control over the range, guaranteeing no concurrent
> + * access. For example, changing permissions of vmalloc objects.
> + */
> + PGWALK_NOLOCK = 3,
Thanks for the comment! This is good.
> };
>
> /**
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index ff5299eca687..d55d933f84ec 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -417,13 +417,17 @@ static int __walk_page_range(unsigned long start, unsigned long end,
> return err;
> }
>
> -static inline void process_mm_walk_lock(struct mm_struct *mm,
> +static inline bool process_mm_walk_lock(struct mm_struct *mm,
> enum page_walk_lock walk_lock)
I don't like this signature at all - you don't describe what it does, and now it
returns... whether it was not locked? I think this might lead to confusion :)
> {
> + if (walk_lock == PGWALK_NOLOCK)
> + return 1;
It's 2025, return true please :)
> +
> if (walk_lock == PGWALK_RDLOCK)
> mmap_assert_locked(mm);
> else
> mmap_assert_write_locked(mm);
> + return 0;
It's 2025, return false please :)
> }
>
> static inline void process_vma_walk_lock(struct vm_area_struct *vma,
> @@ -440,6 +444,8 @@ static inline void process_vma_walk_lock(struct vm_area_struct *vma,
> case PGWALK_RDLOCK:
> /* PGWALK_RDLOCK is handled by process_mm_walk_lock */
> break;
> + case PGWALK_NOLOCK:
> + break;
Under what circumstances would we be fine with this function being invoked with
no lock being specified?
Isn't it the case that the one situation in which we can specify PGWALK_NOLOCK
won't ever invoke this? Or did I miss a call to this function?
If not, we should make this a VM_WARN_ON_ONCE(1);
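I.e. something like:

	case PGWALK_NOLOCK:
		/* Only kernel page table walks use this, and they never get here. */
		VM_WARN_ON_ONCE(1);
		break;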
> }
> #endif
> }
> @@ -470,7 +476,8 @@ int walk_page_range_mm(struct mm_struct *mm, unsigned long start,
> if (!walk.mm)
> return -EINVAL;
>
> - process_mm_walk_lock(walk.mm, ops->walk_lock);
> + if (process_mm_walk_lock(walk.mm, ops->walk_lock))
> + return -EINVAL;
This is just weird - you're treating the return value as if it were an error
code (no, it's a boolean), and the function's name doesn't explain the 'verb' of
what's happening. It's just confusing.
Obviously I'm belabouring the point a bit, see suggestion at top :)
>
> vma = find_vma(walk.mm, start);
> do {
> @@ -626,8 +633,12 @@ int walk_kernel_page_table_range(unsigned long start, unsigned long end,
> * to prevent the intermediate kernel pages tables belonging to the
> * specified address range from being freed. The caller should take
> * other actions to prevent this race.
> + *
> + * If the caller can guarantee that it has exclusive access to the
> + * specified address range, only then it can use PGWALK_NOLOCK.
> */
> - mmap_assert_locked(mm);
> + if (ops->walk_lock != PGWALK_NOLOCK)
> + mmap_assert_locked(mm);
>
> return walk_pgd_range(start, end, &walk);
> }
> @@ -699,7 +710,8 @@ int walk_page_range_vma(struct vm_area_struct *vma, unsigned long start,
> if (!check_ops_valid(ops))
> return -EINVAL;
>
> - process_mm_walk_lock(walk.mm, ops->walk_lock);
> + if (process_mm_walk_lock(walk.mm, ops->walk_lock))
> + return -EINVAL;
> process_vma_walk_lock(vma, ops->walk_lock);
> return __walk_page_range(start, end, &walk);
> }
> @@ -719,7 +731,8 @@ int walk_page_vma(struct vm_area_struct *vma, const struct mm_walk_ops *ops,
> if (!check_ops_valid(ops))
> return -EINVAL;
>
> - process_mm_walk_lock(walk.mm, ops->walk_lock);
> + if (process_mm_walk_lock(walk.mm, ops->walk_lock))
> + return -EINVAL;
> process_vma_walk_lock(vma, ops->walk_lock);
> return __walk_page_range(vma->vm_start, vma->vm_end, &walk);
> }
> --
> 2.30.2
>
Thanks!