[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <8077b321-8953-46aa-b06d-95c91823e6ce@lucifer.local>
Date: Fri, 13 Jun 2025 17:27:27 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Dev Jain <dev.jain@....com>
Cc: akpm@...ux-foundation.org, david@...hat.com, catalin.marinas@....com,
will@...nel.org, Liam.Howlett@...cle.com, vbabka@...e.cz,
rppt@...nel.org, surenb@...gle.com, mhocko@...e.com,
linux-mm@...ck.org, linux-kernel@...r.kernel.org,
suzuki.poulose@....com, steven.price@....com, gshan@...hat.com,
linux-arm-kernel@...ts.infradead.org, yang@...amperecomputing.com,
ryan.roberts@....com, anshuman.khandual@....com
Subject: Re: [PATCH v3 1/2] arm64: pageattr: Use pagewalk API to change
memory permissions
On Fri, Jun 13, 2025 at 07:13:51PM +0530, Dev Jain wrote:
> arm64 currently changes permissions on vmalloc objects locklessly, via
> apply_to_page_range, whose limitation is to deny changing permissions for
> block mappings. Therefore, we move away to use the generic pagewalk API,
> thus paving the path for enabling huge mappings by default on kernel space
> mappings, thus leading to more efficient TLB usage. However, the API
> currently enforces the init_mm.mmap_lock to be held. To avoid the
> unnecessary bottleneck of the mmap_lock for our usecase, this patch
> extends this generic API to be used locklessly, so as to retain the
> existing behaviour for changing permissions. Apart from this reason, it is
> noted at [1] that KFENCE can manipulate kernel pgtable entries during
> softirqs. It does this by calling set_memory_valid() -> __change_memory_common().
> This being a non-sleepable context, we cannot take the init_mm mmap lock.
>
> Add comments to highlight the conditions under which we can use the
> lockless variant - no underlying VMA, and the user having exclusive control
> over the range, thus guaranteeing no concurrent access.
>
> Since arm64 cannot handle kernel live mapping splitting without BBML2,
> we require that the start and end of a given range lie on block mapping
> boundaries. Return -EINVAL in case a partial block mapping is detected;
> add a corresponding comment in ___change_memory_common() to warn that
> eliminating such a condition is the responsibility of the caller.
>
> apply_to_page_range() currently performs all pte level callbacks while in
> lazy mmu mode. Since arm64 can optimize performance by batching barriers
> when modifying kernel pgtables in lazy mmu mode, we would like to continue
> to benefit from this optimisation. Unfortunately walk_kernel_page_table_range()
> does not use lazy mmu mode. However, since the pagewalk framework is not
> allocating any memory, we can safely bracket the whole operation inside
> lazy mmu mode ourselves. Therefore, wrap the call to
> walk_kernel_page_table_range() with the lazy MMU helpers.
Thanks this is a great commit message!
>
> [1] https://lore.kernel.org/linux-arm-kernel/89d0ad18-4772-4d8f-ae8a-7c48d26a927e@arm.com/
>
> Signed-off-by: Dev Jain <dev.jain@....com>
> ---
> arch/arm64/mm/pageattr.c | 157 +++++++++++++++++++++++++++++++--------
> include/linux/pagewalk.h | 3 +
> mm/pagewalk.c | 26 +++++++
> 3 files changed, 154 insertions(+), 32 deletions(-)
>
> diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
> index 04d4a8f676db..cfc5279f27a2 100644
> --- a/arch/arm64/mm/pageattr.c
> +++ b/arch/arm64/mm/pageattr.c
> @@ -8,6 +8,7 @@
> #include <linux/mem_encrypt.h>
> #include <linux/sched.h>
> #include <linux/vmalloc.h>
> +#include <linux/pagewalk.h>
>
> #include <asm/cacheflush.h>
> #include <asm/pgtable-prot.h>
> @@ -20,6 +21,99 @@ struct page_change_data {
> pgprot_t clear_mask;
> };
>
> +static ptdesc_t set_pageattr_masks(ptdesc_t val, struct mm_walk *walk)
> +{
> + struct page_change_data *masks = walk->private;
> +
> + val &= ~(pgprot_val(masks->clear_mask));
> + val |= (pgprot_val(masks->set_mask));
> +
> + return val;
> +}
> +
> +static int pageattr_pgd_entry(pgd_t *pgd, unsigned long addr,
> + unsigned long next, struct mm_walk *walk)
> +{
> + pgd_t val = pgdp_get(pgd);
> +
> + if (pgd_leaf(val)) {
Does arm have PGD entry level leaves? :) just curious :P
> + if (WARN_ON_ONCE((next - addr) != PGDIR_SIZE))
> + return -EINVAL;
I guess the point here is to assert that the searched range _entirely
spans_ the folio that the higher order leaf page table entry describes.
I'm guessing this is desired.
But I'm not sure this should be a warning?
What if you happen to walk a range that isn't aligned like this?
(Same comment goes for the other instances of the same pattern)
> + val = __pgd(set_pageattr_masks(pgd_val(val), walk));
> + set_pgd(pgd, val);
> + walk->action = ACTION_CONTINUE;
> + }
> +
> + return 0;
> +}
> +
> +static int pageattr_p4d_entry(p4d_t *p4d, unsigned long addr,
> + unsigned long next, struct mm_walk *walk)
> +{
> + p4d_t val = p4dp_get(p4d);
> +
> + if (p4d_leaf(val)) {
> + if (WARN_ON_ONCE((next - addr) != P4D_SIZE))
> + return -EINVAL;
> + val = __p4d(set_pageattr_masks(p4d_val(val), walk));
> + set_p4d(p4d, val);
> + walk->action = ACTION_CONTINUE;
> + }
> +
> + return 0;
> +}
> +
> +static int pageattr_pud_entry(pud_t *pud, unsigned long addr,
> + unsigned long next, struct mm_walk *walk)
> +{
> + pud_t val = pudp_get(pud);
> +
> + if (pud_leaf(val)) {
> + if (WARN_ON_ONCE((next - addr) != PUD_SIZE))
> + return -EINVAL;
> + val = __pud(set_pageattr_masks(pud_val(val), walk));
> + set_pud(pud, val);
> + walk->action = ACTION_CONTINUE;
> + }
> +
> + return 0;
> +}
> +
> +static int pageattr_pmd_entry(pmd_t *pmd, unsigned long addr,
> + unsigned long next, struct mm_walk *walk)
> +{
> + pmd_t val = pmdp_get(pmd);
> +
> + if (pmd_leaf(val)) {
> + if (WARN_ON_ONCE((next - addr) != PMD_SIZE))
> + return -EINVAL;
> + val = __pmd(set_pageattr_masks(pmd_val(val), walk));
> + set_pmd(pmd, val);
> + walk->action = ACTION_CONTINUE;
> + }
> +
> + return 0;
> +}
> +
> +static int pageattr_pte_entry(pte_t *pte, unsigned long addr,
> + unsigned long next, struct mm_walk *walk)
> +{
> + pte_t val = __ptep_get(pte);
> +
> + val = __pte(set_pageattr_masks(pte_val(val), walk));
> + __set_pte(pte, val);
> +
> + return 0;
> +}
> +
> +static const struct mm_walk_ops pageattr_ops = {
> + .pgd_entry = pageattr_pgd_entry,
> + .p4d_entry = pageattr_p4d_entry,
> + .pud_entry = pageattr_pud_entry,
> + .pmd_entry = pageattr_pmd_entry,
> + .pte_entry = pageattr_pte_entry,
> +};
> +
> bool rodata_full __ro_after_init = IS_ENABLED(CONFIG_RODATA_FULL_DEFAULT_ENABLED);
>
> bool can_set_direct_map(void)
> @@ -37,22 +131,7 @@ bool can_set_direct_map(void)
> arm64_kfence_can_set_direct_map() || is_realm_world();
> }
>
> -static int change_page_range(pte_t *ptep, unsigned long addr, void *data)
> -{
> - struct page_change_data *cdata = data;
> - pte_t pte = __ptep_get(ptep);
> -
> - pte = clear_pte_bit(pte, cdata->clear_mask);
> - pte = set_pte_bit(pte, cdata->set_mask);
> -
> - __set_pte(ptep, pte);
> - return 0;
> -}
> -
> -/*
> - * This function assumes that the range is mapped with PAGE_SIZE pages.
> - */
> -static int __change_memory_common(unsigned long start, unsigned long size,
> +static int ___change_memory_common(unsigned long start, unsigned long size,
> pgprot_t set_mask, pgprot_t clear_mask)
I wouldn't presume to comment on conventions in arm64 arch code, but that's
a lot of underscores :P
I wonder if this is the best name for it as you're not only invoking it
from __change_memory_common()?
And yes this is a pedantic comment, I realise :)
> {
> struct page_change_data data;
> @@ -61,9 +140,28 @@ static int __change_memory_common(unsigned long start, unsigned long size,
> data.set_mask = set_mask;
> data.clear_mask = clear_mask;
>
> - ret = apply_to_page_range(&init_mm, start, size, change_page_range,
> - &data);
> + arch_enter_lazy_mmu_mode();
> +
> + /*
> + * The caller must ensure that the range we are operating on does not
> + * partially overlap a block mapping. Any such case should either not
> + * exist, or must be eliminated by splitting the mapping - which for
> + * kernel mappings can be done only on BBML2 systems.
> + *
EVEN MORE RIDICULOUS NIT: extra line in comment here!
> + */
OK so this probably answers the comment above re: the warnings.
> + ret = walk_kernel_page_table_range_lockless(start, start + size,
> + &pageattr_ops, NULL, &data);
> + arch_leave_lazy_mmu_mode();
> +
> + return ret;
> +}
>
> +static int __change_memory_common(unsigned long start, unsigned long size,
> + pgprot_t set_mask, pgprot_t clear_mask)
> +{
> + int ret;
> +
> + ret = ___change_memory_common(start, size, set_mask, clear_mask);
> /*
> * If the memory is being made valid without changing any other bits
> * then a TLBI isn't required as a non-valid entry cannot be cached in
> @@ -71,6 +169,7 @@ static int __change_memory_common(unsigned long start, unsigned long size,
> */
> if (pgprot_val(set_mask) != PTE_VALID || pgprot_val(clear_mask))
> flush_tlb_kernel_range(start, start + size);
> +
> return ret;
> }
>
> @@ -174,32 +273,26 @@ int set_memory_valid(unsigned long addr, int numpages, int enable)
>
> int set_direct_map_invalid_noflush(struct page *page)
> {
> - struct page_change_data data = {
> - .set_mask = __pgprot(0),
> - .clear_mask = __pgprot(PTE_VALID),
> - };
> + pgprot_t clear_mask = __pgprot(PTE_VALID);
> + pgprot_t set_mask = __pgprot(0);
>
> if (!can_set_direct_map())
> return 0;
>
> - return apply_to_page_range(&init_mm,
> - (unsigned long)page_address(page),
> - PAGE_SIZE, change_page_range, &data);
> + return ___change_memory_common((unsigned long)page_address(page),
> + PAGE_SIZE, set_mask, clear_mask);
> }
>
> int set_direct_map_default_noflush(struct page *page)
> {
> - struct page_change_data data = {
> - .set_mask = __pgprot(PTE_VALID | PTE_WRITE),
> - .clear_mask = __pgprot(PTE_RDONLY),
> - };
> + pgprot_t set_mask = __pgprot(PTE_VALID | PTE_WRITE);
> + pgprot_t clear_mask = __pgprot(PTE_RDONLY);
>
> if (!can_set_direct_map())
> return 0;
>
> - return apply_to_page_range(&init_mm,
> - (unsigned long)page_address(page),
> - PAGE_SIZE, change_page_range, &data);
> + return ___change_memory_common((unsigned long)page_address(page),
> + PAGE_SIZE, set_mask, clear_mask);
> }
>
> static int __set_memory_enc_dec(unsigned long addr,
> diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
> index 8ac2f6d6d2a3..79ab8c754dff 100644
> --- a/include/linux/pagewalk.h
> +++ b/include/linux/pagewalk.h
> @@ -132,6 +132,9 @@ int walk_page_range(struct mm_struct *mm, unsigned long start,
> int walk_kernel_page_table_range(unsigned long start,
> unsigned long end, const struct mm_walk_ops *ops,
> pgd_t *pgd, void *private);
> +int walk_kernel_page_table_range_lockless(unsigned long start,
> + unsigned long end, const struct mm_walk_ops *ops,
> + pgd_t *pgd, void *private);
> int walk_page_range_vma(struct vm_area_struct *vma, unsigned long start,
> unsigned long end, const struct mm_walk_ops *ops,
> void *private);
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index ff5299eca687..7446984b2154 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -632,6 +632,32 @@ int walk_kernel_page_table_range(unsigned long start, unsigned long end,
> return walk_pgd_range(start, end, &walk);
> }
>
> +/*
> + * Use this function to walk the kernel page tables locklessly. It should be
> + * guaranteed that the caller has exclusive access over the range they are
> + * operating on - that there should be no concurrent access, for example,
> + * changing permissions for vmalloc objects.
> + */
> +int walk_kernel_page_table_range_lockless(unsigned long start, unsigned long end,
> + const struct mm_walk_ops *ops, pgd_t *pgd, void *private)
Don't really see the point in having a pgd parameter?
> +{
> + struct mm_struct *mm = &init_mm;
No real need to split out mm here, just reference &init_mm direct below.
> + struct mm_walk walk = {
> + .ops = ops,
> + .mm = mm,
> + .pgd = pgd,
> + .private = private,
> + .no_vma = true
> + };
> +
> + if (start >= end)
> + return -EINVAL;
> + if (!check_ops_valid(ops))
> + return -EINVAL;
> +
> + return walk_pgd_range(start, end, &walk);
> +}
> +
> /**
> * walk_page_range_debug - walk a range of pagetables not backed by a vma
> * @mm: mm_struct representing the target process of page table walk
> --
> 2.30.2
>
Other than nits, etc. this looks fine. Thanks for the refactorings!
Powered by blists - more mailing lists