Message-ID: <4ce79c80-1fc8-4684-920a-c8d82c4c3dc8@intel.com>
Date: Thu, 7 Aug 2025 08:31:18 -0700
From: Dave Hansen <dave.hansen@...el.com>
To: Baolu Lu <baolu.lu@...ux.intel.com>, Jason Gunthorpe <jgg@...dia.com>
Cc: Joerg Roedel <joro@...tes.org>, Will Deacon <will@...nel.org>,
 Robin Murphy <robin.murphy@....com>, Kevin Tian <kevin.tian@...el.com>,
 Jann Horn <jannh@...gle.com>, Vasant Hegde <vasant.hegde@....com>,
 Alistair Popple <apopple@...dia.com>, Peter Zijlstra <peterz@...radead.org>,
 Uladzislau Rezki <urezki@...il.com>,
 Jean-Philippe Brucker <jean-philippe@...aro.org>,
 Andy Lutomirski <luto@...nel.org>, Yi Lai <yi1.lai@...el.com>,
 iommu@...ts.linux.dev, security@...nel.org, linux-kernel@...r.kernel.org,
 stable@...r.kernel.org
Subject: Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB
 flush

On 8/7/25 07:40, Baolu Lu wrote:
...
> I refactored the code above as follows. It compiles but hasn't been
> tested yet. Does it look good to you?

As in, it takes the non-compiling gunk I spewed into my email client and
makes it compile, yeah. Sure. ;)

> diff --git a/arch/x86/include/asm/pgalloc.h b/arch/x86/include/asm/pgalloc.h
> index c88691b15f3c..d9307dd09f67 100644
> --- a/arch/x86/include/asm/pgalloc.h
> +++ b/arch/x86/include/asm/pgalloc.h
> @@ -10,9 +10,11 @@
> 
>  #define __HAVE_ARCH_PTE_ALLOC_ONE
>  #define __HAVE_ARCH_PGD_FREE
> +#define __HAVE_ARCH_PTE_FREE_KERNEL

But I think it really muddies the waters down here.

This kinda reads like "x86 has its own per-arch pte_free_kernel() that
it always needs". Which is far from accurate.

> @@ -844,3 +845,42 @@ void arch_check_zapped_pud(struct vm_area_struct *vma, pud_t pud)
>      /* See note in arch_check_zapped_pte() */
>      VM_WARN_ON_ONCE(!(vma->vm_flags & VM_SHADOW_STACK) && pud_shstk(pud));
>  }
> +
> +static void kernel_pte_work_func(struct work_struct *work);
> +
> +static struct {
> +    struct list_head list;
> +    spinlock_t lock;
> +    struct work_struct work;
> +} kernel_pte_work = {
> +    .list = LIST_HEAD_INIT(kernel_pte_work.list),
> +    .lock = __SPIN_LOCK_UNLOCKED(kernel_pte_work.lock),
> +    .work = __WORK_INITIALIZER(kernel_pte_work.work, kernel_pte_work_func),
> +};
> +
> +static void kernel_pte_work_func(struct work_struct *work)
> +{
> +    struct page *page, *next;
> +
> +    iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL);
> +
> +    guard(spinlock)(&kernel_pte_work.lock);
> +    list_for_each_entry_safe(page, next, &kernel_pte_work.list, lru) {
> +        list_del_init(&page->lru);
> +        pagetable_dtor_free(page_ptdesc(page));
> +    }
> +}
> +
> +/**
> + * pte_free_kernel - free PTE-level kernel page table memory
> + * @mm: the mm_struct of the current context
> + * @pte: pointer to the memory containing the page table
> + */

The kerneldoc here is just wasted bytes, IMNHO. Why not use those bytes
to actually explain what the heck is going on here?

> +void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
> +{
> +    struct page *page = virt_to_page(pte);
> +
> +    guard(spinlock)(&kernel_pte_work.lock);
> +    list_add(&page->lru, &kernel_pte_work.list);
> +    schedule_work(&kernel_pte_work.work);
> +}
> diff --git a/include/asm-generic/pgalloc.h b/include/asm-generic/pgalloc.h
> index 3c8ec3bfea44..716ebab67636 100644
> --- a/include/asm-generic/pgalloc.h
> +++ b/include/asm-generic/pgalloc.h
> @@ -46,6 +46,7 @@ static inline pte_t *pte_alloc_one_kernel_noprof(struct mm_struct *mm)
>  #define pte_alloc_one_kernel(...) alloc_hooks(pte_alloc_one_kernel_noprof(__VA_ARGS__))
>  #endif
> 
> +#ifndef __HAVE_ARCH_PTE_FREE_KERNEL
>  /**
>   * pte_free_kernel - free PTE-level kernel page table memory
>   * @mm: the mm_struct of the current context
> @@ -55,6 +56,7 @@ static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
>  {
>      pagetable_dtor_free(virt_to_ptdesc(pte));
>  }
> +#endif
> 
>  /**
>   * __pte_alloc_one - allocate memory for a PTE-level user page table

I'd much rather the arch-generic code looked like this:

#ifdef CONFIG_ASYNC_PGTABLE_FREE
// code and struct here, or dump them over in some
// other file and do this in a header
#else
static inline void pte_free_kernel_async(struct page *page) {}
#endif

void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
{
	struct page *page = virt_to_page(pte);

	if (IS_ENABLED(CONFIG_ASYNC_PGTABLE_FREE))
		pte_free_kernel_async(page);
	else
		pagetable_dtor_free(page_ptdesc(page));
}

Then in Kconfig, you end up with something like:

config ASYNC_PGTABLE_FREE
	def_bool y
	depends on INTEL_IOMMU_WHATEVER

That tells much more of the whole story in code. It also gives the x86
folks who compile out the IOMMU the exact same code as the arch-generic
folks. It _also_ makes it dirt simple and obvious for the x86 folks to
optimize out the async behavior if they don't like it in the future, by
replacing the compile-time IOMMU check with a runtime one.

Also, if another crazy IOMMU implementation comes along that happens to
do what the x86 IOMMUs do, then they have a single Kconfig switch to
flip. If they follow what this patch tries to do, they'll start by
copying and pasting the x86 implementation.
