[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250710135432.GO1613376@noisy.programming.kicks-ass.net>
Date: Thu, 10 Jul 2025 15:54:32 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: Lu Baolu <baolu.lu@...ux.intel.com>
Cc: Joerg Roedel <joro@...tes.org>, Will Deacon <will@...nel.org>,
Robin Murphy <robin.murphy@....com>,
Kevin Tian <kevin.tian@...el.com>, Jason Gunthorpe <jgg@...dia.com>,
Jann Horn <jannh@...gle.com>, Vasant Hegde <vasant.hegde@....com>,
Dave Hansen <dave.hansen@...el.com>,
Alistair Popple <apopple@...dia.com>,
Uladzislau Rezki <urezki@...il.com>,
Jean-Philippe Brucker <jean-philippe@...aro.org>,
Andy Lutomirski <luto@...nel.org>,
"Tested-by : Yi Lai" <yi1.lai@...el.com>, iommu@...ts.linux.dev,
security@...nel.org, linux-kernel@...r.kernel.org,
stable@...r.kernel.org
Subject: Re: [PATCH v2 1/1] iommu/sva: Invalidate KVA range on kernel TLB
flush
On Wed, Jul 09, 2025 at 02:28:00PM +0800, Lu Baolu wrote:
> The vmalloc() and vfree() functions manage virtually contiguous, but not
> necessarily physically contiguous, kernel memory regions. When vfree()
> unmaps such a region, it tears down the associated kernel page table
> entries and frees the physical pages.
>
> In the IOMMU Shared Virtual Addressing (SVA) context, the IOMMU hardware
> shares and walks the CPU's page tables. Architectures like x86 share
> static kernel address mappings across all user page tables, allowing the
> IOMMU to access the kernel portion of these tables.
>
> Modern IOMMUs often cache page table entries to optimize walk performance,
> even for intermediate page table levels. If kernel page table mappings are
> changed (e.g., by vfree()), but the IOMMU's internal caches retain stale
> entries, Use-After-Free (UAF) vulnerability condition arises. If these
> freed page table pages are reallocated for a different purpose, potentially
> by an attacker, the IOMMU could misinterpret the new data as valid page
> table entries. This allows the IOMMU to walk into attacker-controlled
> memory, leading to arbitrary physical memory DMA access or privilege
> escalation.
>
> To mitigate this, introduce a new iommu interface to flush IOMMU caches
> and fence pending page table walks when kernel page mappings are updated.
> This interface should be invoked from architecture-specific code that
> manages combined user and kernel page tables.
I must say I liked the kPTI based idea better. Having to iterate and
invalidate an unspecified number of IOMMUs from non-preemptible context
seems 'unfortunate'.
Why was this approach chosen over the kPTI one, where we keep a
page-table root that simply does not include the kernel bits, and
therefore the IOMMU will never see them (change) and we'll never have to
invalidate?
> @@ -132,8 +136,15 @@ struct iommu_sva *iommu_sva_bind_device(struct device *dev, struct mm_struct *mm
> if (ret)
> goto out_free_domain;
> domain->users = 1;
> - list_add(&domain->next, &mm->iommu_mm->sva_domains);
>
> + if (list_empty(&iommu_mm->sva_domains)) {
> + scoped_guard(spinlock_irqsave, &iommu_mms_lock) {
> + if (list_empty(&iommu_sva_mms))
> + static_branch_enable(&iommu_sva_present);
> + list_add(&iommu_mm->mm_list_elm, &iommu_sva_mms);
> + }
> + }
> + list_add(&domain->next, &iommu_mm->sva_domains);
> out:
> refcount_set(&handle->users, 1);
> mutex_unlock(&iommu_sva_lock);
> @@ -175,6 +186,15 @@ void iommu_sva_unbind_device(struct iommu_sva *handle)
> list_del(&domain->next);
> iommu_domain_free(domain);
> }
> +
> + if (list_empty(&iommu_mm->sva_domains)) {
> + scoped_guard(spinlock_irqsave, &iommu_mms_lock) {
> + list_del(&iommu_mm->mm_list_elm);
> + if (list_empty(&iommu_sva_mms))
> + static_branch_disable(&iommu_sva_present);
> + }
> + }
> +
> mutex_unlock(&iommu_sva_lock);
> kfree(handle);
> }
This seems an odd coding style choice; why the extra unneeded
indentation? That is, what's wrong with:
if (list_empty()) {
guard(spinlock_irqsave)(&iommu_mms_lock);
list_del();
if (list_empty()
static_branch_disable();
}
> @@ -312,3 +332,15 @@ static struct iommu_domain *iommu_sva_domain_alloc(struct device *dev,
>
> return domain;
> }
> +
> +void iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end)
> +{
> + struct iommu_mm_data *iommu_mm;
> +
> + if (!static_branch_unlikely(&iommu_sva_present))
> + return;
> +
> + guard(spinlock_irqsave)(&iommu_mms_lock);
> + list_for_each_entry(iommu_mm, &iommu_sva_mms, mm_list_elm)
> + mmu_notifier_arch_invalidate_secondary_tlbs(iommu_mm->mm, start, end);
> +}
This is absolutely the wrong way to use static_branch. You want them in
inline functions guarding the function call, not inside the function
call.
Powered by blists - more mailing lists