Message-ID: <20250710135432.GO1613376@noisy.programming.kicks-ass.net>
Date: Thu, 10 Jul 2025 15:54:32 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: Lu Baolu <baolu.lu@...ux.intel.com>
Cc: Joerg Roedel <joro@...tes.org>, Will Deacon <will@...nel.org>,
	Robin Murphy <robin.murphy@....com>,
	Kevin Tian <kevin.tian@...el.com>, Jason Gunthorpe <jgg@...dia.com>,
	Jann Horn <jannh@...gle.com>, Vasant Hegde <vasant.hegde@....com>,
	Dave Hansen <dave.hansen@...el.com>,
	Alistair Popple <apopple@...dia.com>,
	Uladzislau Rezki <urezki@...il.com>,
	Jean-Philippe Brucker <jean-philippe@...aro.org>,
	Andy Lutomirski <luto@...nel.org>,
	"Tested-by : Yi Lai" <yi1.lai@...el.com>, iommu@...ts.linux.dev,
	security@...nel.org, linux-kernel@...r.kernel.org,
	stable@...r.kernel.org
Subject: Re: [PATCH v2 1/1] iommu/sva: Invalidate KVA range on kernel TLB
 flush

On Wed, Jul 09, 2025 at 02:28:00PM +0800, Lu Baolu wrote:
> The vmalloc() and vfree() functions manage virtually contiguous, but not
> necessarily physically contiguous, kernel memory regions. When vfree()
> unmaps such a region, it tears down the associated kernel page table
> entries and frees the physical pages.
> 
> In the IOMMU Shared Virtual Addressing (SVA) context, the IOMMU hardware
> shares and walks the CPU's page tables. Architectures like x86 share
> static kernel address mappings across all user page tables, allowing the
> IOMMU to access the kernel portion of these tables.
> 
> Modern IOMMUs often cache page table entries to optimize walk performance,
> even for intermediate page table levels. If kernel page table mappings are
> changed (e.g., by vfree()), but the IOMMU's internal caches retain stale
> entries, a Use-After-Free (UAF) condition arises. If these
> freed page table pages are reallocated for a different purpose, potentially
> by an attacker, the IOMMU could misinterpret the new data as valid page
> table entries. This allows the IOMMU to walk into attacker-controlled
> memory, leading to arbitrary physical memory DMA access or privilege
> escalation.
> 
> To mitigate this, introduce a new iommu interface to flush IOMMU caches
> and fence pending page table walks when kernel page mappings are updated.
> This interface should be invoked from architecture-specific code that
> manages combined user and kernel page tables.
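
For illustration, one plausible arch-side call site could look like the
sketch below. The hook point shown (x86's flush_tlb_kernel_range()) is an
assumption for illustration; the quoted patch only defines the iommu-side
interface.

	/*
	 * Hypothetical caller, assuming the interface is named
	 * iommu_sva_invalidate_kva_range() as in the hunk further down.
	 */
	void flush_tlb_kernel_range(unsigned long start, unsigned long end)
	{
		/* ... existing CPU TLB flush of the kernel range ... */
		flush_tlb_all();

		/* Also flush IOMMUs that walk the shared kernel page tables. */
		iommu_sva_invalidate_kva_range(start, end);
	}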

I must say I liked the kPTI based idea better. Having to iterate and
invalidate an unspecified number of IOMMUs from non-preemptible context
seems 'unfortunate'.

Why was this approach chosen over the kPTI one, where we keep a
page-table root that simply does not include the kernel bits, and
therefore the IOMMU will never see them (change) and we'll never have to
invalidate?
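
Roughly, the idea being asked about is a pgd root that carries only the
user half. As a sketch (the helper name and plain memcpy are made up, and
syncing later kernel PGD updates is glossed over):

	/*
	 * Hand the IOMMU a pgd with only the user-space entries, so kernel
	 * mapping changes are never visible to it and never need
	 * invalidating from vfree()/TLB-flush paths.
	 */
	static pgd_t *sva_user_only_pgd_alloc(struct mm_struct *mm)
	{
		pgd_t *pgd = (pgd_t *)__get_free_page(GFP_KERNEL | __GFP_ZERO);

		if (!pgd)
			return NULL;

		/* Copy the user-space entries; leave the kernel half empty. */
		memcpy(pgd, mm->pgd, KERNEL_PGD_BOUNDARY * sizeof(pgd_t));
		return pgd;
	}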

> @@ -132,8 +136,15 @@ struct iommu_sva *iommu_sva_bind_device(struct device *dev, struct mm_struct *mm
>  	if (ret)
>  		goto out_free_domain;
>  	domain->users = 1;
> -	list_add(&domain->next, &mm->iommu_mm->sva_domains);
>  
> +	if (list_empty(&iommu_mm->sva_domains)) {
> +		scoped_guard(spinlock_irqsave, &iommu_mms_lock) {
> +			if (list_empty(&iommu_sva_mms))
> +				static_branch_enable(&iommu_sva_present);
> +			list_add(&iommu_mm->mm_list_elm, &iommu_sva_mms);
> +		}
> +	}
> +	list_add(&domain->next, &iommu_mm->sva_domains);
>  out:
>  	refcount_set(&handle->users, 1);
>  	mutex_unlock(&iommu_sva_lock);
> @@ -175,6 +186,15 @@ void iommu_sva_unbind_device(struct iommu_sva *handle)
>  		list_del(&domain->next);
>  		iommu_domain_free(domain);
>  	}
> +
> +	if (list_empty(&iommu_mm->sva_domains)) {
> +		scoped_guard(spinlock_irqsave, &iommu_mms_lock) {
> +			list_del(&iommu_mm->mm_list_elm);
> +			if (list_empty(&iommu_sva_mms))
> +				static_branch_disable(&iommu_sva_present);
> +		}
> +	}
> +
>  	mutex_unlock(&iommu_sva_lock);
>  	kfree(handle);
>  }

This seems an odd coding style choice; why the extra unneeded
indentation? That is, what's wrong with:

	if (list_empty()) {
		guard(spinlock_irqsave)(&iommu_mms_lock);
		list_del();
		if (list_empty())
			static_branch_disable();
	}

> @@ -312,3 +332,15 @@ static struct iommu_domain *iommu_sva_domain_alloc(struct device *dev,
>  
>  	return domain;
>  }
> +
> +void iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end)
> +{
> +	struct iommu_mm_data *iommu_mm;
> +
> +	if (!static_branch_unlikely(&iommu_sva_present))
> +		return;
> +
> +	guard(spinlock_irqsave)(&iommu_mms_lock);
> +	list_for_each_entry(iommu_mm, &iommu_sva_mms, mm_list_elm)
> +		mmu_notifier_arch_invalidate_secondary_tlbs(iommu_mm->mm, start, end);
> +}

This is absolutely the wrong way to use static_branch. You want them in
inline functions guarding the function call, not inside the function
call.
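
Something along these lines (a sketch; the split into an inline stub plus a
__-prefixed out-of-line function is an assumed naming, not taken from the
patch):

	/*
	 * In a header, e.g. include/linux/iommu.h: the static_branch test
	 * lives in the inline stub, so callers skip the out-of-line call
	 * entirely when no SVA domain exists.
	 */
	DECLARE_STATIC_KEY_FALSE(iommu_sva_present);

	void __iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end);

	static inline void iommu_sva_invalidate_kva_range(unsigned long start,
							  unsigned long end)
	{
		if (static_branch_unlikely(&iommu_sva_present))
			__iommu_sva_invalidate_kva_range(start, end);
	}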

