Message-ID: <d646d434-f680-47a3-b6b9-26f4538c1209@intel.com>
Date: Wed, 6 Aug 2025 08:03:42 -0700
From: Dave Hansen <dave.hansen@...el.com>
To: Lu Baolu <baolu.lu@...ux.intel.com>, Joerg Roedel <joro@...tes.org>,
 Will Deacon <will@...nel.org>, Robin Murphy <robin.murphy@....com>,
 Kevin Tian <kevin.tian@...el.com>, Jason Gunthorpe <jgg@...dia.com>,
 Jann Horn <jannh@...gle.com>, Vasant Hegde <vasant.hegde@....com>,
 Alistair Popple <apopple@...dia.com>, Peter Zijlstra <peterz@...radead.org>,
 Uladzislau Rezki <urezki@...il.com>,
 Jean-Philippe Brucker <jean-philippe@...aro.org>,
 Andy Lutomirski <luto@...nel.org>, Yi Lai <yi1.lai@...el.com>
Cc: iommu@...ts.linux.dev, security@...nel.org, linux-kernel@...r.kernel.org,
 stable@...r.kernel.org
Subject: Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB
 flush

On 8/5/25 22:25, Lu Baolu wrote:
> In the IOMMU Shared Virtual Addressing (SVA) context, the IOMMU hardware
> shares and walks the CPU's page tables. The Linux x86 architecture maps
> the kernel address space into the upper portion of every process’s page
> table. Consequently, in an SVA context, the IOMMU hardware can walk and
> cache kernel space mappings. However, the Linux kernel currently lacks
> a notification mechanism for kernel space mapping changes. This means
> the IOMMU driver is not aware of such changes, leading to a break in
> IOMMU cache coherence.

FWIW, I wouldn't use the term "cache coherence" in this context. I'd
probably just call them "stale IOTLB entries".

I also think this overstates the problem. There is currently no problem
with "kernel space mapping changes". The issue is solely when kernel
page table pages are freed and reused.

> Modern IOMMUs often cache page table entries of the intermediate-level
> page table as long as the entry is valid, no matter the permissions, to
> optimize walk performance. Currently the iommu driver is notified only
> for changes of user VA mappings, so the IOMMU's internal caches may
> retain stale entries for kernel VA. When kernel page table mappings are
> changed (e.g., by vfree()), but the IOMMU's internal caches retain stale
> entries, a use-after-free (UAF) condition arises.
> 
> If these freed page table pages are reallocated for a different purpose,
> potentially by an attacker, the IOMMU could misinterpret the new data as
> valid page table entries. This allows the IOMMU to walk into attacker-
> controlled memory, leading to arbitrary physical memory DMA access or
> privilege escalation.

Note that it's not just use-after-free. It's literally that the IOMMU
will keep writing Accessed and Dirty bits while it thinks the page is
still a page table. The IOMMU will sit there happily setting bits. So,
it's _write_ after free too.

> To mitigate this, introduce a new iommu interface to flush IOMMU caches.
> This interface should be invoked from architecture-specific code that
> manages combined user and kernel page tables, whenever a kernel page table
> update is done and the CPU TLB needs to be flushed.

There's one tidbit missing from this:

	Currently SVA contexts are all unprivileged. They can only
	access user mappings and never kernel mappings. However, IOMMUs
	still walk kernel-only page tables all the way down to the leaf,
	where they realize that the entry is a kernel mapping and error
	out.


> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 39f80111e6f1..3b85e7d3ba44 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -12,6 +12,7 @@
>  #include <linux/task_work.h>
>  #include <linux/mmu_notifier.h>
>  #include <linux/mmu_context.h>
> +#include <linux/iommu.h>
>  
>  #include <asm/tlbflush.h>
>  #include <asm/mmu_context.h>
> @@ -1478,6 +1479,8 @@ void flush_tlb_all(void)
>  	else
>  		/* Fall back to the IPI-based invalidation. */
>  		on_each_cpu(do_flush_tlb_all, NULL, 1);
> +
> +	iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL);
>  }
>  
>  /* Flush an arbitrarily large range of memory with INVLPGB. */
> @@ -1540,6 +1543,7 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
>  		kernel_tlb_flush_range(info);
>  
>  	put_flush_tlb_info();
> +	iommu_sva_invalidate_kva_range(start, end);
>  }

These desperately need to be commented. They especially need comments
explaining that they are solely for flushing the IOMMU's mid-level
paging structures after a page table page has been freed.

>  /*
> diff --git a/drivers/iommu/iommu-sva.c b/drivers/iommu/iommu-sva.c
> index 1a51cfd82808..d0da2b3fd64b 100644
> --- a/drivers/iommu/iommu-sva.c
> +++ b/drivers/iommu/iommu-sva.c
> @@ -10,6 +10,8 @@
>  #include "iommu-priv.h"
>  
>  static DEFINE_MUTEX(iommu_sva_lock);
> +static bool iommu_sva_present;
> +static LIST_HEAD(iommu_sva_mms);
>  static struct iommu_domain *iommu_sva_domain_alloc(struct device *dev,
>  						   struct mm_struct *mm);
>  
> @@ -42,6 +44,7 @@ static struct iommu_mm_data *iommu_alloc_mm_data(struct mm_struct *mm, struct de
>  		return ERR_PTR(-ENOSPC);
>  	}
>  	iommu_mm->pasid = pasid;
> +	iommu_mm->mm = mm;
>  	INIT_LIST_HEAD(&iommu_mm->sva_domains);
>  	/*
>  	 * Make sure the write to mm->iommu_mm is not reordered in front of
> @@ -132,8 +135,13 @@ struct iommu_sva *iommu_sva_bind_device(struct device *dev, struct mm_struct *mm
>  	if (ret)
>  		goto out_free_domain;
>  	domain->users = 1;
> -	list_add(&domain->next, &mm->iommu_mm->sva_domains);
>  
> +	if (list_empty(&iommu_mm->sva_domains)) {
> +		if (list_empty(&iommu_sva_mms))
> +			WRITE_ONCE(iommu_sva_present, true);
> +		list_add(&iommu_mm->mm_list_elm, &iommu_sva_mms);
> +	}
> +	list_add(&domain->next, &iommu_mm->sva_domains);
>  out:
>  	refcount_set(&handle->users, 1);
>  	mutex_unlock(&iommu_sva_lock);
> @@ -175,6 +183,13 @@ void iommu_sva_unbind_device(struct iommu_sva *handle)
>  		list_del(&domain->next);
>  		iommu_domain_free(domain);
>  	}
> +
> +	if (list_empty(&iommu_mm->sva_domains)) {
> +		list_del(&iommu_mm->mm_list_elm);
> +		if (list_empty(&iommu_sva_mms))
> +			WRITE_ONCE(iommu_sva_present, false);
> +	}
> +
>  	mutex_unlock(&iommu_sva_lock);
>  	kfree(handle);
>  }
> @@ -312,3 +327,46 @@ static struct iommu_domain *iommu_sva_domain_alloc(struct device *dev,
>  
>  	return domain;
>  }
> +
> +struct kva_invalidation_work_data {
> +	struct work_struct work;
> +	unsigned long start;
> +	unsigned long end;
> +};
> +
> +static void invalidate_kva_func(struct work_struct *work)
> +{
> +	struct kva_invalidation_work_data *data =
> +		container_of(work, struct kva_invalidation_work_data, work);
> +	struct iommu_mm_data *iommu_mm;
> +
> +	guard(mutex)(&iommu_sva_lock);
> +	list_for_each_entry(iommu_mm, &iommu_sva_mms, mm_list_elm)
> +		mmu_notifier_arch_invalidate_secondary_tlbs(iommu_mm->mm,
> +				data->start, data->end);
> +
> +	kfree(data);
> +}
> +
> +void iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end)
> +{
> +	struct kva_invalidation_work_data *data;
> +
> +	if (likely(!READ_ONCE(iommu_sva_present)))
> +		return;

It would be nice to hear a few more words about why this is safe without
a lock.
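Even a comment above the check would help, e.g. (my guess at what you
intend, correct it if I'm off):

	/*
	 * Unlocked check: this can race with iommu_sva_bind_device()
	 * flipping iommu_sva_present.  Spell out here why skipping the
	 * flush for an mm that is being bound concurrently is safe,
	 * or take iommu_sva_lock.
	 */
	if (likely(!READ_ONCE(iommu_sva_present)))
		return;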

> +	/* will be freed in the task function */
> +	data = kzalloc(sizeof(*data), GFP_ATOMIC);
> +	if (!data)
> +		return;
> +
> +	data->start = start;
> +	data->end = end;
> +	INIT_WORK(&data->work, invalidate_kva_func);
> +
> +	/*
> +	 * Since iommu_sva_mms is an unbound list, iterating it in an atomic
> +	 * context could introduce significant latency issues.
> +	 */
> +	schedule_work(&data->work);
> +}

Hold on a sec, though. The problematic caller of this looks something
like this (logically):

	pmd_free_pte_page()
	{
	        pte = pmd_page_vaddr(*pmd);
	        pmd_clear(pmd);
		flush_tlb_kernel_range(...); // does schedule_work()
		pte_free_kernel(pte);
	}

It _immediately_ frees the PTE page. The schedule_work() work will
probably happen sometime after the page is freed.

Isn't that still a use-after-free? The window is only open for some
arbitrary amount of time, which is better than before, but it's still a
use-after-free.
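
To actually close that window, the invalidation would have to complete
before pte_free_kernel(), i.e. something synchronous along these lines
(only a sketch; it ignores the atomic-context latency concern that the
comment in the patch mentions):

	void iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end)
	{
		struct iommu_mm_data *iommu_mm;

		if (likely(!READ_ONCE(iommu_sva_present)))
			return;

		/* Flush every SVA-bound mm before the caller frees the page. */
		guard(mutex)(&iommu_sva_lock);
		list_for_each_entry(iommu_mm, &iommu_sva_mms, mm_list_elm)
			mmu_notifier_arch_invalidate_secondary_tlbs(iommu_mm->mm,
								    start, end);
	}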
