Message-ID: <f0d561a1-231d-495e-a91a-9724d4037f05@linux.intel.com>
Date: Thu, 7 Aug 2025 14:53:46 +0800
From: Baolu Lu <baolu.lu@...ux.intel.com>
To: Dave Hansen <dave.hansen@...el.com>, Joerg Roedel <joro@...tes.org>,
Will Deacon <will@...nel.org>, Robin Murphy <robin.murphy@....com>,
Kevin Tian <kevin.tian@...el.com>, Jason Gunthorpe <jgg@...dia.com>,
Jann Horn <jannh@...gle.com>, Vasant Hegde <vasant.hegde@....com>,
Alistair Popple <apopple@...dia.com>, Peter Zijlstra <peterz@...radead.org>,
Uladzislau Rezki <urezki@...il.com>,
Jean-Philippe Brucker <jean-philippe@...aro.org>,
Andy Lutomirski <luto@...nel.org>, Yi Lai <yi1.lai@...el.com>
Cc: iommu@...ts.linux.dev, security@...nel.org, linux-kernel@...r.kernel.org,
stable@...r.kernel.org
Subject: Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB
flush

On 8/6/25 23:03, Dave Hansen wrote:
> On 8/5/25 22:25, Lu Baolu wrote:
>> In the IOMMU Shared Virtual Addressing (SVA) context, the IOMMU hardware
>> shares and walks the CPU's page tables. The Linux x86 architecture maps
>> the kernel address space into the upper portion of every process’s page
>> table. Consequently, in an SVA context, the IOMMU hardware can walk and
>> cache kernel space mappings. However, the Linux kernel currently lacks
>> a notification mechanism for kernel space mapping changes. This means
>> the IOMMU driver is not aware of such changes, leading to a break in
>> IOMMU cache coherence.
> FWIW, I wouldn't use the term "cache coherence" in this context. I'd
> probably just call them "stale IOTLB entries".
>
> I also think this overstates the problem. There is currently no problem
> with "kernel space mapping changes". The issue is solely when kernel
> page table pages are freed and reused.
>
>> Modern IOMMUs often cache page table entries of the intermediate-level
>> page table as long as the entry is valid, no matter the permissions, to
>> optimize walk performance. Currently the iommu driver is notified only
>> for changes of user VA mappings, so the IOMMU's internal caches may
>> retain stale entries for kernel VA. When kernel page table mappings are
>> changed (e.g., by vfree()), but the IOMMU's internal caches retain stale
>> entries, a Use-After-Free (UAF) condition arises.
>>
>> If these freed page table pages are reallocated for a different purpose,
>> potentially by an attacker, the IOMMU could misinterpret the new data as
>> valid page table entries. This allows the IOMMU to walk into attacker-
>> controlled memory, leading to arbitrary physical memory DMA access or
>> privilege escalation.
> Note that it's not just use-after-free. It's literally that the IOMMU
> will keep writing Accessed and Dirty bits while it thinks the page is
> still a page table. The IOMMU will sit there happily setting bits. So,
> it's _write_ after free too.
>
>> To mitigate this, introduce a new iommu interface to flush IOMMU caches.
>> This interface should be invoked from architecture-specific code that
>> manages combined user and kernel page tables, whenever a kernel page table
>> update is done and the CPU TLB needs to be flushed.
> There's one tidbit missing from this:
>
> Currently SVA contexts are all unprivileged. They can only
> access user mappings and never kernel mappings. However, IOMMUs
> still walk kernel-only page tables all the way down to the leaf
> where they realize that the entry is a kernel mapping and error
> out.

Thank you for the guidance. I will improve the commit message
accordingly, as follows.

iommu/sva: Invalidate stale IOTLB entries for kernel address space

In the IOMMU Shared Virtual Addressing (SVA) context, the IOMMU hardware
shares and walks the CPU's page tables. The x86 architecture maps the
kernel's virtual address space into the upper portion of every process's
page table. Consequently, in an SVA context, the IOMMU hardware can walk
and cache kernel page table entries.

The Linux kernel currently lacks a notification mechanism for kernel page
table changes, specifically when page table pages are freed and reused.
The IOMMU driver is only notified of changes to user virtual address
mappings. This can cause the IOMMU's internal caches to retain stale
entries for kernel VA.

A Use-After-Free (UAF) condition arises when kernel page table pages are
freed and later reallocated. The IOMMU could misinterpret the new data as
valid page table entries and then walk into attacker-controlled memory,
leading to arbitrary physical memory DMA access or privilege escalation.
This is also a Write-After-Free (WAF) issue, because the IOMMU may
continue to write Accessed and Dirty bits to the freed memory while
walking the stale page tables.

Currently, SVA contexts are unprivileged and cannot access kernel
mappings. However, the IOMMU will still walk kernel-only page tables
all the way down to the leaf entries, where it realizes the mapping
is for the kernel and errors out. This means the IOMMU still caches
these intermediate page table entries, making the described vulnerability
a real concern.

To mitigate this, a new IOMMU interface is introduced to flush IOTLB
entries for the kernel address space. This interface is invoked from the
x86 architecture code that manages combined user and kernel page tables,
specifically when a kernel page table update requires a CPU TLB flush.

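
For illustration only, the flow described in the last paragraph can be
modeled in plain userspace C. This is a rough sketch under stated
assumptions, not the patch itself: every name in it (model_iommu,
model_flush_tlb_kernel_range() and so on) is hypothetical, the CPU TLB
flush is not modeled, and the real IOMMU caches are reduced to a small
array of kernel VA ranges.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Userspace model only. Every name below is hypothetical and is not
 * the kernel or IOMMU driver API.
 */
#define MODEL_MAX_ENTRIES 8

/* One cached translation covering the kernel VA range [start, end). */
struct model_entry {
	unsigned long start;
	unsigned long end;
	bool valid;
};

struct model_iommu {
	struct model_entry cache[MODEL_MAX_ENTRIES];
	bool sva_in_use;	/* true while any SVA binding exists */
};

/* Record that the IOMMU cached an entry while walking kernel tables. */
bool model_cache_add(struct model_iommu *iommu,
		     unsigned long start, unsigned long end)
{
	assert(start < end);
	for (size_t i = 0; i < MODEL_MAX_ENTRIES; i++) {
		if (!iommu->cache[i].valid) {
			iommu->cache[i].start = start;
			iommu->cache[i].end = end;
			iommu->cache[i].valid = true;
			return true;
		}
	}
	return false;	/* cache is full */
}

/* Count the entries the modeled IOMMU still holds. */
size_t model_cached_count(const struct model_iommu *iommu)
{
	size_t n = 0;

	for (size_t i = 0; i < MODEL_MAX_ENTRIES; i++)
		if (iommu->cache[i].valid)
			n++;
	return n;
}

/* Drop every cached entry that overlaps the range [start, end). */
void model_invalidate_kva_range(struct model_iommu *iommu,
				unsigned long start, unsigned long end)
{
	for (size_t i = 0; i < MODEL_MAX_ENTRIES; i++) {
		struct model_entry *e = &iommu->cache[i];

		if (e->valid && e->start < end && start < e->end)
			e->valid = false;
	}
}

/*
 * Stand-in for the arch kernel TLB flush path. The CPU flush itself
 * is not modeled; the IOMMU is invalidated only while SVA is in use.
 */
void model_flush_tlb_kernel_range(struct model_iommu *iommu,
				  unsigned long start, unsigned long end)
{
	if (iommu->sva_in_use)
		model_invalidate_kva_range(iommu, start, end);
}
```

The point of the model is only the ordering: the IOMMU invalidation
rides on the same kernel range flush that the CPU already performs, and
it is skipped entirely when no SVA binding exists.
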
Thanks,
baolu