Message-ID: <f0d561a1-231d-495e-a91a-9724d4037f05@linux.intel.com>
Date: Thu, 7 Aug 2025 14:53:46 +0800
From: Baolu Lu <baolu.lu@...ux.intel.com>
To: Dave Hansen <dave.hansen@...el.com>, Joerg Roedel <joro@...tes.org>,
 Will Deacon <will@...nel.org>, Robin Murphy <robin.murphy@....com>,
 Kevin Tian <kevin.tian@...el.com>, Jason Gunthorpe <jgg@...dia.com>,
 Jann Horn <jannh@...gle.com>, Vasant Hegde <vasant.hegde@....com>,
 Alistair Popple <apopple@...dia.com>, Peter Zijlstra <peterz@...radead.org>,
 Uladzislau Rezki <urezki@...il.com>,
 Jean-Philippe Brucker <jean-philippe@...aro.org>,
 Andy Lutomirski <luto@...nel.org>, Yi Lai <yi1.lai@...el.com>
Cc: iommu@...ts.linux.dev, security@...nel.org, linux-kernel@...r.kernel.org,
 stable@...r.kernel.org
Subject: Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB
 flush

On 8/6/25 23:03, Dave Hansen wrote:
> On 8/5/25 22:25, Lu Baolu wrote:
>> In the IOMMU Shared Virtual Addressing (SVA) context, the IOMMU hardware
>> shares and walks the CPU's page tables. The Linux x86 architecture maps
>> the kernel address space into the upper portion of every process’s page
>> table. Consequently, in an SVA context, the IOMMU hardware can walk and
>> cache kernel space mappings. However, the Linux kernel currently lacks
>> a notification mechanism for kernel space mapping changes. This means
>> the IOMMU driver is not aware of such changes, leading to a break in
>> IOMMU cache coherence.
> FWIW, I wouldn't use the term "cache coherence" in this context. I'd
> probably just call them "stale IOTLB entries".
> 
> I also think this over states the problem. There is currently no problem
> with "kernel space mapping changes". The issue is solely when kernel
> page table pages are freed and reused.
> 
>> Modern IOMMUs often cache page table entries of the intermediate-level
>> page table as long as the entry is valid, no matter the permissions, to
>> optimize walk performance. Currently the iommu driver is notified only
>> for changes of user VA mappings, so the IOMMU's internal caches may
>> retain stale entries for kernel VA. When kernel page table mappings are
>> changed (e.g., by vfree()), but the IOMMU's internal caches retain stale
>> entries, a Use-After-Free (UAF) condition arises.
>>
>> If these freed page table pages are reallocated for a different purpose,
>> potentially by an attacker, the IOMMU could misinterpret the new data as
>> valid page table entries. This allows the IOMMU to walk into attacker-
>> controlled memory, leading to arbitrary physical memory DMA access or
>> privilege escalation.
> Note that it's not just use-after-free. It's literally that the IOMMU
> will keep writing Accessed and Dirty bits while it thinks the page is
> still a page table. The IOMMU will sit there happily setting bits. So,
> it's _write_ after free too.
> 
>> To mitigate this, introduce a new iommu interface to flush IOMMU caches.
>> This interface should be invoked from architecture-specific code that
>> manages combined user and kernel page tables, whenever a kernel page table
>> update is done and the CPU TLB needs to be flushed.
> There's one tidbit missing from this:
> 
> 	Currently SVA contexts are all unprivileged. They can only
> 	access user mappings and never kernel mappings. However, IOMMUs
> 	still walk kernel-only page tables all the way down to the leaf
> 	where they realize that the entry is a kernel mapping and error
> 	out.

Thank you for the guidance. I will improve the commit message
accordingly, as follows.

iommu/sva: Invalidate stale IOTLB entries for kernel address space

In the IOMMU Shared Virtual Addressing (SVA) context, the IOMMU hardware
shares and walks the CPU's page tables. The x86 architecture maps the
kernel's virtual address space into the upper portion of every process's
page table. Consequently, in an SVA context, the IOMMU hardware can walk
and cache kernel page table entries.

The Linux kernel currently lacks a notification mechanism for kernel page
table changes, specifically when page table pages are freed and reused.
The IOMMU driver is only notified of changes to user virtual address
mappings. This can cause the IOMMU's internal caches to retain stale
entries for kernel VA.

A Use-After-Free (UAF) condition arises when kernel page table pages are
freed and later reallocated. The IOMMU could misinterpret the new data as
valid page table entries and walk into attacker-controlled memory, leading
to arbitrary physical memory DMA access or privilege escalation. It is
also a Write-After-Free (WAF) issue, as the IOMMU may continue to set
Accessed and Dirty bits in the freed memory while walking the stale page
tables.

Currently, SVA contexts are unprivileged and cannot access kernel
mappings. However, the IOMMU will still walk kernel-only page tables
all the way down to the leaf entries, where it realizes the mapping
is for the kernel and errors out. This means the IOMMU still caches
these intermediate page table entries, making the described vulnerability
a real concern.

To mitigate this, a new IOMMU interface is introduced to flush IOTLB
entries for the kernel address space. This interface is invoked from the
x86 architecture code that manages combined user and kernel page tables,
specifically when a kernel page table update requires a CPU TLB flush.

Thanks,
baolu
