Message-ID: <ddd39c66-81f1-44bc-95c7-832d5aff8a85@oracle.com>
Date: Tue, 7 Oct 2025 10:41:01 -0400
From: Alejandro Jimenez <alejandro.j.jimenez@...cle.com>
To: Alex Mastro <amastro@...com>
Cc: Jason Gunthorpe <jgg@...pe.ca>,
Alex Williamson <alex.williamson@...hat.com>, kvm@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH] vfio: fix VFIO_IOMMU_UNMAP_DMA when end of range would
overflow u64
On 10/7/25 12:24 AM, Alex Mastro wrote:
> On Mon, Oct 06, 2025 at 09:23:56PM -0400, Alejandro Jimenez wrote:
>> If going this way, we'd also have to deny the MAP requests. Right now we
>
> Agree.
>
>>> I have doubts that anyone actually relies on MAP_DMA-ing such
>>> end-of-u64-mappings in practice, so perhaps it's OK?
>>>
>>
>> The AMD IOMMU supports a 64-bit IOVA, so when using the AMD vIOMMU with DMA
>> remapping enabled + VF passthrough + a Linux guest with iommu.forcedac=1 we
>> hit this issue since the driver (mlx5) starts requesting mappings for IOVAs
>> right at the top of the address space.
>
> Interesting!
>
>> The reason I hadn't sent it to the list yet is that I noticed the
>> additional locations Jason mentioned earlier in the thread (e.g.
>> vfio_find_dma(), vfio_iommu_replay()) and wanted to put together a
>> reproducer that also triggered those paths.
>
> I am not well equipped to test some of these paths, so if you have a recipe, I'd
> be interested.
>
Unfortunately I haven't found a good recipe yet, which is why I delayed
sending my patch with the fix.
I tried a few things to trigger the code paths, but the easy methods are
not enough. For example, vfio_iommu_replay() is called when enabling the
IOMMU model on the container, i.e. when issuing the VFIO_SET_IOMMU
ioctl, but at that point there are no mappings yet for the domain, so
there is no dma_list to traverse.
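To illustrate the ordering, here is a minimal userspace sketch (the
group number, iova, and size are made up, and error handling is
omitted):

#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Minimal sketch of the type1 setup order: VFIO_SET_IOMMU necessarily
 * precedes any VFIO_IOMMU_MAP_DMA, so the replay it triggers walks an
 * empty dma_list. */
static void setup_container(void *buf)
{
	int container = open("/dev/vfio/vfio", O_RDWR);
	int group = open("/dev/vfio/42", O_RDWR);   /* group number made up */

	ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);

	/* Domain creation and vfio_iommu_replay() happen here ... */
	ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1v2_IOMMU);

	/* ... but mappings can only be added afterwards. */
	struct vfio_iommu_type1_dma_map map = {
		.argsz = sizeof(map),
		.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
		.vaddr = (__u64)(unsigned long)buf,
		.iova  = 0x100000,
		.size  = 0x100000,
	};
	ioctl(container, VFIO_IOMMU_MAP_DMA, &map);
}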
I also tried attaching a second group to a container that already held
a group with mappings, but in vfio_iommu_type1_attach_group() the new
group is attached to the "existing compatible domain" that was already
created for the first group. Since mappings are per-domain, there is no
need for a replay.
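The second-group attempt was essentially the following, continuing
inside setup_container() above (the group number is again made up):

/* Attaching another group to a container whose IOMMU is already set
 * reuses the existing compatible domain in
 * vfio_iommu_type1_attach_group(), so no replay happens. */
int group2 = open("/dev/vfio/43", O_RDWR);
ioctl(group2, VFIO_GROUP_SET_CONTAINER, &container);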
I mention this just to give an idea of what I tried; I haven't had time
to attempt triggering the paths involved with dirty tracking...
The basic reproducer I linked in my original patch:
https://gist.github.com/aljimenezb/f3338c9c2eda9b0a7bf5f76b40354db8
only exercises vfio_find_dma_first_node() and vfio_dma_do_unmap(), which
is enough to verify the fix for the bug causing the AMD vIOMMU failure.
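The failing case boils down to something like this (a sketch along the
lines of the gist, not the exact code; it reuses the container and
buffer from the sketch above):

/* Map the last 4K page of the 64-bit IOVA space, then unmap it.
 * Before the fix, the lookup computes iova + size, which wraps to 0,
 * and the mapping is never found. */
static void repro_last_page(int container, void *buf)
{
	struct vfio_iommu_type1_dma_map map = {
		.argsz = sizeof(map),
		.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
		.vaddr = (__u64)(unsigned long)buf,
		.iova  = 0xfffffffffffff000ULL,   /* top of the u64 space */
		.size  = 0x1000,
	};
	struct vfio_iommu_type1_dma_unmap unmap = {
		.argsz = sizeof(unmap),
		.iova  = 0xfffffffffffff000ULL,
		.size  = 0x1000,
	};

	ioctl(container, VFIO_IOMMU_MAP_DMA, &map);

	/* On a broken kernel this misses the range, because
	 * vfio_find_dma_first_node() never finds the node. */
	ioctl(container, VFIO_IOMMU_UNMAP_DMA, &unmap);
}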
>> I mentioned in the notes for the patch above why I chose a slightly more
>> complex method than the '- 1' approach, since there is a chance that
>> iova+size could also go beyond the end of the address space and actually
>> wrap around.
>
> I think certain invariants have broken if that is possible. The current checks
> in the unmap path should prevent that (iova + size - 1 < iova).
>
> 1330 } else if (!size || size & (pgsize - 1) ||
> 1331 iova + size - 1 < iova || size > SIZE_MAX) {
> 1332 goto unlock;
>
Yes, that check properly accounts for overflow, but
vfio_find_dma_first_node() later doesn't, so it won't find the node in
the dma_list. I agree with the approach of sanitizing the inputs at the
uapi boundary, as Jason mentions, but I'd think the code should be
robust enough to handle overflow anyway.
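Concretely, this is the wrap that defeats the old exclusive-end
comparisons (a standalone demo using the values from the last-page
case):

#include <stdint.h>
#include <stdio.h>

/* For a mapping of the last page, start + size wraps to 0, so the old
 * "start < dma->iova + dma->size" style comparison is false for every
 * node and the dma_list lookup misses the mapping; the inclusive-end
 * form does not wrap. */
int main(void)
{
	uint64_t start = 0xfffffffffffff000ULL;  /* dma->iova of last page */
	uint64_t size = 0x1000;

	uint64_t end_excl = start + size;        /* wraps to 0 */
	uint64_t end_incl = start + size - 1;    /* 0xffffffffffffffff */

	printf("exclusive end: %#llx\n", (unsigned long long)end_excl);
	printf("inclusive end: %#llx\n", (unsigned long long)end_incl);

	/* Looking up start against the node that maps it: */
	printf("start < end_excl:  %d\n", start < end_excl);   /* 0: miss */
	printf("start <= end_incl: %d\n", start <= end_incl);  /* 1: hit */
	return 0;
}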
>> My goal was not to trust the inputs at all if possible. We could also use
>> check_add_overflow() if we want to add explicit error reporting in
>> vfio_iommu_type1 when an overflow is detected. i.e. catch bad callers that
>
> I do like the explicitness of the check_* functions over the older style wrap
> checks.
>
> Below is my partially validated attempt at a more comprehensive fix if we were
> to try and make end of address space mappings work, rather than blanking out
> the last page.
>
In my opinion, making end of address space mappings work is not
optional. Ranges ending at the 64-bit boundary are valid, and not
accepting them will break drivers that request them.
I'll test this change when I have a chance, though I'd really like to
have solid test cases, so I am hoping for advice on how to trigger all
the paths.
Thank you,
Alejandro
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 08242d8ce2ca..66a25de35446 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -28,6 +28,7 @@
> #include <linux/iommu.h>
> #include <linux/module.h>
> #include <linux/mm.h>
> +#include <linux/overflow.h>
> #include <linux/kthread.h>
> #include <linux/rbtree.h>
> #include <linux/sched/signal.h>
> @@ -165,12 +166,14 @@ static struct vfio_dma *vfio_find_dma(struct vfio_iommu *iommu,
> {
> struct rb_node *node = iommu->dma_list.rb_node;
>
> + BUG_ON(size == 0);
> +
> while (node) {
> struct vfio_dma *dma = rb_entry(node, struct vfio_dma, node);
>
> - if (start + size <= dma->iova)
> + if (start + size - 1 < dma->iova)
> node = node->rb_left;
> - else if (start >= dma->iova + dma->size)
> + else if (start > dma->iova + dma->size - 1)
> node = node->rb_right;
> else
> return dma;
> @@ -186,10 +189,12 @@ static struct rb_node *vfio_find_dma_first_node(struct vfio_iommu *iommu,
> struct rb_node *node = iommu->dma_list.rb_node;
> struct vfio_dma *dma_res = NULL;
>
> + BUG_ON(size == 0);
> +
> while (node) {
> struct vfio_dma *dma = rb_entry(node, struct vfio_dma, node);
>
> - if (start < dma->iova + dma->size) {
> + if (start <= dma->iova + dma->size - 1) {
> res = node;
> dma_res = dma;
> if (start >= dma->iova)
> @@ -199,7 +204,7 @@ static struct rb_node *vfio_find_dma_first_node(struct vfio_iommu *iommu,
> node = node->rb_right;
> }
> }
> - if (res && size && dma_res->iova > start + size - 1)
> + if (res && dma_res->iova > start + size - 1)
> res = NULL;
> return res;
> }
> @@ -213,7 +218,7 @@ static void vfio_link_dma(struct vfio_iommu *iommu, struct vfio_dma *new)
> parent = *link;
> dma = rb_entry(parent, struct vfio_dma, node);
>
> - if (new->iova + new->size <= dma->iova)
> + if (new->iova + new->size - 1 < dma->iova)
> link = &(*link)->rb_left;
> else
> link = &(*link)->rb_right;
> @@ -825,14 +830,24 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
> unsigned long remote_vaddr;
> struct vfio_dma *dma;
> bool do_accounting;
> + u64 end, to_pin;
>
> - if (!iommu || !pages)
> + if (!iommu || !pages || npage < 0)
> return -EINVAL;
>
> /* Supported for v2 version only */
> if (!iommu->v2)
> return -EACCES;
>
> + if (npage == 0)
> + return 0;
> +
> + if (check_mul_overflow(npage, PAGE_SIZE, &to_pin))
> + return -EINVAL;
> +
> + if (check_add_overflow(user_iova, to_pin - 1, &end))
> + return -EINVAL;
> +
> mutex_lock(&iommu->lock);
>
> if (WARN_ONCE(iommu->vaddr_invalid_count,
> @@ -997,7 +1012,7 @@ static long vfio_sync_unpin(struct vfio_dma *dma, struct vfio_domain *domain,
> #define VFIO_IOMMU_TLB_SYNC_MAX 512
>
> static size_t unmap_unpin_fast(struct vfio_domain *domain,
> - struct vfio_dma *dma, dma_addr_t *iova,
> + struct vfio_dma *dma, dma_addr_t iova,
> size_t len, phys_addr_t phys, long *unlocked,
> struct list_head *unmapped_list,
> int *unmapped_cnt,
> @@ -1007,18 +1022,16 @@ static size_t unmap_unpin_fast(struct vfio_domain *domain,
> struct vfio_regions *entry = kzalloc(sizeof(*entry), GFP_KERNEL);
>
> if (entry) {
> - unmapped = iommu_unmap_fast(domain->domain, *iova, len,
> + unmapped = iommu_unmap_fast(domain->domain, iova, len,
> iotlb_gather);
>
> if (!unmapped) {
> kfree(entry);
> } else {
> - entry->iova = *iova;
> + entry->iova = iova;
> entry->phys = phys;
> entry->len = unmapped;
> list_add_tail(&entry->list, unmapped_list);
> -
> - *iova += unmapped;
> (*unmapped_cnt)++;
> }
> }
> @@ -1037,18 +1050,17 @@ static size_t unmap_unpin_fast(struct vfio_domain *domain,
> }
>
> static size_t unmap_unpin_slow(struct vfio_domain *domain,
> - struct vfio_dma *dma, dma_addr_t *iova,
> + struct vfio_dma *dma, dma_addr_t iova,
> size_t len, phys_addr_t phys,
> long *unlocked)
> {
> - size_t unmapped = iommu_unmap(domain->domain, *iova, len);
> + size_t unmapped = iommu_unmap(domain->domain, iova, len);
>
> if (unmapped) {
> - *unlocked += vfio_unpin_pages_remote(dma, *iova,
> + *unlocked += vfio_unpin_pages_remote(dma, iova,
> phys >> PAGE_SHIFT,
> unmapped >> PAGE_SHIFT,
> false);
> - *iova += unmapped;
> cond_resched();
> }
> return unmapped;
> @@ -1057,12 +1069,12 @@ static size_t unmap_unpin_slow(struct vfio_domain *domain,
> static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
> bool do_accounting)
> {
> - dma_addr_t iova = dma->iova, end = dma->iova + dma->size;
> struct vfio_domain *domain, *d;
> LIST_HEAD(unmapped_region_list);
> struct iommu_iotlb_gather iotlb_gather;
> int unmapped_region_cnt = 0;
> long unlocked = 0;
> + size_t pos = 0;
>
> if (!dma->size)
> return 0;
> @@ -1086,13 +1098,14 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
> }
>
> iommu_iotlb_gather_init(&iotlb_gather);
> - while (iova < end) {
> + while (pos < dma->size) {
> size_t unmapped, len;
> phys_addr_t phys, next;
> + dma_addr_t iova = dma->iova + pos;
>
> phys = iommu_iova_to_phys(domain->domain, iova);
> if (WARN_ON(!phys)) {
> - iova += PAGE_SIZE;
> + pos += PAGE_SIZE;
> continue;
> }
>
> @@ -1101,7 +1114,7 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
> * may require hardware cache flushing, try to find the
> * largest contiguous physical memory chunk to unmap.
> */
> - for (len = PAGE_SIZE; iova + len < end; len += PAGE_SIZE) {
> + for (len = PAGE_SIZE; len + pos < dma->size; len += PAGE_SIZE) {
> next = iommu_iova_to_phys(domain->domain, iova + len);
> if (next != phys + len)
> break;
> @@ -1111,16 +1124,18 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
> * First, try to use fast unmap/unpin. In case of failure,
> * switch to slow unmap/unpin path.
> */
> - unmapped = unmap_unpin_fast(domain, dma, &iova, len, phys,
> + unmapped = unmap_unpin_fast(domain, dma, iova, len, phys,
> &unlocked, &unmapped_region_list,
> &unmapped_region_cnt,
> &iotlb_gather);
> if (!unmapped) {
> - unmapped = unmap_unpin_slow(domain, dma, &iova, len,
> + unmapped = unmap_unpin_slow(domain, dma, iova, len,
> phys, &unlocked);
> if (WARN_ON(!unmapped))
> break;
> }
> +
> + pos += unmapped;
> }
>
> dma->iommu_mapped = false;
> @@ -1212,7 +1227,7 @@ static int update_user_bitmap(u64 __user *bitmap, struct vfio_iommu *iommu,
> }
>
> static int vfio_iova_dirty_bitmap(u64 __user *bitmap, struct vfio_iommu *iommu,
> - dma_addr_t iova, size_t size, size_t pgsize)
> + dma_addr_t iova, u64 end, size_t pgsize)
> {
> struct vfio_dma *dma;
> struct rb_node *n;
> @@ -1229,8 +1244,8 @@ static int vfio_iova_dirty_bitmap(u64 __user *bitmap, struct vfio_iommu *iommu,
> if (dma && dma->iova != iova)
> return -EINVAL;
>
> - dma = vfio_find_dma(iommu, iova + size - 1, 0);
> - if (dma && dma->iova + dma->size != iova + size)
> + dma = vfio_find_dma(iommu, end, 1);
> + if (dma && dma->iova + dma->size - 1 != end)
> return -EINVAL;
>
> for (n = rb_first(&iommu->dma_list); n; n = rb_next(n)) {
> @@ -1239,7 +1254,7 @@ static int vfio_iova_dirty_bitmap(u64 __user *bitmap, struct vfio_iommu *iommu,
> if (dma->iova < iova)
> continue;
>
> - if (dma->iova > iova + size - 1)
> + if (dma->iova > end)
> break;
>
> ret = update_user_bitmap(bitmap, iommu, dma, iova, pgsize);
> @@ -1305,6 +1320,7 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
> unsigned long pgshift;
> dma_addr_t iova = unmap->iova;
> u64 size = unmap->size;
> + u64 unmap_end;
> bool unmap_all = unmap->flags & VFIO_DMA_UNMAP_FLAG_ALL;
> bool invalidate_vaddr = unmap->flags & VFIO_DMA_UNMAP_FLAG_VADDR;
> struct rb_node *n, *first_n;
> @@ -1327,11 +1343,13 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
> if (iova || size)
> goto unlock;
> size = U64_MAX;
> - } else if (!size || size & (pgsize - 1) ||
> - iova + size - 1 < iova || size > SIZE_MAX) {
> + } else if (!size || size & (pgsize - 1) || size > SIZE_MAX) {
> goto unlock;
> }
>
> + if (check_add_overflow(iova, size - 1, &unmap_end))
> + goto unlock;
> +
> /* When dirty tracking is enabled, allow only min supported pgsize */
> if ((unmap->flags & VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP) &&
> (!iommu->dirty_page_tracking || (bitmap->pgsize != pgsize))) {
> @@ -1376,8 +1394,8 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
> if (dma && dma->iova != iova)
> goto unlock;
>
> - dma = vfio_find_dma(iommu, iova + size - 1, 0);
> - if (dma && dma->iova + dma->size != iova + size)
> + dma = vfio_find_dma(iommu, unmap_end, 1);
> + if (dma && dma->iova + dma->size - 1 != unmap_end)
> goto unlock;
> }
>
> @@ -1386,7 +1404,7 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>
> while (n) {
> dma = rb_entry(n, struct vfio_dma, node);
> - if (dma->iova > iova + size - 1)
> + if (dma->iova > unmap_end)
> break;
>
> if (!iommu->v2 && iova > dma->iova)
> @@ -1713,12 +1731,12 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
>
> for (; n; n = rb_next(n)) {
> struct vfio_dma *dma;
> - dma_addr_t iova;
> + size_t pos = 0;
>
> dma = rb_entry(n, struct vfio_dma, node);
> - iova = dma->iova;
>
> - while (iova < dma->iova + dma->size) {
> + while (pos < dma->size) {
> + dma_addr_t iova = dma->iova + pos;
> phys_addr_t phys;
> size_t size;
>
> @@ -1734,14 +1752,15 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
> phys = iommu_iova_to_phys(d->domain, iova);
>
> if (WARN_ON(!phys)) {
> - iova += PAGE_SIZE;
> + pos += PAGE_SIZE;
> continue;
> }
>
> +
> size = PAGE_SIZE;
> p = phys + size;
> i = iova + size;
> - while (i < dma->iova + dma->size &&
> + while (size + pos < dma->size &&
> p == iommu_iova_to_phys(d->domain, i)) {
> size += PAGE_SIZE;
> p += PAGE_SIZE;
> @@ -1782,7 +1801,7 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
> goto unwind;
> }
>
> - iova += size;
> + pos += size;
> }
> }
>
> @@ -1799,29 +1818,29 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
> unwind:
> for (; n; n = rb_prev(n)) {
> struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
> - dma_addr_t iova;
> + size_t pos = 0;
>
> if (dma->iommu_mapped) {
> iommu_unmap(domain->domain, dma->iova, dma->size);
> continue;
> }
>
> - iova = dma->iova;
> - while (iova < dma->iova + dma->size) {
> + while (pos < dma->size) {
> + dma_addr_t iova = dma->iova + pos;
> phys_addr_t phys, p;
> size_t size;
> dma_addr_t i;
>
> phys = iommu_iova_to_phys(domain->domain, iova);
> if (!phys) {
> - iova += PAGE_SIZE;
> + pos += PAGE_SIZE;
> continue;
> }
>
> size = PAGE_SIZE;
> p = phys + size;
> i = iova + size;
> - while (i < dma->iova + dma->size &&
> + while (pos + size < dma->size &&
> p == iommu_iova_to_phys(domain->domain, i)) {
> size += PAGE_SIZE;
> p += PAGE_SIZE;
> @@ -2908,6 +2927,7 @@ static int vfio_iommu_type1_dirty_pages(struct vfio_iommu *iommu,
> unsigned long pgshift;
> size_t data_size = dirty.argsz - minsz;
> size_t iommu_pgsize;
> + u64 end;
>
> if (!data_size || data_size < sizeof(range))
> return -EINVAL;
> @@ -2916,8 +2936,12 @@ static int vfio_iommu_type1_dirty_pages(struct vfio_iommu *iommu,
> sizeof(range)))
> return -EFAULT;
>
> - if (range.iova + range.size < range.iova)
> + if (range.size == 0)
> + return 0;
> +
> + if (check_add_overflow(range.iova, range.size - 1, &end))
> return -EINVAL;
> +
> if (!access_ok((void __user *)range.bitmap.data,
> range.bitmap.size))
> return -EINVAL;
> @@ -2949,7 +2973,7 @@ static int vfio_iommu_type1_dirty_pages(struct vfio_iommu *iommu,
> if (iommu->dirty_page_tracking)
> ret = vfio_iova_dirty_bitmap(range.bitmap.data,
> iommu, range.iova,
> - range.size,
> + end,
> range.bitmap.pgsize);
> else
> ret = -EINVAL;