Message-ID: <aXeO7wh7bpacJ1Sk@skinsburskii.localdomain>
Date: Mon, 26 Jan 2026 07:57:35 -0800
From: Stanislav Kinsburskii <skinsburskii@...ux.microsoft.com>
To: Mukesh R <mrathor@...ux.microsoft.com>
Cc: linux-kernel@...r.kernel.org, linux-hyperv@...r.kernel.org,
linux-arm-kernel@...ts.infradead.org, iommu@...ts.linux.dev,
linux-pci@...r.kernel.org, linux-arch@...r.kernel.org,
kys@...rosoft.com, haiyangz@...rosoft.com, wei.liu@...nel.org,
decui@...rosoft.com, longli@...rosoft.com, catalin.marinas@....com,
will@...nel.org, tglx@...utronix.de, mingo@...hat.com, bp@...en8.de,
dave.hansen@...ux.intel.com, hpa@...or.com, joro@...tes.org,
lpieralisi@...nel.org, kwilczynski@...nel.org, mani@...nel.org,
robh@...nel.org, bhelgaas@...gle.com, arnd@...db.de,
nunodasneves@...ux.microsoft.com, mhklinux@...look.com,
romank@...ux.microsoft.com
Subject: Re: [PATCH v0 12/15] x86/hyperv: Implement hyperv virtual iommu
On Fri, Jan 23, 2026 at 05:26:19PM -0800, Mukesh R wrote:
> On 1/20/26 16:12, Stanislav Kinsburskii wrote:
> > On Mon, Jan 19, 2026 at 10:42:27PM -0800, Mukesh R wrote:
> > > From: Mukesh Rathor <mrathor@...ux.microsoft.com>
> > >
> > > Add a new file to implement management of device domains, mapping and
> > > unmapping of iommu memory, and other iommu_ops to fit within the VFIO
> > > framework for PCI passthru on Hyper-V running Linux as root or L1VH
> > > parent. This also implements a direct attach mechanism for PCI passthru,
> > > which also works within the VFIO framework.
> > >
> > > At a high level, during boot the hypervisor creates a default identity
> > > domain and attaches all devices to it. This maps nicely to the Linux iommu
> > > subsystem's IOMMU_DOMAIN_IDENTITY domain. As a result, Linux does not
> > > need to explicitly ask Hyper-V to attach devices and do maps/unmaps
> > > during boot. As mentioned previously, Hyper-V supports two ways to do
> > > PCI passthru:
> > >
> > > 1. Device Domain: root must create a device domain in the hypervisor,
> > > and do map/unmap hypercalls for mapping and unmapping guest RAM.
> > > All hypervisor communications use device id of type PCI for
> > > identifying and referencing the device.
> > >
> > > 2. Direct Attach: the hypervisor will simply use the guest's HW
> > > page table for mappings, thus the host need not do map/unmap
> > > device memory hypercalls. As such, direct attach passthru setup
> > > during guest boot is extremely fast. A direct attached device
> > > must be referenced via logical device id and not via the PCI
> > > device id.
> > >
> > > At present, the L1VH root/parent only supports direct attach. Direct
> > > attach is also the default in non-L1VH cases because the current device
> > > domain implementation has significant performance issues for guests with
> > > larger RAM (say, more than 8GB), which unfortunately cannot be addressed
> > > in the short term.
> > >
> >
> > <snip>
> >
<snip>
> > > +static void hv_iommu_detach_dev(struct iommu_domain *immdom, struct device *dev)
> > > +{
> > > + struct pci_dev *pdev;
> > > + struct hv_domain *hvdom = to_hv_domain(immdom);
> > > +
> > > + /* See the attach function, only PCI devices for now */
> > > + if (!dev_is_pci(dev))
> > > + return;
> > > +
> > > + if (hvdom->num_attchd == 0)
> > > + pr_warn("Hyper-V: num_attchd is zero (%s)\n", dev_name(dev));
> > > +
> > > + pdev = to_pci_dev(dev);
> > > +
> > > + if (hvdom->attached_dom) {
> > > + hv_iommu_det_dev_from_guest(hvdom, pdev);
> > > +
> > > + /* Do not reset attached_dom, hv_iommu_unmap_pages happens
> > > + * next.
> > > + */
> > > + } else {
> > > + hv_iommu_det_dev_from_dom(hvdom, pdev);
> > > + }
> > > +
> > > + hvdom->num_attchd--;
> >
> > Shouldn't this be modified iff the detach succeeded?
>
> We still want to free the domain and not let it get stuck. The purpose
> is more to make sure detach was called before the domain is freed.
>
How can one debug subsequent errors if num_attchd is decremented
unconditionally? In reality the device is left attached, but the related
kernel metadata is gone.
> > > +}
> > > +
> > > +static int hv_iommu_add_tree_mapping(struct hv_domain *hvdom,
> > > + unsigned long iova, phys_addr_t paddr,
> > > + size_t size, u32 flags)
> > > +{
> > > + unsigned long irqflags;
> > > + struct hv_iommu_mapping *mapping;
> > > +
> > > + mapping = kzalloc(sizeof(*mapping), GFP_ATOMIC);
> > > + if (!mapping)
> > > + return -ENOMEM;
> > > +
> > > + mapping->paddr = paddr;
> > > + mapping->iova.start = iova;
> > > + mapping->iova.last = iova + size - 1;
> > > + mapping->flags = flags;
> > > +
> > > + spin_lock_irqsave(&hvdom->mappings_lock, irqflags);
> > > + interval_tree_insert(&mapping->iova, &hvdom->mappings_tree);
> > > + spin_unlock_irqrestore(&hvdom->mappings_lock, irqflags);
> > > +
> > > + return 0;
> > > +}
> > > +
> > > +static size_t hv_iommu_del_tree_mappings(struct hv_domain *hvdom,
> > > + unsigned long iova, size_t size)
> > > +{
> > > + unsigned long flags;
> > > + size_t unmapped = 0;
> > > + unsigned long last = iova + size - 1;
> > > + struct hv_iommu_mapping *mapping = NULL;
> > > + struct interval_tree_node *node, *next;
> > > +
> > > + spin_lock_irqsave(&hvdom->mappings_lock, flags);
> > > + next = interval_tree_iter_first(&hvdom->mappings_tree, iova, last);
> > > + while (next) {
> > > + node = next;
> > > + mapping = container_of(node, struct hv_iommu_mapping, iova);
> > > + next = interval_tree_iter_next(node, iova, last);
> > > +
> > > + /* Trying to split a mapping? Not supported for now. */
> > > + if (mapping->iova.start < iova)
> > > + break;
> > > +
> > > + unmapped += mapping->iova.last - mapping->iova.start + 1;
> > > +
> > > + interval_tree_remove(node, &hvdom->mappings_tree);
> > > + kfree(mapping);
> > > + }
> > > + spin_unlock_irqrestore(&hvdom->mappings_lock, flags);
> > > +
> > > + return unmapped;
> > > +}
> > > +
> > > +/* Return: must return exact status from the hypercall without changes */
> > > +static u64 hv_iommu_map_pgs(struct hv_domain *hvdom,
> > > + unsigned long iova, phys_addr_t paddr,
> > > + unsigned long npages, u32 map_flags)
> > > +{
> > > + u64 status;
> > > + int i;
> > > + struct hv_input_map_device_gpa_pages *input;
> > > + unsigned long flags, pfn = paddr >> HV_HYP_PAGE_SHIFT;
> > > +
> > > + local_irq_save(flags);
> > > + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> > > + memset(input, 0, sizeof(*input));
> > > +
> > > + input->device_domain.partition_id = HV_PARTITION_ID_SELF;
> > > + input->device_domain.domain_id.type = HV_DEVICE_DOMAIN_TYPE_S2;
> > > + input->device_domain.domain_id.id = hvdom->domid_num;
> > > + input->map_flags = map_flags;
> > > + input->target_device_va_base = iova;
> > > +
> > > + pfn = paddr >> HV_HYP_PAGE_SHIFT;
> > > + for (i = 0; i < npages; i++, pfn++)
> > > + input->gpa_page_list[i] = pfn;
> > > +
> > > + status = hv_do_rep_hypercall(HVCALL_MAP_DEVICE_GPA_PAGES, npages, 0,
> > > + input, NULL);
> > > +
> > > + local_irq_restore(flags);
> > > + return status;
> > > +}
> > > +
> > > +/*
> > > + * The core VFIO code loops over memory ranges calling this function with
> > > + * the largest size from HV_IOMMU_PGSIZES. cond_resched() is in vfio_iommu_map.
> > > + */
> > > +static int hv_iommu_map_pages(struct iommu_domain *immdom, ulong iova,
> > > + phys_addr_t paddr, size_t pgsize, size_t pgcount,
> > > + int prot, gfp_t gfp, size_t *mapped)
> > > +{
> > > + u32 map_flags;
> > > + int ret;
> > > + u64 status;
> > > + unsigned long npages, done = 0;
> > > + struct hv_domain *hvdom = to_hv_domain(immdom);
> > > + size_t size = pgsize * pgcount;
> > > +
> > > + map_flags = HV_MAP_GPA_READABLE; /* required */
> > > + map_flags |= prot & IOMMU_WRITE ? HV_MAP_GPA_WRITABLE : 0;
> > > +
> > > + ret = hv_iommu_add_tree_mapping(hvdom, iova, paddr, size, map_flags);
> > > + if (ret)
> > > + return ret;
> > > +
> > > + if (hvdom->attached_dom) {
> > > + *mapped = size;
> > > + return 0;
> > > + }
> > > +
> > > + npages = size >> HV_HYP_PAGE_SHIFT;
> > > + while (done < npages) {
> > > + ulong completed, remain = npages - done;
> > > +
> > > + status = hv_iommu_map_pgs(hvdom, iova, paddr, remain,
> > > + map_flags);
> > > +
> > > + completed = hv_repcomp(status);
> > > + done = done + completed;
> > > + iova = iova + (completed << HV_HYP_PAGE_SHIFT);
> > > + paddr = paddr + (completed << HV_HYP_PAGE_SHIFT);
> > > +
> > > + if (hv_result(status) == HV_STATUS_INSUFFICIENT_MEMORY) {
> > > + ret = hv_call_deposit_pages(NUMA_NO_NODE,
> > > + hv_current_partition_id,
> > > + 256);
> > > + if (ret)
> > > + break;
> > > + }
> > > + if (!hv_result_success(status))
> > > + break;
> > > + }
> > > +
> > > + if (!hv_result_success(status)) {
> > > + size_t done_size = done << HV_HYP_PAGE_SHIFT;
> > > +
> > > + hv_status_err(status, "pgs:%lx/%lx iova:%lx\n",
> > > + done, npages, iova);
> > > + /*
> > > + * lookup tree has all mappings [0 - size-1]. Below unmap will
> > > + * only remove from [0 - done], we need to remove second chunk
> > > + * [done+1 - size-1].
> > > + */
> > > + hv_iommu_del_tree_mappings(hvdom, iova, size - done_size);
> > > + hv_iommu_unmap_pages(immdom, iova - done_size, pgsize,
> > > + done, NULL);
> > > + if (mapped)
> > > + *mapped = 0;
> > > + } else
> > > + if (mapped)
> > > + *mapped = size;
> > > +
> > > + return hv_result_to_errno(status);
> > > +}
> > > +
> > > +static size_t hv_iommu_unmap_pages(struct iommu_domain *immdom, ulong iova,
> > > + size_t pgsize, size_t pgcount,
> > > + struct iommu_iotlb_gather *gather)
> > > +{
> > > + unsigned long flags, npages;
> > > + struct hv_input_unmap_device_gpa_pages *input;
> > > + u64 status;
> > > + struct hv_domain *hvdom = to_hv_domain(immdom);
> > > + size_t unmapped, size = pgsize * pgcount;
> > > +
> > > + unmapped = hv_iommu_del_tree_mappings(hvdom, iova, size);
> > > + if (unmapped < size)
> > > + pr_err("%s: could not delete all mappings (%lx:%lx/%lx)\n",
> > > + __func__, iova, unmapped, size);
> > > +
> > > + if (hvdom->attached_dom)
> > > + return size;
> > > +
> > > + npages = size >> HV_HYP_PAGE_SHIFT;
> > > +
> > > + local_irq_save(flags);
> > > + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> > > + memset(input, 0, sizeof(*input));
> > > +
> > > + input->device_domain.partition_id = HV_PARTITION_ID_SELF;
> > > + input->device_domain.domain_id.type = HV_DEVICE_DOMAIN_TYPE_S2;
> > > + input->device_domain.domain_id.id = hvdom->domid_num;
> > > + input->target_device_va_base = iova;
> > > +
> > > + status = hv_do_rep_hypercall(HVCALL_UNMAP_DEVICE_GPA_PAGES, npages,
> > > + 0, input, NULL);
> > > + local_irq_restore(flags);
> > > +
> > > + if (!hv_result_success(status))
> > > + hv_status_err(status, "\n");
> > > +
> >
> > There is some inconsistency in namings and behaviour of paired
> > functions:
> > 1. The pair of hv_iommu_unmap_pages is called hv_iommu_map_pgs
>
> The pair of hv_iommu_unmap_pages is hv_iommu_map_pages right above.
> hv_iommu_map_pgs could be renamed to hv_iommu_map_pgs_hcall I suppose.
>
hv_iommu_map_pages is a wrapper around hv_iommu_map_pgs, while
hv_iommu_unmap_pages is a wrapper around the corresponding hypercall.
That's the inconsistency I meant.
> > 2. hv_iommu_map_pgs doesn't print status in case of error.
>
> it does:
> hv_status_err(status, "\n"); <==============
It does not: the hv_status_err(status, "\n") call you quoted is in
hv_iommu_unmap_pages, not in hv_iommu_map_pgs, which returns the raw
hypercall status without logging it.
>
>
> > It would be much better to keep this code consistent.
> >
> > > + return unmapped;
> > > +}
> > > +
> > > +static phys_addr_t hv_iommu_iova_to_phys(struct iommu_domain *immdom,
> > > + dma_addr_t iova)
> > > +{
> > > + u64 paddr = 0;
> > > + unsigned long flags;
> > > + struct hv_iommu_mapping *mapping;
> > > + struct interval_tree_node *node;
> > > + struct hv_domain *hvdom = to_hv_domain(immdom);
> > > +
> > > + spin_lock_irqsave(&hvdom->mappings_lock, flags);
> > > + node = interval_tree_iter_first(&hvdom->mappings_tree, iova, iova);
> > > + if (node) {
> > > + mapping = container_of(node, struct hv_iommu_mapping, iova);
> > > + paddr = mapping->paddr + (iova - mapping->iova.start);
> > > + }
> > > + spin_unlock_irqrestore(&hvdom->mappings_lock, flags);
> > > +
> > > + return paddr;
> > > +}
> > > +
> > > +/*
> > > + * Currently, hypervisor does not provide list of devices it is using
> > > + * dynamically. So use this to allow users to manually specify devices that
> > > + * should be skipped. (eg. hypervisor debugger using some network device).
> > > + */
> > > +static struct iommu_device *hv_iommu_probe_device(struct device *dev)
> > > +{
> > > + if (!dev_is_pci(dev))
> > > + return ERR_PTR(-ENODEV);
> > > +
> > > + if (pci_devs_to_skip && *pci_devs_to_skip) {
> > > + int rc, pos = 0;
> > > + int parsed;
> > > + int segment, bus, slot, func;
> > > + struct pci_dev *pdev = to_pci_dev(dev);
> > > +
> > > + do {
> > > + parsed = 0;
> > > +
> > > + rc = sscanf(pci_devs_to_skip + pos, " (%x:%x:%x.%x) %n",
> > > + &segment, &bus, &slot, &func, &parsed);
> > > + if (rc)
> > > + break;
> > > + if (parsed <= 0)
> > > + break;
> > > +
> > > + if (pci_domain_nr(pdev->bus) == segment &&
> > > + pdev->bus->number == bus &&
> > > + PCI_SLOT(pdev->devfn) == slot &&
> > > + PCI_FUNC(pdev->devfn) == func) {
> > > +
> > > + dev_info(dev, "skipped by Hyper-V IOMMU\n");
> > > + return ERR_PTR(-ENODEV);
> > > + }
> > > + pos += parsed;
> > > +
> > > + } while (pci_devs_to_skip[pos]);
> > > + }
> > > +
> > > + /* Device will be explicitly attached to the default domain, so no need
> > > + * to do dev_iommu_priv_set() here.
> > > + */
> > > +
> > > + return &hv_virt_iommu;
> > > +}
> > > +
> > > +static void hv_iommu_probe_finalize(struct device *dev)
> > > +{
> > > + struct iommu_domain *immdom = iommu_get_domain_for_dev(dev);
> > > +
> > > + if (immdom && immdom->type == IOMMU_DOMAIN_DMA)
> > > + iommu_setup_dma_ops(dev);
> > > + else
> > > + set_dma_ops(dev, NULL);
> > > +}
> > > +
> > > +static void hv_iommu_release_device(struct device *dev)
> > > +{
> > > + struct hv_domain *hvdom = dev_iommu_priv_get(dev);
> > > +
> > > + /* Need to detach device from device domain if necessary. */
> > > + if (hvdom)
> > > + hv_iommu_detach_dev(&hvdom->iommu_dom, dev);
> > > +
> > > + dev_iommu_priv_set(dev, NULL);
> > > + set_dma_ops(dev, NULL);
> > > +}
> > > +
> > > +static struct iommu_group *hv_iommu_device_group(struct device *dev)
> > > +{
> > > + if (dev_is_pci(dev))
> > > + return pci_device_group(dev);
> > > + else
> > > + return generic_device_group(dev);
> > > +}
> > > +
> > > +static int hv_iommu_def_domain_type(struct device *dev)
> > > +{
> > > + /* The hypervisor always creates this by default during boot */
> > > + return IOMMU_DOMAIN_IDENTITY;
> > > +}
> > > +
> > > +static struct iommu_ops hv_iommu_ops = {
> > > + .capable = hv_iommu_capable,
> > > + .domain_alloc_identity = hv_iommu_domain_alloc_identity,
> > > + .domain_alloc_paging = hv_iommu_domain_alloc_paging,
> > > + .probe_device = hv_iommu_probe_device,
> > > + .probe_finalize = hv_iommu_probe_finalize,
> > > + .release_device = hv_iommu_release_device,
> > > + .def_domain_type = hv_iommu_def_domain_type,
> > > + .device_group = hv_iommu_device_group,
> > > + .default_domain_ops = &(const struct iommu_domain_ops) {
> > > + .attach_dev = hv_iommu_attach_dev,
> > > + .map_pages = hv_iommu_map_pages,
> > > + .unmap_pages = hv_iommu_unmap_pages,
> > > + .iova_to_phys = hv_iommu_iova_to_phys,
> > > + .free = hv_iommu_domain_free,
> > > + },
> > > + .owner = THIS_MODULE,
> > > +};
> > > +
> > > +static void __init hv_initialize_special_domains(void)
> > > +{
> > > + hv_def_identity_dom.iommu_dom.geometry = default_geometry;
> > > + hv_def_identity_dom.domid_num = HV_DEVICE_DOMAIN_ID_S2_DEFAULT; /* 0 */
> >
> > hv_def_identity_dom is a static global variable.
> > Why not initialize hv_def_identity_dom upon definition instead of
> > introducing a new function?
>
> Originally, it was a function. I changed it to static initialization, but
> during the 6.6 review I changed it back to a function. I can't remember
> why, but it is pretty harmless. We may add more domains, for example a
> null domain, to the initialization in future.
>
> > > +}
> > > +
> > > +static int __init hv_iommu_init(void)
> > > +{
> > > + int ret;
> > > + struct iommu_device *iommup = &hv_virt_iommu;
> > > +
> > > + if (!hv_is_hyperv_initialized())
> > > + return -ENODEV;
> > > +
> > > + ret = iommu_device_sysfs_add(iommup, NULL, NULL, "%s", "hyperv-iommu");
> > > + if (ret) {
> > > + pr_err("Hyper-V: iommu_device_sysfs_add failed: %d\n", ret);
> > > + return ret;
> > > + }
> > > +
> > > + /* This must come before iommu_device_register because the latter calls
> > > + * into the hooks.
> > > + */
> > > + hv_initialize_special_domains();
> > > +
> > > + ret = iommu_device_register(iommup, &hv_iommu_ops, NULL);
> >
> > It looks weird to initialize an object after creating sysfs entries for
> > it.
> > It should be the other way around.
>
> Not sure it should be; it is much easier to remove the sysfs entry than to
> do other cleanup, even though iommu_device_unregister is there. I am sure
> we'll add more code here, which is probably why it was originally done
> this way.
>
Sysfs provides user space access to kernel objects. If the object is not
initialized, the result is not only a useless sysfs entry but also a
potential kernel panic if user space accesses the entry before the object
is initialized.
Thanks,
Stanislav
> Thanks,
> -Mukesh
>
>
> ... snip........