Message-ID: <aXeO7wh7bpacJ1Sk@skinsburskii.localdomain>
Date: Mon, 26 Jan 2026 07:57:35 -0800
From: Stanislav Kinsburskii <skinsburskii@...ux.microsoft.com>
To: Mukesh R <mrathor@...ux.microsoft.com>
Cc: linux-kernel@...r.kernel.org, linux-hyperv@...r.kernel.org,
linux-arm-kernel@...ts.infradead.org, iommu@...ts.linux.dev,
linux-pci@...r.kernel.org, linux-arch@...r.kernel.org,
kys@...rosoft.com, haiyangz@...rosoft.com, wei.liu@...nel.org,
decui@...rosoft.com, longli@...rosoft.com, catalin.marinas@....com,
will@...nel.org, tglx@...utronix.de, mingo@...hat.com, bp@...en8.de,
dave.hansen@...ux.intel.com, hpa@...or.com, joro@...tes.org,
lpieralisi@...nel.org, kwilczynski@...nel.org, mani@...nel.org,
robh@...nel.org, bhelgaas@...gle.com, arnd@...db.de,
nunodasneves@...ux.microsoft.com, mhklinux@...look.com,
romank@...ux.microsoft.com
Subject: Re: [PATCH v0 12/15] x86/hyperv: Implement hyperv virtual iommu
On Fri, Jan 23, 2026 at 05:26:19PM -0800, Mukesh R wrote:
> On 1/20/26 16:12, Stanislav Kinsburskii wrote:
> > On Mon, Jan 19, 2026 at 10:42:27PM -0800, Mukesh R wrote:
> > > From: Mukesh Rathor <mrathor@...ux.microsoft.com>
> > >
> > > Add a new file to implement management of device domains, mapping and
> > > unmapping of iommu memory, and other iommu_ops to fit within the VFIO
> > > framework for PCI passthru on Hyper-V running Linux as root or L1VH
> > > parent. This also implements a direct attach mechanism for PCI passthru,
> > > which also works within the VFIO framework.
> > >
> > > At a high level, during boot the hypervisor creates a default identity
> > > domain and attaches all devices to it. This maps nicely to the Linux iommu
> > > subsystem's IOMMU_DOMAIN_IDENTITY domain. As a result, Linux does not
> > > need to explicitly ask Hyper-V to attach devices and do maps/unmaps
> > > during boot. As mentioned previously, Hyper-V supports two ways to do
> > > PCI passthru:
> > >
> > > 1. Device Domain: root must create a device domain in the hypervisor,
> > > and do map/unmap hypercalls for mapping and unmapping guest RAM.
> > > All hypervisor communications use device id of type PCI for
> > > identifying and referencing the device.
> > >
> > > 2. Direct Attach: the hypervisor will simply use the guest's HW
> > > page table for mappings, thus the host need not do map/unmap
> > > device memory hypercalls. As such, direct attach passthru setup
> > > during guest boot is extremely fast. A direct attached device
> > > must be referenced via logical device id and not via the PCI
> > > device id.
> > >
> > > At present, the L1VH root/parent only supports direct attach. Direct
> > > attach is also the default in non-L1VH cases because the current device
> > > domain implementation has significant performance issues for guests with
> > > larger RAM (say, more than 8GB), which unfortunately cannot be addressed
> > > in the short term.
> > >
> >
> > <snip>
> >
<snip>
> > > +static void hv_iommu_detach_dev(struct iommu_domain *immdom, struct device *dev)
> > > +{
> > > + struct pci_dev *pdev;
> > > + struct hv_domain *hvdom = to_hv_domain(immdom);
> > > +
> > > + /* See the attach function, only PCI devices for now */
> > > + if (!dev_is_pci(dev))
> > > + return;
> > > +
> > > + if (hvdom->num_attchd == 0)
> > > + pr_warn("Hyper-V: num_attchd is zero (%s)\n", dev_name(dev));
> > > +
> > > + pdev = to_pci_dev(dev);
> > > +
> > > + if (hvdom->attached_dom) {
> > > + hv_iommu_det_dev_from_guest(hvdom, pdev);
> > > +
> > > + /* Do not reset attached_dom, hv_iommu_unmap_pages happens
> > > + * next.
> > > + */
> > > + } else {
> > > + hv_iommu_det_dev_from_dom(hvdom, pdev);
> > > + }
> > > +
> > > + hvdom->num_attchd--;
> >
> > Shouldn't this be modified iff the detach succeeded?
>
> We still want to free the domain and not let it get stuck. The purpose
> is more to make sure detach was called before the domain is freed.
>
How can one debug subsequent errors if num_attchd is decremented
unconditionally? In reality the device is left attached, but the related
kernel metadata is gone.
> > > +}
> > > +
> > > +static int hv_iommu_add_tree_mapping(struct hv_domain *hvdom,
> > > + unsigned long iova, phys_addr_t paddr,
> > > + size_t size, u32 flags)
> > > +{
> > > + unsigned long irqflags;
> > > + struct hv_iommu_mapping *mapping;
> > > +
> > > + mapping = kzalloc(sizeof(*mapping), GFP_ATOMIC);
> > > + if (!mapping)
> > > + return -ENOMEM;
> > > +
> > > + mapping->paddr = paddr;
> > > + mapping->iova.start = iova;
> > > + mapping->iova.last = iova + size - 1;
> > > + mapping->flags = flags;
> > > +
> > > + spin_lock_irqsave(&hvdom->mappings_lock, irqflags);
> > > + interval_tree_insert(&mapping->iova, &hvdom->mappings_tree);
> > > + spin_unlock_irqrestore(&hvdom->mappings_lock, irqflags);
> > > +
> > > + return 0;
> > > +}
> > > +
> > > +static size_t hv_iommu_del_tree_mappings(struct hv_domain *hvdom,
> > > + unsigned long iova, size_t size)
> > > +{
> > > + unsigned long flags;
> > > + size_t unmapped = 0;
> > > + unsigned long last = iova + size - 1;
> > > + struct hv_iommu_mapping *mapping = NULL;
> > > + struct interval_tree_node *node, *next;
> > > +
> > > + spin_lock_irqsave(&hvdom->mappings_lock, flags);
> > > + next = interval_tree_iter_first(&hvdom->mappings_tree, iova, last);
> > > + while (next) {
> > > + node = next;
> > > + mapping = container_of(node, struct hv_iommu_mapping, iova);
> > > + next = interval_tree_iter_next(node, iova, last);
> > > +
> > > + /* Trying to split a mapping? Not supported for now. */
> > > + if (mapping->iova.start < iova)
> > > + break;
> > > +
> > > + unmapped += mapping->iova.last - mapping->iova.start + 1;
> > > +
> > > + interval_tree_remove(node, &hvdom->mappings_tree);
> > > + kfree(mapping);
> > > + }
> > > + spin_unlock_irqrestore(&hvdom->mappings_lock, flags);
> > > +
> > > + return unmapped;
> > > +}
> > > +
> > > +/* Return: must return exact status from the hypercall without changes */
> > > +static u64 hv_iommu_map_pgs(struct hv_domain *hvdom,
> > > + unsigned long iova, phys_addr_t paddr,
> > > + unsigned long npages, u32 map_flags)
> > > +{
> > > + u64 status;
> > > + int i;
> > > + struct hv_input_map_device_gpa_pages *input;
> > > + unsigned long flags, pfn = paddr >> HV_HYP_PAGE_SHIFT;
> > > +
> > > + local_irq_save(flags);
> > > + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> > > + memset(input, 0, sizeof(*input));
> > > +
> > > + input->device_domain.partition_id = HV_PARTITION_ID_SELF;
> > > + input->device_domain.domain_id.type = HV_DEVICE_DOMAIN_TYPE_S2;
> > > + input->device_domain.domain_id.id = hvdom->domid_num;
> > > + input->map_flags = map_flags;
> > > + input->target_device_va_base = iova;
> > > +
> > > + pfn = paddr >> HV_HYP_PAGE_SHIFT;
> > > + for (i = 0; i < npages; i++, pfn++)
> > > + input->gpa_page_list[i] = pfn;
> > > +
> > > + status = hv_do_rep_hypercall(HVCALL_MAP_DEVICE_GPA_PAGES, npages, 0,
> > > + input, NULL);
> > > +
> > > + local_irq_restore(flags);
> > > + return status;
> > > +}
> > > +
> > > +/*
> > > + * The core VFIO code loops over memory ranges calling this function with
> > > + * the largest size from HV_IOMMU_PGSIZES. cond_resched() is in vfio_iommu_map.
> > > + */
> > > +static int hv_iommu_map_pages(struct iommu_domain *immdom, ulong iova,
> > > + phys_addr_t paddr, size_t pgsize, size_t pgcount,
> > > + int prot, gfp_t gfp, size_t *mapped)
> > > +{
> > > + u32 map_flags;
> > > + int ret;
> > > + u64 status;
> > > + unsigned long npages, done = 0;
> > > + struct hv_domain *hvdom = to_hv_domain(immdom);
> > > + size_t size = pgsize * pgcount;
> > > +
> > > + map_flags = HV_MAP_GPA_READABLE; /* required */
> > > + map_flags |= prot & IOMMU_WRITE ? HV_MAP_GPA_WRITABLE : 0;
> > > +
> > > + ret = hv_iommu_add_tree_mapping(hvdom, iova, paddr, size, map_flags);
> > > + if (ret)
> > > + return ret;
> > > +
> > > + if (hvdom->attached_dom) {
> > > + *mapped = size;
> > > + return 0;
> > > + }
> > > +
> > > + npages = size >> HV_HYP_PAGE_SHIFT;
> > > + while (done < npages) {
> > > + ulong completed, remain = npages - done;
> > > +
> > > + status = hv_iommu_map_pgs(hvdom, iova, paddr, remain,
> > > + map_flags);
> > > +
> > > + completed = hv_repcomp(status);
> > > + done = done + completed;
> > > + iova = iova + (completed << HV_HYP_PAGE_SHIFT);
> > > + paddr = paddr + (completed << HV_HYP_PAGE_SHIFT);
> > > +
> > > + if (hv_result(status) == HV_STATUS_INSUFFICIENT_MEMORY) {
> > > + ret = hv_call_deposit_pages(NUMA_NO_NODE,
> > > + hv_current_partition_id,
> > > + 256);
> > > + if (ret)
> > > + break;
> > > + }
> > > + if (!hv_result_success(status))
> > > + break;
> > > + }
> > > +
> > > + if (!hv_result_success(status)) {
> > > + size_t done_size = done << HV_HYP_PAGE_SHIFT;
> > > +
> > > + hv_status_err(status, "pgs:%lx/%lx iova:%lx\n",
> > > + done, npages, iova);
> > > + /*
> > > + * lookup tree has all mappings [0 - size-1]. Below unmap will
> > > + * only remove from [0 - done], we need to remove second chunk
> > > + * [done+1 - size-1].
> > > + */
> > > + hv_iommu_del_tree_mappings(hvdom, iova, size - done_size);
> > > + hv_iommu_unmap_pages(immdom, iova - done_size, pgsize,
> > > + done, NULL);
> > > + if (mapped)
> > > + *mapped = 0;
> > > + } else
> > > + if (mapped)
> > > + *mapped = size;
> > > +
> > > + return hv_result_to_errno(status);
> > > +}
> > > +
> > > +static size_t hv_iommu_unmap_pages(struct iommu_domain *immdom, ulong iova,
> > > + size_t pgsize, size_t pgcount,
> > > + struct iommu_iotlb_gather *gather)
> > > +{
> > > + unsigned long flags, npages;
> > > + struct hv_input_unmap_device_gpa_pages *input;
> > > + u64 status;
> > > + struct hv_domain *hvdom = to_hv_domain(immdom);
> > > + size_t unmapped, size = pgsize * pgcount;
> > > +
> > > + unmapped = hv_iommu_del_tree_mappings(hvdom, iova, size);
> > > + if (unmapped < size)
> > > + pr_err("%s: could not delete all mappings (%lx:%lx/%lx)\n",
> > > + __func__, iova, unmapped, size);
> > > +
> > > + if (hvdom->attached_dom)
> > > + return size;
> > > +
> > > + npages = size >> HV_HYP_PAGE_SHIFT;
> > > +
> > > + local_irq_save(flags);
> > > + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> > > + memset(input, 0, sizeof(*input));
> > > +
> > > + input->device_domain.partition_id = HV_PARTITION_ID_SELF;
> > > + input->device_domain.domain_id.type = HV_DEVICE_DOMAIN_TYPE_S2;
> > > + input->device_domain.domain_id.id = hvdom->domid_num;
> > > + input->target_device_va_base = iova;
> > > +
> > > + status = hv_do_rep_hypercall(HVCALL_UNMAP_DEVICE_GPA_PAGES, npages,
> > > + 0, input, NULL);
> > > + local_irq_restore(flags);
> > > +
> > > + if (!hv_result_success(status))
> > > + hv_status_err(status, "\n");
> > > +
> >
> > There is some inconsistency in namings and behaviour of paired
> > functions:
> > 1. The pair of hv_iommu_unmap_pages is called hv_iommu_map_pgs
>
> The pair of hv_iommu_unmap_pages is hv_iommu_map_pages right above.
> hv_iommu_map_pgs could be renamed to hv_iommu_map_pgs_hcall I suppose.
>
hv_iommu_map_pages is a wrapper around hv_iommu_map_pgs, while
hv_iommu_unmap_pages is a wrapper around the corresponding hypercall.
That's the inconsistency I meant.
> > 2. hv_iommu_map_pgs doesn't print status in case of error.
>
> it does:
> hv_status_err(status, "\n"); <==============
It does not: the hv_status_err(status, "\n") call you quoted is in
hv_iommu_unmap_pages, not in hv_iommu_map_pgs, which returns the raw
hypercall status without logging it.
>
>
> > It would be much better to keep this code consistent.
> >
> > > + return unmapped;
> > > +}
> > > +
> > > +static phys_addr_t hv_iommu_iova_to_phys(struct iommu_domain *immdom,
> > > + dma_addr_t iova)
> > > +{
> > > + u64 paddr = 0;
> > > + unsigned long flags;
> > > + struct hv_iommu_mapping *mapping;
> > > + struct interval_tree_node *node;
> > > + struct hv_domain *hvdom = to_hv_domain(immdom);
> > > +
> > > + spin_lock_irqsave(&hvdom->mappings_lock, flags);
> > > + node = interval_tree_iter_first(&hvdom->mappings_tree, iova, iova);
> > > + if (node) {
> > > + mapping = container_of(node, struct hv_iommu_mapping, iova);
> > > + paddr = mapping->paddr + (iova - mapping->iova.start);
> > > + }
> > > + spin_unlock_irqrestore(&hvdom->mappings_lock, flags);
> > > +
> > > + return paddr;
> > > +}
> > > +
> > > +/*
> > > + * Currently, hypervisor does not provide list of devices it is using
> > > + * dynamically. So use this to allow users to manually specify devices that
> > > + * should be skipped. (eg. hypervisor debugger using some network device).
> > > + */
> > > +static struct iommu_device *hv_iommu_probe_device(struct device *dev)
> > > +{
> > > + if (!dev_is_pci(dev))
> > > + return ERR_PTR(-ENODEV);
> > > +
> > > + if (pci_devs_to_skip && *pci_devs_to_skip) {
> > > + int rc, pos = 0;
> > > + int parsed;
> > > + int segment, bus, slot, func;
> > > + struct pci_dev *pdev = to_pci_dev(dev);
> > > +
> > > + do {
> > > + parsed = 0;
> > > +
> > > + rc = sscanf(pci_devs_to_skip + pos, " (%x:%x:%x.%x) %n",
> > > + &segment, &bus, &slot, &func, &parsed);
> > > + if (rc)
> > > + break;
> > > + if (parsed <= 0)
> > > + break;
> > > +
> > > + if (pci_domain_nr(pdev->bus) == segment &&
> > > + pdev->bus->number == bus &&
> > > + PCI_SLOT(pdev->devfn) == slot &&
> > > + PCI_FUNC(pdev->devfn) == func) {
> > > +
> > > + dev_info(dev, "skipped by Hyper-V IOMMU\n");
> > > + return ERR_PTR(-ENODEV);
> > > + }
> > > + pos += parsed;
> > > +
> > > + } while (pci_devs_to_skip[pos]);
> > > + }
> > > +
> > > + /* Device will be explicitly attached to the default domain, so no need
> > > + * to do dev_iommu_priv_set() here.
> > > + */
> > > +
> > > + return &hv_virt_iommu;
> > > +}
> > > +
> > > +static void hv_iommu_probe_finalize(struct device *dev)
> > > +{
> > > + struct iommu_domain *immdom = iommu_get_domain_for_dev(dev);
> > > +
> > > + if (immdom && immdom->type == IOMMU_DOMAIN_DMA)
> > > + iommu_setup_dma_ops(dev);
> > > + else
> > > + set_dma_ops(dev, NULL);
> > > +}
> > > +
> > > +static void hv_iommu_release_device(struct device *dev)
> > > +{
> > > + struct hv_domain *hvdom = dev_iommu_priv_get(dev);
> > > +
> > > + /* Need to detach device from device domain if necessary. */
> > > + if (hvdom)
> > > + hv_iommu_detach_dev(&hvdom->iommu_dom, dev);
> > > +
> > > + dev_iommu_priv_set(dev, NULL);
> > > + set_dma_ops(dev, NULL);
> > > +}
> > > +
> > > +static struct iommu_group *hv_iommu_device_group(struct device *dev)
> > > +{
> > > + if (dev_is_pci(dev))
> > > + return pci_device_group(dev);
> > > + else
> > > + return generic_device_group(dev);
> > > +}
> > > +
> > > +static int hv_iommu_def_domain_type(struct device *dev)
> > > +{
> > > + /* The hypervisor always creates this by default during boot */
> > > + return IOMMU_DOMAIN_IDENTITY;
> > > +}
> > > +
> > > +static struct iommu_ops hv_iommu_ops = {
> > > + .capable = hv_iommu_capable,
> > > + .domain_alloc_identity = hv_iommu_domain_alloc_identity,
> > > + .domain_alloc_paging = hv_iommu_domain_alloc_paging,
> > > + .probe_device = hv_iommu_probe_device,
> > > + .probe_finalize = hv_iommu_probe_finalize,
> > > + .release_device = hv_iommu_release_device,
> > > + .def_domain_type = hv_iommu_def_domain_type,
> > > + .device_group = hv_iommu_device_group,
> > > + .default_domain_ops = &(const struct iommu_domain_ops) {
> > > + .attach_dev = hv_iommu_attach_dev,
> > > + .map_pages = hv_iommu_map_pages,
> > > + .unmap_pages = hv_iommu_unmap_pages,
> > > + .iova_to_phys = hv_iommu_iova_to_phys,
> > > + .free = hv_iommu_domain_free,
> > > + },
> > > + .owner = THIS_MODULE,
> > > +};
> > > +
> > > +static void __init hv_initialize_special_domains(void)
> > > +{
> > > + hv_def_identity_dom.iommu_dom.geometry = default_geometry;
> > > + hv_def_identity_dom.domid_num = HV_DEVICE_DOMAIN_ID_S2_DEFAULT; /* 0 */
> >
> > hv_def_identity_dom is a static global variable.
> > Why not initialize hv_def_identity_dom upon definition instead of
> > introducing a new function?
>
> Originally, it was a function. I changed it to static initialization, but
> during the 6.6 review I changed it back to a function. I can't remember
> why, but it is pretty harmless. We may add more domains, for example a
> null domain, to the initialization in future.
>
> > > +}
> > > +
> > > +static int __init hv_iommu_init(void)
> > > +{
> > > + int ret;
> > > + struct iommu_device *iommup = &hv_virt_iommu;
> > > +
> > > + if (!hv_is_hyperv_initialized())
> > > + return -ENODEV;
> > > +
> > > + ret = iommu_device_sysfs_add(iommup, NULL, NULL, "%s", "hyperv-iommu");
> > > + if (ret) {
> > > + pr_err("Hyper-V: iommu_device_sysfs_add failed: %d\n", ret);
> > > + return ret;
> > > + }
> > > +
> > > + /* This must come before iommu_device_register because the latter calls
> > > + * into the hooks.
> > > + */
> > > + hv_initialize_special_domains();
> > > +
> > > + ret = iommu_device_register(iommup, &hv_iommu_ops, NULL);
> >
> > It looks weird to initialize an object after creating sysfs entries for
> > it.
> > It should be the other way around.
>
> Not sure it should be; it is much easier to remove the sysfs entry than to
> do other cleanup, even though iommu_device_unregister is there. I am sure
> we'll add more code here, which is probably why it was originally done
> this way.
>
Sysfs provides user space access to kernel objects. If the object is not
initialized, the result is not only a useless sysfs entry but also a
potential kernel panic if user space accesses the entry before the object
is initialized.
Thanks,
Stanislav
> Thanks,
> -Mukesh
>
>
> ... snip........