Message-ID: <dws34g6znmam7eabwetg722b4wgf2wxufcqxqphhbqlryx23mb@we5utwanawe2>
Date: Tue, 13 Jan 2026 00:56:07 +0800
From: Yu Zhang <zhangyu1@...ux.microsoft.com>
To: Michael Kelley <mhklinux@...look.com>
Cc: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, 
	"linux-hyperv@...r.kernel.org" <linux-hyperv@...r.kernel.org>, "iommu@...ts.linux.dev" <iommu@...ts.linux.dev>, 
	"linux-pci@...r.kernel.org" <linux-pci@...r.kernel.org>, "kys@...rosoft.com" <kys@...rosoft.com>, 
	"haiyangz@...rosoft.com" <haiyangz@...rosoft.com>, "wei.liu@...nel.org" <wei.liu@...nel.org>, 
	"decui@...rosoft.com" <decui@...rosoft.com>, "lpieralisi@...nel.org" <lpieralisi@...nel.org>, 
	"kwilczynski@...nel.org" <kwilczynski@...nel.org>, "mani@...nel.org" <mani@...nel.org>, 
	"robh@...nel.org" <robh@...nel.org>, "bhelgaas@...gle.com" <bhelgaas@...gle.com>, 
	"arnd@...db.de" <arnd@...db.de>, "joro@...tes.org" <joro@...tes.org>, 
	"will@...nel.org" <will@...nel.org>, "robin.murphy@....com" <robin.murphy@....com>, 
	"easwar.hariharan@...ux.microsoft.com" <easwar.hariharan@...ux.microsoft.com>, "jacob.pan@...ux.microsoft.com" <jacob.pan@...ux.microsoft.com>, 
	"nunodasneves@...ux.microsoft.com" <nunodasneves@...ux.microsoft.com>, "mrathor@...ux.microsoft.com" <mrathor@...ux.microsoft.com>, 
	"peterz@...radead.org" <peterz@...radead.org>, "linux-arch@...r.kernel.org" <linux-arch@...r.kernel.org>
Subject: Re: [RFC v1 5/5] iommu/hyperv: Add para-virtualized IOMMU support
 for Hyper-V guest

On Thu, Jan 08, 2026 at 06:48:59PM +0000, Michael Kelley wrote:
> From: Yu Zhang <zhangyu1@...ux.microsoft.com> Sent: Monday, December 8, 2025 9:11 PM

<snip>
Thank you so much, Michael, for the thorough review!

I've snipped the comments I fully agree with and will address in the
next version. Actually, I have to admit I agree with your remaining
comments below as well. :)

> > +struct hv_iommu_dev *hv_iommu_device;
> > +static struct hv_iommu_domain hv_identity_domain;
> > +static struct hv_iommu_domain hv_blocking_domain;
> 
> Why is hv_iommu_device allocated dynamically while the two
> domains are allocated statically? Seems like the approach could
> be consistent, though maybe there's some reason I'm missing.
> 

On second thought, `hv_identity_domain` and `hv_blocking_domain` should
likely be allocated dynamically as well, consistent with `hv_iommu_device`.
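
Something along these lines is what I have in mind (just a sketch, not
the actual next version; error handling and the rest of the domain setup
are simplified here):

static struct hv_iommu_domain *hv_identity_domain;
static struct hv_iommu_domain *hv_blocking_domain;

static int __init hv_iommu_alloc_special_domains(void)
{
	hv_identity_domain = kzalloc(sizeof(*hv_identity_domain), GFP_KERNEL);
	if (!hv_identity_domain)
		return -ENOMEM;

	hv_blocking_domain = kzalloc(sizeof(*hv_blocking_domain), GFP_KERNEL);
	if (!hv_blocking_domain) {
		kfree(hv_identity_domain);
		return -ENOMEM;
	}

	/* Mark the special domain types; the rest is set up as today. */
	hv_identity_domain->domain.type = IOMMU_DOMAIN_IDENTITY;
	hv_blocking_domain->domain.type = IOMMU_DOMAIN_BLOCKED;

	return 0;
}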

<snip>
> > +static int hv_iommu_get_logical_device_property(struct device *dev,
> > +					enum hv_logical_device_property_code code,
> > +					struct hv_output_get_logical_device_property *property)
> > +{
> > +	u64 status;
> > +	unsigned long flags;
> > +	struct hv_input_get_logical_device_property *input;
> > +	struct hv_output_get_logical_device_property *output;
> > +
> > +	local_irq_save(flags);
> > +
> > +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> > +	output = *this_cpu_ptr(hyperv_pcpu_output_arg);
> > +	memset(input, 0, sizeof(*input));
> > +	memset(output, 0, sizeof(*output));
> 
> General practice is to *not* zero the output area prior to a hypercall. The hypervisor
> should be correctly setting all the output bits. There are a couple of cases in the new
> MSHV code where the output is zero'ed, but I'm planning to submit a patch to
> remove those so that hypercall call sites that have output are consistent across the
> code base. Of course, it's possible to have a Hyper-V bug where it doesn't do the
> right thing, and zero'ing the output could be done as a workaround. But such cases
> should be explicitly known with code comments indicating the reason for the
> zero'ing.
> 
> Same applies in hv_iommu_detect().
> 

Thanks for the information! Just to clarify: this is only because Hyper-V
is supposed to zero the output page, and the memset is still needed for
the input page. Am I correct?
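
If so, I would change the pattern to zero only the input page, roughly
like the sketch below (the hypercall code name and the property_code
field are assumed from the struct names in this patch, not final code):

	local_irq_save(flags);

	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
	output = *this_cpu_ptr(hyperv_pcpu_output_arg);
	memset(input, 0, sizeof(*input));
	/* No memset of output: the hypervisor fills it on success. */

	input->property_code = code;

	/* Hypercall name assumed from the patch's struct names. */
	status = hv_do_hypercall(HVCALL_GET_LOGICAL_DEVICE_PROPERTY,
				 input, output);
	if (hv_result_success(status))
		*property = *output;

	local_irq_restore(flags);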

<snip>

> > +static void hv_iommu_shutdown(void)
> > +{
> > +	iommu_device_sysfs_remove(&hv_iommu_device->iommu);
> > +
> > +	kfree(hv_iommu_device);
> > +}
> > +
> > +static struct syscore_ops hv_iommu_syscore_ops = {
> > +	.shutdown = hv_iommu_shutdown,
> > +};
> 
> Why is a shutdown needed at all?  hv_iommu_shutdown() doesn't do anything
> that's really needed, since sysfs entries are transient, and freeing memory isn't
> relevant for a shutdown.
> 

As for iommu_device_sysfs_remove(), I guess it is not necessary, and I
will need to do some homework to better understand sysfs. :)
Originally, we wanted a shutdown routine to trigger some hypercall,
so that Hyper-V would disable the DMA translation, e.g., during the VM
reboot process.
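
In that case something like this minimal sketch is what we had in mind
(the hv_iommu_disable_translation() helper is made up here just to show
the intent; it would wrap whatever hypercall Hyper-V provides for this):

static void hv_iommu_shutdown(void)
{
	/* Ask the hypervisor to stop DMA translation for this guest. */
	hv_iommu_disable_translation(hv_iommu_device);
}

static struct syscore_ops hv_iommu_syscore_ops = {
	.shutdown = hv_iommu_shutdown,
};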

<snip>

> > +device_initcall(hv_iommu_init);
> 
> I'm concerned about the timing of this initialization. VMBus is initialized with
> subsys_initcall(), which is initcall level 4 while device_initcall() is initcall level 6.
> So VMBus initialization happens quite a bit earlier, and the hypervisor starts
> offering devices to the guest, including PCI pass-thru devices, before the
> IOMMU initialization starts. I cobbled together a way to make this IOMMU code
> run in an Azure VM using the identity domain. The VM has an NVMe OS disk,
> two NVMe data disks, and a MANA NIC. The NVMe devices were offered, and
> completed hv_pci_probe() before this IOMMU initialization was started. When
> IOMMU initialization did run, it went back and found the NVMe devices. But
> I'm unsure if that's OK because my hacked together environment obviously
> couldn't do real IOMMU mapping. It appears that the NVMe device driver
> didn't start its initialization until after the IOMMU driver was set up, which
> would probably make everything OK. But that might be just timing luck, or
> maybe there's something that affirmatively prevents the native PCI driver
> (like NVMe) from getting started until after all the initcalls have finished.
> 

This is yet another immature attempt on my part to call hv_iommu_init()
from an arch-independent path, and I do not think using device_initcall()
is harmless. This patch set was tested with an assigned Intel DSA device,
and the DMA tests succeeded without any error. But that is not enough to
justify using device_initcall(): I rebuilt the idxd driver as a kernel
builtin and realized that, just as you said, both hv_pci_probe() and
idxd_pci_probe() were triggered before hv_iommu_init(), and when the
pvIOMMU tried to probe the endpoint device, a warning was printed:

[    3.609697] idxd 13d7:00:00.0: late IOMMU probe at driver bind, something fishy here!
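
One direction I'm considering is simply raising the initcall level, as in
the sketch below. It is only a sketch, though: an explicit linkage to the
VMBus init path, as you suggest below, would be a stronger guarantee than
initcall ordering alone.

	-device_initcall(hv_iommu_init);
	+/* sketch: run at the VMBus initcall level, before device probes */
	+subsys_initcall(hv_iommu_init);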

> I'm planning to look at this further to see if there's a way for a PCI driver
> to try initializing a pass-thru device *before* this IOMMU driver has initialized.
> If so, a different way to do the IOMMU initialization will be needed that is
> linked to VMBus initialization so things can't happen out-of-order. Establishing
> such a linkage is probably a good idea regardless.
> 
> FWIW, the Azure VM with the 3 NVMe devices and MANA, and operating with
> the identity IOMMU domain, all seemed to work fine! Got 4 IOMMU groups,
> and devices coming and going dynamically all worked correctly. When a device
> was removed, it was moved to the blocking domain, and then flushed before
> being finally removed. All good! I wish I had a way to test with an IOMMU
> paging domain that was doing real translation.
> 

Thank you, Michael! I really appreciate you running these extra experiments!

My tests on this DSA device also passed (using a paging domain), with no
DMA errors observed, regardless of whether its driver was built in or
loaded as a kernel module. But that doesn't make me confident about using
`device_initcall`; I believe your concern is valid. E.g., an endpoint
device might allocate a DMA address (using a raw GPA instead of a gIOVA)
before the pvIOMMU is initialized, and then use that address for DMA
later, after a paging domain is attached?
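
	/*
	 * Rough timeline of the scenario I'm worried about (an assumption,
	 * not something I have observed):
	 *
	 *   1. idxd_pci_probe() runs before hv_iommu_init(); dma_map_*()
	 *      goes through dma-direct, so the device is handed a raw GPA.
	 *   2. hv_iommu_init() runs, and a paging domain is attached later.
	 *   3. The device issues DMA with the address from step 1, which
	 *      is not a valid gIOVA under the new paging domain.
	 */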

> > diff --git a/drivers/iommu/hyperv/iommu.h b/drivers/iommu/hyperv/iommu.h
> > new file mode 100644
> > index 000000000000..c8657e791a6e
> > --- /dev/null
> > +++ b/drivers/iommu/hyperv/iommu.h
> > @@ -0,0 +1,53 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +
> > +/*
> > + * Hyper-V IOMMU driver.
> > + *
> > + * Copyright (C) 2024-2025, Microsoft, Inc.
> > + *
> > + */
> > +
> > +#ifndef _HYPERV_IOMMU_H
> > +#define _HYPERV_IOMMU_H
> > +
> > +struct hv_iommu_dev {
> > +	struct iommu_device iommu;
> > +	struct ida domain_ids;
> > +
> > +	/* Device configuration */
> > +	u8  max_iova_width;
> > +	u8  max_pasid_width;
> > +	u64 cap;
> > +	u64 pgsize_bitmap;
> > +
> > +	struct iommu_domain_geometry geometry;
> > +	u64 first_domain;
> > +	u64 last_domain;
> > +};
> > +
> > +struct hv_iommu_domain {
> > +	union {
> > +		struct iommu_domain    domain;
> > +		struct pt_iommu        pt_iommu;
> > +		struct pt_iommu_x86_64 pt_iommu_x86_64;
> > +	};
> > +	struct hv_iommu_dev *hv_iommu;
> > +	struct hv_input_device_domain device_domain;
> > +	u64		pgsize_bitmap;
> > +
> > +	spinlock_t lock; /* protects dev_list and TLB flushes */
> > +	/* List of devices in this DMA domain */
> 
> It appears that this list is really a list of endpoints (i.e., struct
> hv_iommu_endpoint), not devices (which I read to be struct
> hv_iommu_dev). 
> 
> But that said, what is the list used for?  I see code to add
> endpoints to the list, and to remove them, but the list is never
> walked by any code in this patch set. If there is an anticipated
> future use, it would be better to add the list as part of the code
> for that future use.
> 

Yes, we do not really need this list for this patch set. Thanks!

B.R.
Yu
