linux-kernel - Re: [PATCH v3 29/29] iommu/arm-smmu-v3-kvm: Add IOMMU ops

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250805175753.GY26511@ziepe.ca>
Date: Tue, 5 Aug 2025 14:57:53 -0300
From: Jason Gunthorpe <jgg@...pe.ca>
To: Mostafa Saleh <smostafa@...gle.com>
Cc: linux-kernel@...r.kernel.org, kvmarm@...ts.linux.dev,
	linux-arm-kernel@...ts.infradead.org, iommu@...ts.linux.dev,
	maz@...nel.org, oliver.upton@...ux.dev, joey.gouly@....com,
	suzuki.poulose@....com, yuzenghui@...wei.com,
	catalin.marinas@....com, will@...nel.org, robin.murphy@....com,
	jean-philippe@...aro.org, qperret@...gle.com, tabba@...gle.com,
	mark.rutland@....com, praan@...gle.com
Subject: Re: [PATCH v3 29/29] iommu/arm-smmu-v3-kvm: Add IOMMU ops

On Mon, Aug 04, 2025 at 02:41:31PM +0000, Mostafa Saleh wrote:
> > Are you saying the event queue is left behind for the kernel? How does
> > that work if it doesn't have access to the registers?
> 
> The evtq itself will be owned by the kernel, However, MMIO access would be
> trapped and emulated, here the PoC for part-2 of this series (as mentioned in
> the cover letter) This is very close to how nesting will work.
> https://android-kvm.googlesource.com/linux/+/refs/heads/for-upstream/pkvm-smmu-v3-part-2/drivers/iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c#744

Oh weird, but Ok.

> > In other words you have two cleanly seperate concerns here, an "pkvm
> > iommu subsystem" that lets pkvm control iommu HW - and the current
> > "iommu subsystem" that lets the kernel control iommu HW. The same
> > driver should not register to both.
> > 
> 
> I am not sure how that would work exactly, for example how would probe_device
> work, xlate... in a generic way?

Well, I think it is not so bad actually.

You just need to call iommu_device_register

	ret = iommu_device_register(&smmu->iommu, &arm_smmu_ops, dev);

Where 'dev' is always the smmu struct device, even if your current
probing driver is not the smmu device. That will capture all the iommu
activity the ACPI/DT says is linked to that dev.

>From there you just make a normal small iommu driver with no
connection back to the SMMUv3.

Eg you could spawn an aux device from smmuv3 to do this with a far
cleaner separation.

xlate is just calling iommu_fwspec_add_ids() it is one line.

> same for other ops. We can make some of these
> functions (hypercalls wrappers) in a separate file. 

What other ops are you worried about?

static struct iommu_ops arm_smmu_ops = {
	.identity_domain	= &arm_smmu_identity_domain,
	.blocked_domain		= &arm_smmu_blocked_domain,
	.release_device		= arm_smmu_release_device,
	.domain_alloc_paging_flags = arm_smmu_domain_alloc_paging_flags,

	^^ Those are all your new domain type that does the hypercalls

	.probe_device		= arm_smmu_probe_device,
	.get_resv_regions	= arm_smmu_get_resv_regions,

  	^^ These are pretty empty

	.device_group		= arm_smmu_device_group,
	.of_xlate		= arm_smmu_of_xlate,

        ^^ Common code

Don't need these:

	.capable		= arm_smmu_capable,
	.hw_info		= arm_smmu_hw_info,
	.domain_alloc_sva       = arm_smmu_sva_domain_alloc,
	.page_response		= arm_smmu_page_response,
	.def_domain_type	= arm_smmu_def_domain_type,
	.get_viommu_size	= arm_smmu_get_viommu_size,
	.viommu_init		= arm_vsmmu_init,
	.user_pasid_table	= 1,
	.pgsize_bitmap		= -1UL, /* Restricted during device attach */
	.owner			= THIS_MODULE,

> Also I am not sure how that
> looks from the kernel perspective (do we have 2 struct devices per SMMU?)

I think you'd want to have pkvm bound to the physical struct device
and then spawn new faux, aux or something devices for the virtualized
IOMMUs that probes the new paravirt driver. This driver would be fully
self contained.

> I think we are on the same page about how that will look at the end.
> For nesting there will be a pKVM driver (as mentioned in the cover letter)
> to probe register the SMMUs, then it will unbind itself to let the
> current

That sounds a little freaky to me. I guess it can be made to work, and
is maybe the best option, but if this is the longterm plan then
starting out with a non-iommu subsystem pkvm driver seems like a good
idea.

Today the pkvm driver would do enough to boot up pkvm in this limited
mode, and maybe you have some small code duplications with the iommu
driver. It forks off a aux device to create the para-virt pkvm iommu
subsystem driver.

Tomorrow you further shrink down the pkvm driver and remove the
para-virt driver, then just somehow bind/unbind the pkvm one at early
boot.

Minimal changes to the existing smmu driver in either case.

> Then the hypervisor driver will use trap and emulate to handle SMMU access
> to the MMIO, providing an architecturally accurate SMMUv3 emualation, and
> it will not register iommu_ops.

Well, it registers normal smmuv3 ops only, I think you mean.

> So, I will happily drop the hypercalls and the iommu_ops from this series,
> if there is a better way to enlighten the hypervisor about the SIDs to be
> in identity.

The iommu ops are a reasonable and appropriate way to do this.

But since pkvm is all so special anyhow maybe you could hook
pci_enable_device() and do your identity hypercall there? Do you
already trap the config write to catch Bus Master Enable? Can pkvm
just set identity when BME=1 and ABORT when BME=0?

Jason