linux-kernel - Re: [PATCH RFCv1 08/14] iommufd: Add IOMMU_VIOMMU_SET_DEV

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Zlf/84H+cfAfaZj6@Asurada-Nvidia>
Date: Wed, 29 May 2024 21:26:27 -0700
From: Nicolin Chen <nicolinc@...dia.com>
To: "Tian, Kevin" <kevin.tian@...el.com>
CC: Jason Gunthorpe <jgg@...dia.com>, "will@...nel.org" <will@...nel.org>,
	"robin.murphy@....com" <robin.murphy@....com>,
	"suravee.suthikulpanit@....com" <suravee.suthikulpanit@....com>,
	"joro@...tes.org" <joro@...tes.org>, "linux-kernel@...r.kernel.org"
	<linux-kernel@...r.kernel.org>, "iommu@...ts.linux.dev"
	<iommu@...ts.linux.dev>, "linux-arm-kernel@...ts.infradead.org"
	<linux-arm-kernel@...ts.infradead.org>, "linux-tegra@...r.kernel.org"
	<linux-tegra@...r.kernel.org>, "Liu, Yi L" <yi.l.liu@...el.com>,
	"eric.auger@...hat.com" <eric.auger@...hat.com>, "vasant.hegde@....com"
	<vasant.hegde@....com>, "jon.grimm@....com" <jon.grimm@....com>,
	"santosh.shukla@....com" <santosh.shukla@....com>, "Dhaval.Giani@....com"
	<Dhaval.Giani@....com>, "shameerali.kolothum.thodi@...wei.com"
	<shameerali.kolothum.thodi@...wei.com>
Subject: Re: [PATCH RFCv1 08/14] iommufd: Add IOMMU_VIOMMU_SET_DEV_ID ioctl

On Thu, May 30, 2024 at 03:05:05AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@...dia.com>
> > Sent: Thursday, May 30, 2024 8:59 AM
> > On Thu, May 30, 2024 at 12:28:43AM +0000, Tian, Kevin wrote:
> > > > From: Nicolin Chen <nicolinc@...dia.com>
> > > > Sent: Wednesday, May 29, 2024 11:21 AM
> > > > On Wed, May 29, 2024 at 02:58:11AM +0000, Tian, Kevin wrote:
> > > > > My question is why that option is chosen instead of going with 1:1
> > > > > mapping between vSMMU and viommu i.e. letting the kernel to
> > > > > figure out which pSMMU should be sent an invalidation cmd to, as
> > > > > how VT-d is virtualized.
> > > > >
> > > > > I want to know whether doing so is simply to be compatible with
> > > > > what VCMDQ requires, or due to another untold reason.
> > > >
> > > > Because we use viommu as a VMID holder for SMMU. So a pSMMU must
> > > > have its own viommu to store its VMID for a shared s2_hwpt:
> > > >         |-- viommu0 (VMIDx) --|-- pSMMU0 --|
> > > >  vSMMU--|-- viommu1 (VMIDy) --|-- pSMMU1 --|--s2_hwpt
> > > >         |-- viommu2 (VMIDz) --|-- pSMMU2 --|
> > > >
> > >
> > > there are other options, e.g. you can have one viommu holding multiple
> > > VMIDs each associating to a pSMMU.
> >
> > Well, possibly. But everything previously in a viommu would have
> > to be a list (for number of VMIDs) in the kernel level: not only
> > a VMID list, but also a 2D virtual ID lists, something like:
> >
> > struct xarray vdev_ids[num_of_vmid]; // per-IOMMU vID to pID lookup
> 
> ah does it mean that dev ID space on ARM is local to SMMU?
> 
> It's not the case on x86 platforms where the RID is platform-wise.

Actually I had a second thought after I replied. Yea, we only
support PCI devices at this moment, so their RIDs come from BDF
numbers, and then should be platform-wise as you pointed out:

|<-------VMM-------->|<--------------- kernel ------------>|
 vRID_a --|                       |-- VMIDx / SMMU0 -- pRID_A
 vRID_b --| => vSMMU => viommu => |-- VMIDy / SMMU1 -- pRID_B
 vRID_c --|                       |-- VMIDz / SMMU2 -- pRID_C

# x/y/z can be identical, while a/b/c and A/B/C must be unique.

So likely a single lookup list can be enough. That still can't
avoid the list of per-pIOMMU objects for some driver feature
though. So, I think having a per-pIOMMU viommu also adds a bit
of flexibility for the kernel.

Overall, indeed an implementation choice, as you mentioned :)

> > And a driver in this case would be difficult to get a complete
> > concept of a viommu object since it's core object and shared by
> > all kernel-level IOMMU instances. If a driver wants to extend a
> > viommu object for some additional feature, e.g. VINTF in this
> > series, it would likely have to create another per-driver object
> > and again another list of this kind of objects in struct viommu.
> >
> > At the end of day, we have to duplicate it one way or another for
> > the amount of physical IOMMUs. And it seems to cleaner by doing
> > it with multiple viommu objects.
> >
> > > so it's really an implementation choice that you want a same viommu
> > > topology scheme w/ or w/o VCMDQ.
> > >
> > > we all know there are some constraints of copying the physical topology,
> > > e.g. hotplugging a device or migration. for VCMDQ it's clearly an
> > > acceptable tradeoff for performance. w/o VCMDQ I'm not sure whether
> > > you want to keep more flexibility for the user. 😊
> >
> > Oh. With regular nested SMMU, there is only one virtual SMMU in
> > the guest VM. No need of copying physical topology. Just the VMM
> > needs to allocate three viommus to add them to a list of its own.
> >
> 
> Okay I missed this point. Then the number of viommus is really about
> the contract between the vmm and the kernel, which is invisible to
> the guest or the admin who launches the Qemu.

Yes. Everything should be behind the scene, since VMM can trap
and select, unlike VCMDQ doing direct MMIO to the HW without a
chance of VM Exits.

> but wait, isn't there another problem - does the VMM have the
> permission to enumerate the topology of physical SMMUs? Or probably
> the VMM only needs to know the number of relevant SMMUs for
> assigned devices and such info can be indirectly composed by extending  GET_HW_INFO...

I think VMM already has some kinda permission reading the number
of IOMMUs from "/sys/class/iommu"?

That being said, in a regular nesting case w/o VCMDQ, there is
no need to instantiate all vSMMUs to the number of pSMMUs unless
there is at least one device behind each pSMMU attaches.

So, the current implementation allocates an s2_hwpt/viommu only
on demand, when the first device from a physical SMMU attaches,
meaning there'll be only one viommu in the list until the other
first device from another pSMMU attaches. And the actual attach
logical always tries attaching a device to the existing viommus
from the list, and only allocates a new viommu (or s2_hwpt) when
all of them failed. FWIW, host-kernel/driver has a full control
against these attaches.

Thanks
Nicolin