Message-ID: <20231211160632.GI2944114@nvidia.com>
Date: Mon, 11 Dec 2023 12:06:32 -0400
From: Jason Gunthorpe <jgg@...dia.com>
To: "Suthikulpanit, Suravee" <suravee.suthikulpanit@....com>
Cc: Yi Liu <yi.l.liu@...el.com>, "Tian, Kevin" <kevin.tian@...el.com>,
"Giani, Dhaval" <Dhaval.Giani@....com>,
Vasant Hegde <vasant.hegde@....com>,
"joro@...tes.org" <joro@...tes.org>,
"alex.williamson@...hat.com" <alex.williamson@...hat.com>,
"robin.murphy@....com" <robin.murphy@....com>,
"baolu.lu@...ux.intel.com" <baolu.lu@...ux.intel.com>,
"cohuck@...hat.com" <cohuck@...hat.com>,
"eric.auger@...hat.com" <eric.auger@...hat.com>,
"nicolinc@...dia.com" <nicolinc@...dia.com>,
"kvm@...r.kernel.org" <kvm@...r.kernel.org>,
"mjrosato@...ux.ibm.com" <mjrosato@...ux.ibm.com>,
"chao.p.peng@...ux.intel.com" <chao.p.peng@...ux.intel.com>,
"yi.y.sun@...ux.intel.com" <yi.y.sun@...ux.intel.com>,
"peterx@...hat.com" <peterx@...hat.com>,
"jasowang@...hat.com" <jasowang@...hat.com>,
"shameerali.kolothum.thodi@...wei.com"
<shameerali.kolothum.thodi@...wei.com>,
"lulu@...hat.com" <lulu@...hat.com>,
"iommu@...ts.linux.dev" <iommu@...ts.linux.dev>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-kselftest@...r.kernel.org" <linux-kselftest@...r.kernel.org>,
"Duan, Zhenzhong" <zhenzhong.duan@...el.com>,
"joao.m.martins@...cle.com" <joao.m.martins@...cle.com>,
"Zeng, Xin" <xin.zeng@...el.com>,
"Zhao, Yan Y" <yan.y.zhao@...el.com>
Subject: Re: [PATCH v6 0/6] iommufd: Add nesting infrastructure (part 2/2)
On Mon, Dec 11, 2023 at 10:34:09PM +0700, Suthikulpanit, Suravee wrote:
> Currently, the AMD IOMMU driver allocates a DomainId per IOMMU group.
> One issue with this is when we have nested translation, where we could end up
> with multiple devices (RIDs) sharing the same PASID and the same hDomainID.
Which means you also create multiple GCR3 tables, since those are
(soon) per-device, and we end up in the situation I described for a
functionally legitimate reason :( It just wastes memory by duplicating
GCR3 tables.
> For example:
>
> - Host view
> Device1 (RID 1) w/ hDomainId 1
> Device2 (RID 2) w/ hDomainId 1
So.. Groups are another ugly mess that we may have to do something
more robust about.
The group infrastructure assumes that all devices in the group have
the same translation. This is not how the VM communicates: each member
of the group gets its own DTE, and there are legitimate cases
where the DTEs will be different (even if just temporarily).
How to mesh this is not yet solved (most likely we need to allow group
members to have temporarily different translation). But in the long
run the group should definitely not be providing the cache tag; the
driver has to be smarter than this.
I think we talked about this before.. For the AMD driver, the v1 page
table should store the domainid in the iommu_domain, and that value
should be used everywhere.
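
Roughly like this - just a sketch, the struct/field names below are made
up for illustration, not the actual amd_iommu code:

  #include <linux/iommu.h>
  #include <linux/types.h>

  /*
   * Illustrative only: the DomainID (the IOTLB cache tag) lives in the
   * iommu_domain that owns the v1 page table, so every device attached
   * to that domain automatically reuses the same tag.
   */
  struct amd_v1_domain {
          struct iommu_domain domain;   /* generic core object */
          u16 domain_id;                /* allocated once per v1 page table */
  };

  /* One tag per page table, used for every DTE and every invalidation */
  static u16 amd_v1_cache_tag(struct amd_v1_domain *dom)
  {
          return dom->domain_id;
  }
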
For modes with a GCR3 table, the best you can do is to de-duplicate the
GCR3 tables and assign identical GCR3 tables to identical domain ids.
Ie, all devices in a group will eventually share GCR3 tables so they
can converge on the same domain id.
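
Something along these lines, conceptually (sketch only - the structure
and the flat table compare are made up; real GCR3 tables are multi-level,
this is just to show the sharing idea):

  #include <linux/refcount.h>
  #include <linux/string.h>
  #include <linux/types.h>

  /*
   * Illustrative only: identical GCR3 tables share one table and one
   * DomainID instead of every device getting a private copy.
   */
  struct gcr3_entry {
          void       *gcr3_tbl;   /* shared GCR3 table */
          u16         domain_id;  /* cache tag that goes with this table */
          refcount_t  users;
  };

  static struct gcr3_entry *gcr3_find_shared(struct gcr3_entry *cache,
                                             size_t n, const void *wanted,
                                             size_t tbl_size)
  {
          size_t i;

          for (i = 0; i < n; i++) {
                  if (cache[i].gcr3_tbl &&
                      !memcmp(cache[i].gcr3_tbl, wanted, tbl_size)) {
                          refcount_inc(&cache[i].users);
                          return &cache[i];  /* reuse table + domain_id */
                  }
          }
          /* no match: caller allocates a fresh table and a fresh DomainID */
          return NULL;
  }
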
> - Guest view
> Pass-through Device1 (vRID 3) w/ vDomainID A + PASID 0
> Pass-through Device2 (vRID 4) w/ vDomainID B + PASID 0
>
> We should be able to work around this by changing the way we assign hDomainId
> to be per-device for VFIO pass-through devices while sharing the same v1
> (stage-2) page table. This would look like:
As I said, this doesn't quite work since the VM could do other
things. The kernel must be aware of the vDomainID and must select an
appropriate hDomainID with that knowledge in mind, otherwise
multi-device-groups in guests are fully broken.
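
Ie the selection has to be keyed by the guest's vDomainID, roughly like
this (sketch only - the xarray and the alloc_host_domain_id() helper are
hypothetical, nothing like this exists in the driver today):

  #include <linux/xarray.h>
  #include <linux/types.h>

  /*
   * Illustrative only: one map per vIOMMU instance from the guest's
   * vDomainID to a host DomainID, so every vRID the guest tags with the
   * same vDomainID lands on the same hDomainID.
   */
  struct viommu_ctx {
          struct xarray vdomid_map;  /* index: vDomainID, value: hDomainID */
  };

  static int viommu_get_hdomid(struct viommu_ctx *ctx, u32 vdomid,
                               u16 *hdomid)
  {
          void *entry = xa_load(&ctx->vdomid_map, vdomid);

          if (entry) {
                  *hdomid = (u16)xa_to_value(entry);  /* reuse the mapping */
                  return 0;
          }

          /* first use of this vDomainID: pick a host ID and remember it */
          *hdomid = alloc_host_domain_id();           /* hypothetical */
          return xa_err(xa_store(&ctx->vdomid_map, vdomid,
                                 xa_mk_value(*hdomid), GFP_KERNEL));
  }
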
> - Guest view
> Pass-through Device1 (vRID 3) w/ vDomainID A + PASID 0
> Pass-through Device2 (vRID 4) w/ vDomainID B + PASID 0
>
> This should avoid the IOMMU TLB conflict. However, the invalidation would
> need to be done for both DomainId 1 and 2 when updating the v1 (stage-2)
> page table.
Which is the key problem: if the VM thinks it has only one vDomainID,
the VMM can't split that into two hDomainIDs and expect the viommu
acceleration to work - so we shouldn't try to make it work in SW
either, IMHO.
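
To spell it out: with one vDomainID backed by N hDomainIDs every guest
invalidation has to be replayed N times in SW, which the HW vIOMMU
acceleration can never match since it forwards the command with a
single DomainID. Rough sketch (made-up names, not real driver code):

  #include <linux/types.h>

  /*
   * Illustrative only: the SW fallback would have to fan a single guest
   * INVALIDATE_IOMMU_PAGES out to every host DomainID the vDomainID was
   * split into; HW acceleration forwards the command exactly once.
   */
  static void replay_guest_invalidation(const u16 *hdomids, size_t n,
                                        u64 iova, u64 size)
  {
          size_t i;

          for (i = 0; i < n; i++)
                  /* issue_inv_pages_cmd() is hypothetical */
                  issue_inv_pages_cmd(hdomids[i], iova, size);
  }
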
Jason