Message-ID: <20231209014726.GA2945299@nvidia.com>
Date:   Fri, 8 Dec 2023 21:47:26 -0400
From:   Jason Gunthorpe <jgg@...dia.com>
To:     Yi Liu <yi.l.liu@...el.com>,
        "Giani, Dhaval" <Dhaval.Giani@....com>,
        Vasant Hegde <vasant.hegde@....com>,
        Suravee Suthikulpanit <suravee.suthikulpanit@....com>
Cc:     joro@...tes.org, alex.williamson@...hat.com, kevin.tian@...el.com,
        robin.murphy@....com, baolu.lu@...ux.intel.com, cohuck@...hat.com,
        eric.auger@...hat.com, nicolinc@...dia.com, kvm@...r.kernel.org,
        mjrosato@...ux.ibm.com, chao.p.peng@...ux.intel.com,
        yi.y.sun@...ux.intel.com, peterx@...hat.com, jasowang@...hat.com,
        shameerali.kolothum.thodi@...wei.com, lulu@...hat.com,
        suravee.suthikulpanit@....com, iommu@...ts.linux.dev,
        linux-kernel@...r.kernel.org, linux-kselftest@...r.kernel.org,
        zhenzhong.duan@...el.com, joao.m.martins@...cle.com,
        xin.zeng@...el.com, yan.y.zhao@...el.com
Subject: Re: [PATCH v6 0/6] iommufd: Add nesting infrastructure (part 2/2)

On Fri, Nov 17, 2023 at 05:07:11AM -0800, Yi Liu wrote:

> Take Intel VT-d as an example, the stage-1 translation table is I/O page
> table. As the below diagram shows, guest I/O page table pointer in GPA
> (guest physical address) is passed to host and be used to perform the stage-1
> address translation. Along with it, modifications to present mappings in the
> guest I/O page table should be followed with an IOTLB invalidation.

I've been looking at what the three HWs need for invalidation, and it
is a bit messy. Here is my thinking; please let me know if I got it
right.

What is the starting point of the guest memory walks:
 Intel: Single Scalable Mode PASID table entry indexed by a RID & PASID
 AMD: GCR3 table (a table of PASIDs) indexed by RID
 ARM: CD table (a table of PASIDs) indexed by RID

What key does the physical HW use for invalidation:
 Intel: Domain-ID (stored in hypervisor, per PASID), PASID
 AMD: Domain-ID (stored in hypervisor, per RID), PASID
 ARM: VMID (stored in hypervisor, per RID), ASID (stored in guest)

What key does the VM use for invalidation:
 Intel: vDomain-ID (per PASID), PASID
 AMD: vDomain-ID (per RID), PASID
 ARM: ASID

What is in a Nested domain:
 Intel: A single IO page table referred to by a PASID entry
        Each vDomain-ID,PASID allocates a unique nesting domain
 AMD: A GCR3 table pointer
      Nesting domains are created for every unique GCR3 pointer.
      vDomain-ID can possibly refer to multiple Nesting domains :(
 ARM: A CD table pointer
      Nesting domains are created for every unique CD table top pointer.

How does the hypervisor compute its cache tag:
 Intel: Each time a nesting domain is attached to a (RID,PASID) the
        hypervisor driver will try to find a (DID,PASID) that is
        already used by this domain, or allocate a new DID.
 AMD: When creating the Nesting Domain the vDomain-ID should be passed
      in. The hypervisor driver will allocate a unique pDomain-ID for
      each vDomain-ID (hand wave). Several Nesting Domains will share
      the same p/vDomain-ID.
 ARM: The VMID is uniquely assigned to the Nesting Parent when it
      is allocated, in some configurations it has to be shared with
      KVM's VMID so allocating the Nesting Parent will require a KVM FD.
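
The Intel and AMD rules above can be sketched roughly as follows. This is an illustrative model in Python, not driver code: the class names, method names and the per-VM scoping of each allocator are all assumptions made for the example.

```python
import itertools


class IntelTagAllocator:
    """Hypothetical model: a nesting domain keeps one DID across attachments."""

    def __init__(self):
        self._next_did = itertools.count(1)
        self._did_by_domain = {}  # nesting domain handle -> DID

    def attach(self, domain, rid, pasid):
        # Reuse the DID already used by this domain, else allocate a new one.
        did = self._did_by_domain.get(domain)
        if did is None:
            did = next(self._next_did)
            self._did_by_domain[domain] = did
        return (did, pasid)  # the cache tag for this attachment


class AmdTagAllocator:
    """Hypothetical model: one pDomain-ID per vDomain-ID within a VM."""

    def __init__(self):
        self._next_pdom = itertools.count(1)
        self._pdom_by_vdom = {}  # vDomain-ID -> pDomain-ID

    def create_nesting_domain(self, vdomain_id, gcr3_ptr):
        # Several nesting domains (GCR3 pointers) may share a p/vDomain-ID.
        pdom = self._pdom_by_vdom.get(vdomain_id)
        if pdom is None:
            pdom = next(self._next_pdom)
            self._pdom_by_vdom[vdomain_id] = pdom
        return {"gcr3": gcr3_ptr, "vdomain_id": vdomain_id, "pdomain_id": pdom}
```

The difference is in when the tag is decided: in this model Intel decides at attach time from the domain, while AMD decides at creation time from the vDomain-ID that userspace passes in.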

Will ATC be forwarded or synthesized:
 Intel: The (vDomain-ID,PASID) is a unique nesting domain so
        the hypervisor knows exactly which RIDs this nesting domain is
        linked to and can generate an ATC invalidation. Plan is to
        suppress/discard the ATC invalidations from the VM and generate
        them in the hypervisor.
 AMD: (vDomain-ID,PASID) is ambiguous, it can refer to multiple GCR3
      tables. We know which maximal set of RIDs it represents, but not
      the actual set. I expect AMD will forward the ATC invalidation
      to avoid over invalidation.
 ARM: ASID is ambiguous. We have no idea which Nesting Domain/CD table
      the ASID is contained in. ARM must forward the ATC invalidation
      from the guest.

What iommufd object should receive the IOTLB invalidation command list:
 Intel: The Nesting domain. The command list has to be broken up per
        (vDomain-ID,PASID) and that batch delivered to the single
        nesting domain. Kernel ignores vDomain-ID/PASID and just
        invalidates whatever the nesting domain is actually attached to.
 AMD: Any Nesting Domain in the vDomain-ID group. The command list has
      to be broken up per (vDomain-ID). Kernel replaces
      vDomain-ID with pDomain-ID from the nesting domain and executes
      the invalidation.
 ARM: The Nesting Parent domain. Kernel forces the VMID from the
      Nesting Parent and executes the invalidation.
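
As a rough illustration of how a VMM might break up a guest command list under these rules, here is a minimal Python sketch. The command representation (dicts with "vdomain_id" and "pasid" fields) and the function name are hypothetical, chosen only to show the grouping key per vendor.

```python
from collections import defaultdict


def split_commands(vendor, commands):
    """Group guest invalidation commands by the key that selects the target.

    Each returned batch would be delivered to a single iommufd object.
    The dict-based command format is an assumption of this sketch.
    """
    batches = defaultdict(list)
    for cmd in commands:
        if vendor == "intel":
            # One nesting domain per (vDomain-ID, PASID).
            key = (cmd["vdomain_id"], cmd["pasid"])
        elif vendor == "amd":
            # Any nesting domain in the vDomain-ID group will do.
            key = cmd["vdomain_id"]
        elif vendor == "arm":
            # Everything goes to the Nesting Parent, which forces the VMID.
            key = "nesting_parent"
        else:
            raise ValueError(f"unknown vendor: {vendor}")
        batches[key].append(cmd)
    return dict(batches)
```

With this grouping, the same guest command list yields the most batches on Intel, fewer on AMD, and exactly one on ARM.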

In all cases the VM issues an ATC invalidation with (vRID, PASID) as
the tag. The VMM must translate vRID -> dev_id -> pRID.

For a pure SW flow the vRID can be mapped to the dev_id and the ATC
invalidation delivered to the device object (e.g. IOMMUFD_DEV_INVALIDATE).
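
The VMM half of that translation amounts to a table lookup. A minimal Python sketch, assuming the VMM keeps its own vRID -> dev_id dict (the function and table names are hypothetical):

```python
def route_atc_invalidation(vrid, pasid, vrid_to_dev_id):
    """VMM side: choose the iommufd device object for a guest ATC invalidation.

    vrid_to_dev_id is the VMM's own table (a hypothetical dict). The kernel is
    assumed to know the pRID behind each dev_id, so only (dev_id, pasid) is
    handed over. An unassigned vRID raises KeyError.
    """
    dev_id = vrid_to_dev_id[vrid]
    return dev_id, pasid
```

The pRID never appears in this sketch; the dev_id -> pRID step is left to the kernel, which forces the physical ID itself.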

Finally, we have the HW driven invalidation DMA queues that can be
directly assigned to the guest. AMD and SMMUv3+vCMDQ support this. In
this case the HW is directly processing invalidation commands without
a hypervisor trap.

To make this work the iommu needs to be programmed with:
 AMD: A vDomain-ID -> pDomain-ID table
      A vRID -> pRID table
      This is all bound to some "virtual function"
 ARM: A vRID -> pRID table
      The vCMDQ is bound to a VM_ID, so to the Nesting Parent

For AMD, as above, I suggest the vDomain-ID be passed when creating
the nesting domain.

For the AMD "virtual function", it is probably best to create a new
iommufd object that can be passed in to a few places.

The vRID->pRID table should be some mostly common
IOMMUFD_DEV_ASSIGN_VIRTUAL_ID. AMD will need to pass in the virtual
function ID and ARM will need to pass in the Nesting Parent ID.

For the HW path some function will create the command queue and
DMA/mmap it, taking in the virtual function/nesting parent as the
handle to associate it with.

For a SW path:
 AMD: All invalidations can be delivered to the virtual function
      and the kernel can use the vDomainID/vRID tables to translate
      them fully
 ARM: All invalidations can be delivered to the nesting parent
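
For the AMD SW path, the translation the kernel would perform can be modelled roughly as below. This is a Python sketch under stated assumptions: the VirtualFunction class, its two dict tables and the dict-based command format are all hypothetical stand-ins, not a proposed uAPI.

```python
from dataclasses import dataclass, field


@dataclass
class VirtualFunction:
    """Hypothetical stand-in for the proposed AMD "virtual function" object."""

    vdom_to_pdom: dict = field(default_factory=dict)  # vDomain-ID -> pDomain-ID
    vrid_to_prid: dict = field(default_factory=dict)  # vRID -> pRID

    def translate(self, cmd):
        """Return a copy of one guest command with guest IDs made physical."""
        out = dict(cmd)
        if "vdomain_id" in out:
            out["pdomain_id"] = self.vdom_to_pdom[out.pop("vdomain_id")]
        if "vrid" in out:
            out["prid"] = self.vrid_to_prid[out.pop("vrid")]
        return out
```

Because both tables hang off one object, IOTLB commands (keyed by vDomain-ID) and ATC commands (keyed by vRID) can be delivered to the same place without the VMM splitting them first.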

In many ways the nesting parent/virtual function are very similar
things. Perhaps ARM should also create a virtual function object which
is just welded to the nesting parent for API consistency.

So.. In short.. Invalidation is a PITA. The idea is the same but
annoying little details interfere with actually having a completely
common API here. IMHO the uAPI in this series is fine. It will support
Intel invalidation and non-ATC invalidation on AMD/ARM. It should be
set up so that the target domain object can be any HWPT.

ARM will be able to do IOTLB invalidation using this API.

IOMMUFD_DEV_INVALIDATE should be introduced with the same design as
HWPT invalidate. This would be used for AMD/ARM's ATC invalidation
(and would just force the stream ID; userspace must direct the vRID to
the correct dev_id).

Then in yet another series we can tackle the entire "virtual function"
vRID/pRID translation stuff when the mmapable queue thing is
introduced.

Thus next steps:
 - Respin this and let's focus on Intel only (this will be tough for
   the holidays, but if it is available I will try)
 - Get an ARM patch that just does IOTLB invalidation and add it to my
   part 3
 - Start working on IOMMUFD_DEV_INVALIDATE along with an ARM
   implementation of it
 - Reorganize the AMD RFC broadly along these lines and let's see it
   freshened up in the next months as well. I would like to see the
   AMD support structured to implement the SW paths in first steps and
   later add in the "virtual function" acceleration stuff. The latter
   is going to be complex.
 
Jason
