Message-ID: <YWe3zS4lIn8cj6su@yekko>
Date: Thu, 14 Oct 2021 15:53:33 +1100
From: David Gibson <david@...son.dropbear.id.au>
To: Jason Gunthorpe <jgg@...dia.com>
Cc: Liu Yi L <yi.l.liu@...el.com>, alex.williamson@...hat.com,
hch@....de, jasowang@...hat.com, joro@...tes.org,
jean-philippe@...aro.org, kevin.tian@...el.com, parav@...lanox.com,
lkml@...ux.net, pbonzini@...hat.com, lushenming@...wei.com,
eric.auger@...hat.com, corbet@....net, ashok.raj@...el.com,
yi.l.liu@...ux.intel.com, jun.j.tian@...el.com, hao.wu@...el.com,
dave.jiang@...el.com, jacob.jun.pan@...ux.intel.com,
kwankhede@...dia.com, robin.murphy@....com, kvm@...r.kernel.org,
iommu@...ts.linux-foundation.org, dwmw2@...radead.org,
linux-kernel@...r.kernel.org, baolu.lu@...ux.intel.com,
nicolinc@...dia.com
Subject: Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

On Mon, Oct 11, 2021 at 03:49:14PM -0300, Jason Gunthorpe wrote:
> On Mon, Oct 11, 2021 at 05:02:01PM +1100, David Gibson wrote:
>
> > > This means we cannot define an input that has a magic HW specific
> > > value.
> >
> > I'm not entirely sure what you mean by that.
>
> I mean if you make a general property 'foo' that userspace must
> specify correctly then your API isn't general anymore. Userspace must
> know if it is A or B HW to set foo=A or foo=B.

I absolutely agree. Which is exactly why I'm advocating that
userspace should request from the kernel what it needs (providing a
*minimum* of information) and the kernel satisfies that (filling in
the missing information as suitable for the platform) or outright
fails.

I think that is more robust across multiple platforms and usecases
than advertising a bunch of capabilities and forcing userspace to
interpret those to work out what it can do.

> Supported IOVA ranges are easily like that as every IOMMU is
> different. So DPDK shouldn't provide such specific or binding
> information.

Absolutely, DPDK should not provide that. qemu *should* provide that,
because the specific IOVAs matter to the guest. That will inevitably
mean that the request is more likely to fail, but that's a fundamental
tradeoff.
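To make the "request what you need, or say you don't care" model a bit more concrete, here is a rough sketch of what the kernel side could look like. This is only an illustration: the struct names, the REQ_F_* flags and ioas_resolve() are made up for this mail and are not part of the RFC's uAPI.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flags: a clear bit means "I don't care, kernel chooses". */
#define REQ_F_IOVA_RANGE (1u << 0)
#define REQ_F_PAGE_SIZE  (1u << 1)

/* Hypothetical request from userspace. */
struct ioas_request {
    uint32_t flags;
    uint64_t iova_start;  /* only meaningful with REQ_F_IOVA_RANGE */
    uint64_t iova_last;   /* inclusive, only with REQ_F_IOVA_RANGE */
    uint64_t page_size;   /* only meaningful with REQ_F_PAGE_SIZE */
};

/* Hypothetical summary of what the host IOMMU can translate. */
struct host_iommu {
    uint64_t iova_min;
    uint64_t iova_max;    /* inclusive */
    uint64_t page_sizes;  /* bitmap of supported IO page sizes */
};

/*
 * Fill in every unspecified field from the host, and check every
 * specified field against the host.  On success *req holds the final
 * values; on failure *req is left untouched.
 */
bool ioas_resolve(const struct host_iommu *host, struct ioas_request *req)
{
    struct ioas_request r = *req;

    if (!(r.flags & REQ_F_IOVA_RANGE)) {
        r.iova_start = host->iova_min;
        r.iova_last = host->iova_max;
    }
    if (!(r.flags & REQ_F_PAGE_SIZE))
        /* lowest set bit, i.e. the smallest supported page size */
        r.page_size = host->page_sizes & (~host->page_sizes + 1);

    if (r.iova_start > r.iova_last ||
        r.iova_start < host->iova_min || r.iova_last > host->iova_max)
        return false;

    /* must be exactly one power of two that the host supports */
    if (r.page_size == 0 || (r.page_size & (r.page_size - 1)) ||
        !(r.page_size & host->page_sizes))
        return false;

    *req = r;
    return true;
}
```

A DPDK-style caller would pass flags = 0 and take whatever comes back. qemu would set REQ_F_IOVA_RANGE with the guest-visible range, and has to be prepared for the call to fail.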
> > No, I don't think that needs to be a condition. I think it's
> > perfectly reasonable for a constraint to be given, and for the host
> > IOMMU to just say "no, I can't do that". But that does mean that each
> > of these values has to have an explicit way of userspace specifying "I
> > don't care", so that the kernel will select a suitable value for those
> > instead - that's what DPDK or other userspace would use nearly all the
> > time.
>
> My feeling is that qemu should be dealing with the host != target
> case, not the kernel.
>
> The kernel's job should be to expose the IOMMU HW it has, with all
> features accessible, to userspace.

See... to me this is contrary to the point we agreed on above.

> Qemu's job should be to have a userspace driver for each kernel IOMMU
> and the internal infrastructure to make accelerated emulations for all
> supported target IOMMUs.

This seems the wrong way around to me. I see qemu as providing logic
to emulate each target IOMMU. Where that matches the host, there's
the potential for an accelerated implementation, but it makes life a
lot easier if we can at least have a fallback that will work on any
sufficiently capable host IOMMU.
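Roughly, the selection logic I have in mind on the qemu side is something like the following. It is a sketch only: the enum values, struct host_caps and pick_backend() are invented for illustration and don't correspond to existing qemu code.

```c
#include <stdbool.h>

/* Illustrative set of IOMMU models; not an exhaustive or real list. */
enum iommu_model { IOMMU_VTD, IOMMU_SMMUV3, IOMMU_PAPR_TCE };

enum backend {
    BACKEND_NONE,         /* host cannot back this target at all */
    BACKEND_GENERIC_MAP,  /* slow path: generic map/unmap requests */
    BACKEND_ACCELERATED,  /* host matches target, use native features */
};

/* Hypothetical summary of the host, as qemu would learn it. */
struct host_caps {
    enum iommu_model model;
    bool generic_map_ok;  /* host accepted the target's constraints */
};

/* Prefer the accelerated path, fall back to generic mapping if possible. */
enum backend pick_backend(enum iommu_model target,
                          const struct host_caps *host)
{
    if (host->model == target)
        return BACKEND_ACCELERATED;
    if (host->generic_map_ok)
        return BACKEND_GENERIC_MAP;
    return BACKEND_NONE;
}
```

The point is that the middle case exists at all: without a generic interface, every host != target combination ends up as BACKEND_NONE.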
> In other words, it is not the kernel's job to provide target IOMMU
> emulation.

Absolutely not. But it *is* the kernel's job to let qemu do as much
as it can with the *host* IOMMU.

> The kernel should provide truly generic "works everywhere" interface
> that qemu/etc can rely on to implement the least accelerated emulation
> path.

Right... seems like we're agreeing again.

> So when I see proposals to have "generic" interfaces that actually
> require very HW specific setup, and cannot be used by a generic qemu
> userspace driver, I think it breaks this model. If qemu needs to know
> it is on PPC (as it does today with VFIO's PPC specific API) then it
> may as well speak PPC specific language and forget about pretending to
> be generic.

Absolutely, the current situation is a mess.

> This approach is grounded in 15 years of trying to build these
> user/kernel split HW subsystems (particularly RDMA) where it has
> become painfully obvious that the kernel is the worst place to try and
> wrangle really divergent HW into a "common" uAPI.
>
> This is because the kernel/user boundary is fixed. Introducing
> anything generic here requires a lot of time, thought, arguing and
> risk. Usually it ends up being done wrong (like the PPC specific
> ioctls, for instance)

Those are certainly wrong, but they came about explicitly by *not*
being generic rather than by being too generic. So I'm really
confused as to what you're arguing for / against.

> and when this happens we can't learn and adapt,
> we are stuck with stable uABI forever.
>
> Exposing a device's native programming interface is much simpler. Each
> device is fixed, defined and someone can sit down and figure out how
> to expose it. Then that is it, it doesn't need revisiting, it doesn't
> need harmonizing with a future slightly different device, it just
> stays as is.

I can certainly see the case for that approach. That seems utterly at
odds with what /dev/iommu is trying to do, though.

> The cost, is that there must be a userspace driver component for each
> HW piece - which we are already paying here!
>
> > Ideally the host /dev/iommu will say "ok!", since both those ranges
> > are within the 0..2^60 translated range of the host IOMMU, and don't
> > touch the IO hole. When the guest calls the IO mapping hypercalls,
> > qemu translates those into DMA_MAP operations, and since they're all
> > within the previously verified windows, they should work fine.
>
> For instance, we are going to see HW with nested page tables, user
> space owned page tables and even kernel-bypass fast IOTLB
> invalidation.
>
> In that world does it even make sense for qemu to use slow DMA_MAP
> ioctls for emulation?

Probably not what you want ideally, but it's a really useful fallback
case to have.
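For reference, the fallback flow quoted above (verify the guest's windows once against the host, then check each mapping request against those windows) can be sketched like this. The names and the example ranges in the comments are illustrative assumptions, not taken from any existing interface.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* An IOVA range with an inclusive last address. */
struct iova_range {
    uint64_t start;
    uint64_t last;
};

static bool range_contains(const struct iova_range *outer,
                           const struct iova_range *inner)
{
    return inner->start <= inner->last &&
           inner->start >= outer->start && inner->last <= outer->last;
}

static bool ranges_overlap(const struct iova_range *a,
                           const struct iova_range *b)
{
    return a->start <= b->last && b->start <= a->last;
}

/*
 * Setup time: a guest window is acceptable if it lies inside the host's
 * translated range (e.g. 0..2^60-1) and does not touch the IO hole.
 */
bool window_ok(const struct iova_range *host, const struct iova_range *hole,
               const struct iova_range *win)
{
    return range_contains(host, win) && !ranges_overlap(hole, win);
}

/*
 * Map time: a request for [iova, iova + len) is fine if it falls
 * entirely inside one previously verified window.
 */
bool map_in_windows(const struct iova_range *wins, size_t n,
                    uint64_t iova, uint64_t len)
{
    struct iova_range req;
    size_t i;

    if (len == 0 || iova + len - 1 < iova)  /* empty or wraps around */
        return false;

    req.start = iova;
    req.last = iova + len - 1;
    for (i = 0; i < n; i++)
        if (range_contains(&wins[i], &req))
            return true;
    return false;
}
```

Since the windows were checked against the host up front, anything that passes map_in_windows() should also succeed when turned into a DMA_MAP operation.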
> A userspace framework in qemu can make these optimizations and is
> also necessarily HW specific as the host page table is HW specific.
>
> Jason
>
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson