linux-kernel - Re: [PATCH v1 02/14] iommufd: Add nesting related data structures for ARM SMMUv3

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZApJGwPjHhlDwTDV@nvidia.com>
Date:   Thu, 9 Mar 2023 17:01:15 -0400
From:   Jason Gunthorpe <jgg@...dia.com>
To:     Jean-Philippe Brucker <jean-philippe@...aro.org>
Cc:     Nicolin Chen <nicolinc@...dia.com>, robin.murphy@....com,
        will@...nel.org, eric.auger@...hat.com, kevin.tian@...el.com,
        baolu.lu@...ux.intel.com, joro@...tes.org,
        shameerali.kolothum.thodi@...wei.com,
        linux-arm-kernel@...ts.infradead.org, iommu@...ts.linux.dev,
        linux-kernel@...r.kernel.org, yi.l.liu@...el.com
Subject: Re: [PATCH v1 02/14] iommufd: Add nesting related data structures
 for ARM SMMUv3

On Thu, Mar 09, 2023 at 06:26:59PM +0000, Jean-Philippe Brucker wrote:
> On Thu, Mar 09, 2023 at 10:48:50AM -0400, Jason Gunthorpe wrote:
> > On Thu, Mar 09, 2023 at 01:42:17PM +0000, Jean-Philippe Brucker wrote:
> > 
> > > Although we can keep the alloc and hardware info separate for each IOMMU
> > > architecture, we should try to come up with common invalidation methods.
> > 
> > The invalidation language is tightly linked to the actual cache block
> > and cache tag in the IOMMU HW design.
> 
> Concretely though, what are the incompatibilities between the HW designs?
> They all need to remove a range of TLB entries, using some address space
> tag. But if there is an actual difference I do need to know.

For instance the address space tags and the cache entires they match
to are wildly different.

ARM uses a fine grained ASID and Intel uses a shared ASID called a DID
and incorporates the PASID into the cache tag.

AMD uses something called a DID that covers a different set of stuff
than the Intel DID, and it doesn't seem to work for nesting. AMD uses
PASID as the primary nested cache tag.

Superficially you can say all three have an ASID and you can have an
invalidate ASID Operation and make it "look" the same, but the actual
behavior is totally ill defined and the whole thing is utterly
obfuscated as to what does it actually MEAN.

But this doesn't matter for virtio. You have already got a spec that
defines invalidation in terms of virtio objects that sort of match
things like iommu_domains. I hope the virtio
VIRTIO_IOMMU_INVAL_S_DOMAIN is very well defined as to exactly what
objects a DOMAIN match applies to. The job of the hypervisor is to
translate that to whatever invalidation(s) the real HW requires.

ie virito is going to say "invalidate this domain" and expect the
hypervisor to spew it to every attached PASID if that is what the HW
requires (eg AMD), or do a single ASID invalidation (Intel, sometimes)

But when a vIOMMU gets a vDID or vPASID invalidation command it
doesn't mean the same thing as virtio. The driver must invalidate
exactly what the vIOMMU programming model says to invalidate because
the guest is going to spew more invalidations to cover what it
needs. Over invalidation would be a performance problem.

Exposing these subtle differences to userspace is necessary. What I do
not want is leaking those differences through an ill-defined "generic"
interface.

Even more importantly Intel and ARM should not have to fight about the
subtle implementation specific details of the specification of the
"generic" interface. If Intel needs something dumb to make their
viommu work well then they should simply be able to do it. I don't
want to spend 6 months of pointless arguing about language details in
an unnecessary "generic" spec.

> > The purpose of these interfaces is to support high performance full
> > functionality vIOMMUs of the real HW, we should not loose sight of
> > that goal.
> > 
> > We are actually planning to go futher and expose direct invalidation
> > work queues complete with HW doorbells to userspace. This obviously
> > must be in native HW format.
> 
> Doesn't seem relevant since direct access to command queue wouldn't use
> this struct.

The point is our design direction with iommufd is to expose the raw HW
to userspace, not to obfuscate it with ill defined generalizations.

> > Someone has to do the conversion. If you don't think virito should do
> > it then I'd be OK to add a type tag for virtio format invalidation and
> > put it in the IOMMU driver.
> 
> Implementing two invalidation formats in each IOMMU driver does not seem
> practical.

I don't see why.

The advantage of the kernel side is that the implementation is not
strong ABI. If we want to adjust and review the virtio invalidation
path as new HW comes along we can, so long as it is only in the
kernel.

On the other hand if we mess up the uABI for iommufd we are stuck with
it.

The safest and best uABI for iommufd is the HW native uABI because it,
almost by definition, cannot be wrong.

Anyhow, I'm still not very convinced adapting to virtio invalidation
format should be in iommu drivers. I think what you end up with for
virtio is that Intel/AMD have some nice common code to invalidate an
iommu_domain address range (probably even the existing invalidation
interface), and SMMUv3 is just totally different and special.

This is because SMMUv3 has no option to keep the PASID table in the
hypervisor so you are sadly forced to expose both the native ASID and
native PASID caches to the virtio protocol.

Given that the VM virtio driver has to have SMMUv3 specific code to
handle the CD table it must get, I don't see the problem with also
having SMMUv3 specific code in the hypervisor virtio driver to handle
invalidating based on the CD table.

Really, I want to see patches implementing all of this before we make
any decision on what the ops interface is for virtio-iommu kernel
side.

> > However, I don't know the rational for virtio-viommu, it seems like a
> > strange direction to me.
> 
> A couple of reasons are relevant here: non-QEMU VMMs don't want to emulate
> all vendor IOMMUs, new architectures get vIOMMU mostly for free,

So your argument is you can implement a simple map/unmap API riding
on the common IOMMU API and this is portable?

Seems sensible, but that falls apart pretty quickly when we talk about
nesting.. I don't think we can avoid VMM components to set this up, so
it stops being portable. At that point I'm back to asking why not use
the real HW model?

> > All the iommu drivers have native command
> > queues. ARM and AMD are both supporting native command queues directly
> > in the guest, complete with a direct guest MMIO doorbell ring.
> 
> Arm SMMUv3 mandates a single global command queue (SMMUv2 uses
> registers). An SMMUv3 can optionally implement multiple command
> queues, though I don't know if they can be safely assigned to
> guests.

It is not standardized by ARM, but it can (and has) been done.

> For a lot of SMMUv3 implementations that have a single queue and for
> other architectures, we can do better than hardware emulation.

How is using a SW emulated virtio formatted queue better than using a
SW emulated SMMUv3 ECMDQ?

The vSMMUv3 driver controls what capabilites are shown to the guest it
can definitely create a ECMDQ enabled device and do something like
assign the 2ndary ECMDQs to hypervisor kernel SW queues the same way
virito does.

I don't think there is a very solid argument that virtio-iommu is
necessary to get faster invalidation.

> > If someone wants to optimize this I'd think the way to do it is to use
> > virtio like techniques to put SW command queue processing in the
> > kernel iommu driver and continue to use the HW native interface in the
> > VM.
> 
> I didn't get which kernel this is.

hypervisor kernel.

> > IMHO this was just unioning all the different invalidation types
> > together. It makes sense for something like virtio but it is
> > illogical/obfuscated as a user/kernel interface since it still
> > requires a userspace HW driver to understand what subset of the
> > invalidations are used on the actual HW.
> 
> As above, decoding arch-specific structures into generic ones is what an
> emulated IOMMU does,

No, it is what virtio wants to do. We are deliberately trying not to
do that for real accelerated HW vIOMMU emulators.

> and it doesn't make a performance difference in which
> format it forwards that to the kernel. The host IOMMU driver checks the
> guest request and copies them into the command queue. Whether that request
> comes in the form of a structure binary-compatible with Arm SMMUvX.Y, or
> some generic structure, does not make a difference.

It is not the structure layouts that matter!

It is the semantic meaning of each request, on each unique piece of
hardware. We actually want to leak the subtle semantic differences to
userspace.

Doing that and continuing to give them the same command label is
exactly the kind of obfuscated ill defined "generic" interface I do
not want.

Jason