Message-ID: <20230310121603.GA1745536@myrica>
Date: Fri, 10 Mar 2023 12:16:03 +0000
From: Jean-Philippe Brucker <jean-philippe@...aro.org>
To: Jason Gunthorpe <jgg@...dia.com>
Cc: Nicolin Chen <nicolinc@...dia.com>, robin.murphy@....com,
will@...nel.org, eric.auger@...hat.com, kevin.tian@...el.com,
baolu.lu@...ux.intel.com, joro@...tes.org,
shameerali.kolothum.thodi@...wei.com,
linux-arm-kernel@...ts.infradead.org, iommu@...ts.linux.dev,
linux-kernel@...r.kernel.org, yi.l.liu@...el.com
Subject: Re: [PATCH v1 02/14] iommufd: Add nesting related data structures
for ARM SMMUv3
On Thu, Mar 09, 2023 at 05:01:15PM -0400, Jason Gunthorpe wrote:
> > Concretely though, what are the incompatibilities between the HW designs?
> > They all need to remove a range of TLB entries, using some address space
> > tag. But if there is an actual difference I do need to know.
>
> For instance the address space tags and the cache entries they match
> to are wildly different.
>
> ARM uses a fine grained ASID and Intel uses a shared ASID called a DID
> and incorporates the PASID into the cache tag.
>
> AMD uses something called a DID that covers a different set of stuff
> than the Intel DID, and it doesn't seem to work for nesting. AMD uses
> PASID as the primary nested cache tag.
Thanks, we'll look into that.
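To make sure a generic invalidation could cover that, my rough reading of
the above is that it needs room for all of these tags (illustrative only,
the names below are invented):

	#include <linux/types.h>

	/* Illustrative only, not a proposal */
	struct example_cache_tag {
		u32 pasid;	/* Intel: part of the tag; AMD: the primary nested tag */
		u16 asid;	/* Arm SMMUv3: fine-grained, per context descriptor */
		u16 did;	/* Intel/AMD domain ID, with a different scope on each */
	};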
> This is because SMMUv3 has no option to keep the PASID table in the
> hypervisor so you are sadly forced to expose both the native ASID and
> native PASID caches to the virtio protocol.
It is possible to keep the PASID table in the host, but you need a way to
allocate GPA space for it, since the SMMU accesses the table through
stage-2 translation. I think
that necessarily requires a PV interface, but you could look into it.
Anyway, even with that, ATC invalidations take a PASID.
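For reference, an ATC invalidation command carries roughly the following
(loosely based on the cmdq entry in the Linux SMMUv3 driver, field names
approximate), so the PASID is part of the picture either way:

	#include <linux/types.h>

	struct atc_inv {
		u32  sid;	/* StreamID of the endpoint */
		u32  ssid;	/* SubstreamID, i.e. the PASID */
		u64  addr;	/* start of the input range to invalidate */
		u8   size;	/* log2 of the range size */
		bool global;	/* invalidate everything for this StreamID */
	};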
>
> Given that the VM virtio driver has to have SMMUv3 specific code to
> handle the CD table it must get, I don't see the problem with also
> having SMMUv3 specific code in the hypervisor virtio driver to handle
> invalidating based on the CD table.
There isn't much we can't do; I'm just hoping to build something
straightforward instead of having to work around awkward interfaces.
> > A couple of reasons are relevant here: non-QEMU VMMs don't want to emulate
> > all vendor IOMMUs, new architectures get vIOMMU mostly for free,
>
> So your argument is you can implement a simple map/unmap API riding
> on the common IOMMU API and this is portable?
>
> Seems sensible, but that falls apart pretty quickly when we talk about
> nesting.. I don't think we can avoid VMM components to set this up, so
> it stops being portable. At that point I'm back to asking why not use
> the real HW model?
A single VMM component that shovels data from the virtqueue to the kernel
API and back, rather than four different hardware emulations, four
different queues, four different device tables. It's obviously better for
VMMs that don't do full-system emulation like QEMU, especially as they
generally already implement a virtio transport. Smaller attack surface,
fewer bugs.
The VMM developer gets a multi-platform vIOMMU without having to study all
the different architecture manuals. There is a small amount of HW specific
data in there, but it only relates to table formats.
Ideally it wouldn't need any HW knowledge, but that would require the
APIs to be aligned: instead of ID registers we pass plain features, and
invalidations don't require HW-specific opcodes. Otherwise there is going
to be a layer of glue everywhere, which is what I'm trying to avoid here.
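Concretely I'm thinking of a single invalidation format along these lines
(only a sketch, not the current proposal; all field names are invented),
which the host driver then turns into whatever its hardware wants:

	#include <linux/types.h>

	struct viommu_invalidate {
		u32 domain;	/* the attached (nested) address space */
		u32 pasid;	/* set when the scope is a single PASID */
		u64 virt_start;	/* first IOVA of the range */
		u64 nr_pages;	/* number of pages to invalidate */
		u32 granule;	/* log2 page size of the range */
		u32 flags;	/* leaf-only, whole-PASID, whole-domain, ... */
	};

On SMMUv3 the host turns this into TLBI commands by ASID/VA, on VT-d into
the equivalent IOTLB/PASID-cache descriptors, and so on; the guest doesn't
need to know which.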
>
> > > All the iommu drivers have native command
> > > queues. ARM and AMD are both supporting native command queues directly
> > > in the guest, complete with a direct guest MMIO doorbell ring.
> >
> > Arm SMMUv3 mandates a single global command queue (SMMUv2 uses
> > registers). An SMMUv3 can optionally implement multiple command
> > queues, though I don't know if they can be safely assigned to
> > guests.
>
> It is not standardized by ARM, but it can (and has) been done.
>
> > For a lot of SMMUv3 implementations that have a single queue and for
> > other architectures, we can do better than hardware emulation.
>
> How is using a SW emulated virtio formatted queue better than using a
> SW emulated SMMUv3 ECMDQ?
We don't need to repeat it for all IOMMU architectures, nor emulate a new
queue in the kernel. The first motivation for virtio-iommu was to avoid
emulating hardware in the kernel. The SMMU maintainer saw how painful that
was to do for the GIC, saw that there is a virtualization queue readily
available in vhost and, well, it just made sense. Still does.
> > As above, decoding arch-specific structures into generic ones is what an
> > emulated IOMMU does,
>
> No, it is what virtio wants to do. We are deliberately trying not to
> do that for real accelerated HW vIOMMU emulators.
Yes, there is a line somewhere, and I'd prefer it to be the page table.
Given how many possible hardware combinations exist and how many more will
show up, it would be good to abstract things where possible.
>
> > and it doesn't make a performance difference in which
> > format it forwards that to the kernel. The host IOMMU driver checks the
> > guest request and copies them into the command queue. Whether that request
> > comes in the form of a structure binary-compatible with Arm SMMUvX.Y, or
> > some generic structure, does not make a difference.
>
> It is not the structure layouts that matter!
>
> It is the semantic meaning of each request, on each unique piece of
> hardware. We actually want to leak the subtle semantic differences to
> userspace.
These are hardware emulations, so of course they have to know about hardware
semantics. The QEMU IOMMUs can work in TCG mode, where they decode and
handle everything themselves.
Thanks,
Jean