linux-kernel - RE: [RFC] /dev/ioasid uAPI proposal

Message-ID: <MWHPR11MB1886469C0136C6523AB158B68C3B9@MWHPR11MB1886.namprd11.prod.outlook.com>
Date:   Fri, 4 Jun 2021 08:38:26 +0000
From:   "Tian, Kevin" <kevin.tian@...el.com>
To:     Alex Williamson <alex.williamson@...hat.com>,
        Jason Gunthorpe <jgg@...dia.com>
CC:     Jean-Philippe Brucker <jean-philippe@...aro.org>,
        "Jiang, Dave" <dave.jiang@...el.com>,
        "Raj, Ashok" <ashok.raj@...el.com>,
        "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
        Jonathan Corbet <corbet@....net>,
        Robin Murphy <robin.murphy@....com>,
        LKML <linux-kernel@...r.kernel.org>,
        "iommu@...ts.linux-foundation.org" <iommu@...ts.linux-foundation.org>,
        David Gibson <david@...son.dropbear.id.au>,
        Kirti Wankhede <kwankhede@...dia.com>,
        "David Woodhouse" <dwmw2@...radead.org>,
        Jason Wang <jasowang@...hat.com>
Subject: RE: [RFC] /dev/ioasid uAPI proposal

> From: Alex Williamson <alex.williamson@...hat.com>
> Sent: Friday, June 4, 2021 5:44 AM
> 
> > Based on that observation we can say as soon as the user wants to use
> > an IOMMU that does not support DMA_PTE_SNP in the guest we can still
> > share the IO page table with IOMMUs that do support DMA_PTE_SNP.

Page table sharing between incompatible IOMMUs is not critical. I prefer
to disallow sharing in such a case as the starting point, i.e. the user
needs to create separate IOASIDs for such devices.
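
To illustrate the "separate IOASIDs" starting point, here is a minimal
userspace sketch. It is only a model: struct dev_info, ioasid_slot_for()
and count_ioasids_needed() are hypothetical names, not part of any proposed
uAPI, and the capability bit is assumed to come from a GET_INFO-style query.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-device record; not part of any real or proposed uAPI. */
struct dev_info {
    const char *name;
    bool iommu_enforces_snoop; /* assumed result of a GET_INFO-style query */
};

/*
 * Pick the IOASID slot for a device: slot 0 for devices behind an IOMMU
 * that can enforce snoop, slot 1 for the rest. Devices with different
 * capabilities never land in the same slot, so they never share an
 * I/O page table.
 */
int ioasid_slot_for(const struct dev_info *dev)
{
    assert(dev != NULL);
    return dev->iommu_enforces_snoop ? 0 : 1;
}

/* Count how many distinct IOASIDs a set of devices needs (0, 1 or 2). */
size_t count_ioasids_needed(const struct dev_info *devs, size_t n)
{
    bool used[2] = { false, false };

    assert(devs != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        used[ioasid_slot_for(&devs[i])] = true;
    return (size_t)used[0] + (size_t)used[1];
}
```

With device {A} behind a snoop-enforcing IOMMU and device {B} behind one
that is not, the model asks for two IOASIDs; with {A} alone it asks for one.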

> 
> If your goal is to prioritize IO page table sharing, sure.  But because
> we cannot atomically transition from one to the other, each device is
> stuck with the pages tables it has, so the history of the VM becomes a
> factor in the performance characteristics.
> 
> For example if device {A} is backed by an IOMMU capable of blocking
> no-snoop and device {B} is backed by an IOMMU which cannot block
> no-snoop, then booting VM1 with {A,B} and later removing device {B}
> would result in ongoing wbinvd emulation versus a VM2 only booted with
> {A}.
> 
> Type1 would use separate IO page tables (domains/ioasids) for these such
> that VM1 and VM2 have the same characteristics at the end.
> 
> Does this become user defined policy in the IOASID model?  There's
> quite a mess of exposing sufficient GET_INFO for an IOASID for the user
> to know such properties of the IOMMU, plus maybe we need mapping flags
> equivalent to IOMMU_CACHE exposed to the user, preventing sharing an
> IOASID that could generate IOMMU faults, etc.

IOMMU_CACHE is a fixed attribute of a given IOMMU. So it's better to
convey this info to userspace via GET_INFO for a device_label, before
creating any IOASID. But overall I agree that careful thinking is required
about how to organize the reporting of such info (per-fd, per-device,
per-ioasid) to userspace.

> 
> > > > It doesn't solve the problem to connect kvm to AP and kvmgt though
> > >
> > > It does not, we'll probably need a vfio ioctl to gratuitously announce
> > > the KVM fd to each device.  I think some devices might currently fail
> > > their open callback if that linkage isn't already available though, so
> > > it's not clear when that should happen, ie. it can't currently be a
> > > VFIO_DEVICE ioctl as getting the device fd requires an open, but this
> > > proposal requires some availability of the vfio device fd without any
> > > setup, so presumably that won't yet call the driver open callback.
> > > Maybe that's part of the attach phase now... I'm not sure, it's not
> > > clear when the vfio device uAPI starts being available in the process
> > > of setting up the ioasid.  Thanks,
> >
> > At a certain point we maybe just have to stick to backward compat, I
> > think. Though it is useful to think about green field alternates to
> > try to guide the backward compat design..
> 
> I think more to drive the replacement design; if we can't figure out
> how to do something other than backwards compatibility trickery in the
> kernel, it's probably going to bite us.  Thanks,
> 

I'm a bit lost on the flow you have in mind. Here is one flow based
on my understanding of this discussion. Please comment on whether it
matches your thinking:

0) ioasid_fd is created and registered to KVM via KVM_ADD_IOASID_FD;

1) Qemu binds dev1 to ioasid_fd;

2) Qemu calls IOASID_GET_DEV_INFO for dev1. This will carry
     IOMMU_CACHE info, i.e. whether the underlying IOMMU can enforce snoop;

3) Qemu plans to create a gpa_ioasid, and attach dev1 to it. Here Qemu
    needs to figure out whether dev1 wants to do no-snoop. This might
    be based on a fixed vendor/class list or specified by the user;

4) gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC); At this point a 'snoop'
     flag is specified to decide the page table format, which is supposed
     to match dev1;

5) Qemu attaches dev1 to gpa_ioasid via VFIO_ATTACH_IOASID. At this
     point, snoop/no-snoop is specified again. If it is not supported by
     the related IOMMU or differs from what gpa_ioasid has, the attach fails;

6) Qemu calls KVM to update the snoop requirement via KVM_UPDATE_IOASID_FD.
    This triggers ioasidfd_for_each_ioasid();

Later, when dev2 is attached to gpa_ioasid, the same flow is followed. This
implies that KVM_UPDATE_IOASID_FD is called only when a new IOASID is
created or an existing IOASID is destroyed, because all devices under an
IOASID should have the same snoop requirement.
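
The attach rule in step 5 and the per-IOASID walk in step 6 can be modelled
in plain C as below. This is only a sketch of my reading of the flow, not
kernel code: struct dev, struct ioasid, attach_dev() and
any_no_snoop_ioasid() are hypothetical names, and the real ioctls
(IOASID_ALLOC, VFIO_ATTACH_IOASID, KVM_UPDATE_IOASID_FD) are not called.
It also assumes that asking for no-snoop is always acceptable to the IOMMU,
which the flow above does not spell out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Step 2: what a GET_DEV_INFO-style query is assumed to report. */
struct dev {
    bool iommu_enforces_snoop;
};

/* Step 4: the 'snoop' flag fixed at allocation time, plus a device count. */
struct ioasid {
    bool snoop;
    int ndevs;
};

/*
 * Step 5: attach fails (-1) if enforced snoop is requested but the
 * device's IOMMU cannot enforce it, or if the request differs from the
 * flag the IOASID was allocated with. Returns 0 on success.
 */
int attach_dev(struct ioasid *ioasid, const struct dev *dev, bool want_snoop)
{
    assert(ioasid != NULL && dev != NULL);
    if (want_snoop && !dev->iommu_enforces_snoop)
        return -1;
    if (want_snoop != ioasid->snoop)
        return -1;
    ioasid->ndevs++;
    return 0;
}

/*
 * Step 6: model of the per-IOASID walk. Since every device under an
 * IOASID has the same snoop requirement, looking at the IOASIDs alone
 * is enough to tell whether any no-snoop user exists.
 */
bool any_no_snoop_ioasid(const struct ioasid *ids, size_t n)
{
    assert(ids != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (!ids[i].snoop)
            return true;
    return false;
}
```

In this model, attaching dev2 to an existing gpa_ioasid never changes the
result of any_no_snoop_ioasid(); only creating or destroying an IOASID does,
which is why the KVM update is tied to those two events.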

Thanks
Kevin