linux-kernel - RE: [RFC] /dev/ioasid uAPI proposal

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <MWHPR11MB1886DC8ECF5D56FE485D13D58C3C9@MWHPR11MB1886.namprd11.prod.outlook.com>
Date:   Thu, 3 Jun 2021 03:22:27 +0000
From:   "Tian, Kevin" <kevin.tian@...el.com>
To:     Alex Williamson <alex.williamson@...hat.com>,
        Jason Gunthorpe <jgg@...dia.com>
CC:     Jean-Philippe Brucker <jean-philippe@...aro.org>,
        "Jiang, Dave" <dave.jiang@...el.com>,
        "Raj, Ashok" <ashok.raj@...el.com>,
        "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
        Jonathan Corbet <corbet@....net>,
        Robin Murphy <robin.murphy@....com>,
        LKML <linux-kernel@...r.kernel.org>,
        "iommu@...ts.linux-foundation.org" <iommu@...ts.linux-foundation.org>,
        David Gibson <david@...son.dropbear.id.au>,
        Kirti Wankhede <kwankhede@...dia.com>,
        "David Woodhouse" <dwmw2@...radead.org>,
        Jason Wang <jasowang@...hat.com>
Subject: RE: [RFC] /dev/ioasid uAPI proposal

> From: Alex Williamson <alex.williamson@...hat.com>
> Sent: Thursday, June 3, 2021 10:51 AM
> 
> On Wed, 2 Jun 2021 19:45:36 -0300
> Jason Gunthorpe <jgg@...dia.com> wrote:
> 
> > On Wed, Jun 02, 2021 at 02:37:34PM -0600, Alex Williamson wrote:
> >
> > > Right.  I don't follow where you're jumping to relaying DMA_PTE_SNP
> > > from the guest page table... what page table?
> >
> > I see my confusion now, the phrasing in your earlier remark led me
> > think this was about allowing the no-snoop performance enhancement in
> > some restricted way.
> >
> > It is really about blocking no-snoop 100% of the time and then
> > disabling the dangerous wbinvd when the block is successful.
> >
> > Didn't closely read the kvm code :\
> >
> > If it was about allowing the optimization then I'd expect the guest to
> > enable no-snoopable regions via it's vIOMMU and realize them to the
> > hypervisor and plumb the whole thing through. Hence my remark about
> > the guest page tables..
> >
> > So really the test is just 'were we able to block it' ?
> 
> Yup.  Do we really still consider that there's some performance benefit
> to be had by enabling a device to use no-snoop?  This seems largely a
> legacy thing.

Yes, there is indeed performance benefit for device to use no-snoop,
e.g. 8K display and some imaging processing path, etc. The problem is
that the IOMMU for such devices is typically a different one from the
default IOMMU for most devices. This special IOMMU may not have
the ability of enforcing snoop on no-snoop PCI traffic then this fact
must be understood by KVM to do proper mtrr/pat/wbinvd virtualization 
for such devices to work correctly.

> 
> > > This support existed before mdev, IIRC we needed it for direct
> > > assignment of NVIDIA GPUs.
> >
> > Probably because they ignored the disable no-snoop bits in the control
> > block, or reset them in some insane way to "fix" broken bioses and
> > kept using it even though by all rights qemu would have tried hard to
> > turn it off via the config space. Processing no-snoop without a
> > working wbinvd would be fatal. Yeesh
> >
> > But Ok, back the /dev/ioasid. This answers a few lingering questions I
> > had..
> >
> > 1) Mixing IOMMU_CAP_CACHE_COHERENCY
> and !IOMMU_CAP_CACHE_COHERENCY
> >    domains.
> >
> >    This doesn't actually matter. If you mix them together then kvm
> >    will turn on wbinvd anyhow, so we don't need to use the DMA_PTE_SNP
> >    anywhere in this VM.
> >
> >    This if two IOMMU's are joined together into a single /dev/ioasid
> >    then we can just make them both pretend to be
> >    !IOMMU_CAP_CACHE_COHERENCY and both not set IOMMU_CACHE.
> 
> Yes and no.  Yes, if any domain is !IOMMU_CAP_CACHE_COHERENCY then
> we
> need to emulate wbinvd, but no we'll use IOMMU_CACHE any time it's
> available based on the per domain support available.  That gives us the
> most consistent behavior, ie. we don't have VMs emulating wbinvd
> because they used to have a device attached where the domain required
> it and we can't atomically remap with new flags to perform the same as
> a VM that never had that device attached in the first place.
> 
> > 2) How to fit this part of kvm in some new /dev/ioasid world
> >
> >    What we want to do here is iterate over every ioasid associated
> >    with the group fd that is passed into kvm.
> 
> Yeah, we need some better names, binding a device to an ioasid (fd) but
> then attaching a device to an allocated ioasid (non-fd)... I assume
> you're talking about the latter ioasid.
> 
> >    Today the group fd has a single container which specifies the
> >    single ioasid so this is being done trivially.
> >
> >    To reorg we want to get the ioasid from the device not the
> >    group (see my note to David about the groups vs device rational)
> >
> >    This is just iterating over each vfio_device in the group and
> >    querying the ioasid it is using.
> 
> The IOMMU API group interfaces is largely iommu_group_for_each_dev()
> anyway, we still need to account for all the RIDs and aliases of a
> group.
> 
> >    Or perhaps more directly: an op attaching the vfio_device to the
> >    kvm and having some simple helper
> >          '(un)register ioasid with kvm (kvm, ioasid)'
> >    that the vfio_device driver can call that just sorts this out.
> 
> We could almost eliminate the device notion altogether here, use an
> ioasidfd_for_each_ioasid() but we really want a way to trigger on each
> change to the composition of the device set for the ioasid, which is
> why we currently do it on addition or removal of a group, where the
> group has a consistent set of IOMMU properties.  Register a notifier
> callback via the ioasidfd?  Thanks,
> 

When discussing I/O page fault support in another thread, the consensus
is that an device handle will be registered (by user) or allocated (return
to user) in /dev/ioasid when binding the device to ioasid fd. From this 
angle we can register {ioasid_fd, device_handle} to KVM and then call 
something like ioasidfd_device_is_coherent() to get the property. 
Anyway the coherency is a per-device property which is not changed 
by how many I/O page tables are attached to it.

Thanks
Kevin