Message-ID: <MWHPR11MB1886811339F7873A8E34549A8C089@MWHPR11MB1886.namprd11.prod.outlook.com>
Date:   Wed, 23 Jun 2021 06:12:05 +0000
From:   "Tian, Kevin" <kevin.tian@...el.com>
To:     Thomas Gleixner <tglx@...utronix.de>,
        Alex Williamson <alex.williamson@...hat.com>
CC:     Jason Gunthorpe <jgg@...dia.com>,
        "Dey, Megha" <megha.dey@...el.com>,
        "Raj, Ashok" <ashok.raj@...el.com>,
        "Pan, Jacob jun" <jacob.jun.pan@...el.com>,
        "Jiang, Dave" <dave.jiang@...el.com>,
        "Liu, Yi L" <yi.l.liu@...el.com>, "Lu, Baolu" <baolu.lu@...el.com>,
        "Williams, Dan J" <dan.j.williams@...el.com>,
        "Luck, Tony" <tony.luck@...el.com>,
        "Kumar, Sanjay K" <sanjay.k.kumar@...el.com>,
        LKML <linux-kernel@...r.kernel.org>, KVM <kvm@...r.kernel.org>,
        Kirti Wankhede <kwankhede@...dia.com>,
        "Peter Zijlstra" <peterz@...radead.org>,
        Marc Zyngier <maz@...nel.org>,
        "Bjorn Helgaas" <helgaas@...nel.org>
Subject: RE: Virtualizing MSI-X on IMS via VFIO

> From: Thomas Gleixner <tglx@...utronix.de>
> Sent: Wednesday, June 23, 2021 7:59 AM
> 
[...]
> I only can assume that back then the main problem was vector exhaustion
> on the host and to avoid allocating memory for interrupt descriptors
> etc, right?
> 
> The host vector exhaustion problem was that each MSIX vector consumed a
> real CPU vector which is a limited resource on X86. This is no longer
> the case today:
> 
>     1) pci_msix_enable[range]() consumes exactly zero CPU vectors from
>        the allocatable range independent of the number of MSIX vectors
>        it allocates, unless it is in multi-queue managed mode where it
>        will actually reserve a vector (maybe two) per CPU.
> 
>        But for devices which are not using that mode, they just
>        opportunistically "reserve" vectors.
> 
>        All entries are initialized with a special system vector which
>        when raised will emit a nastigram in dmesg.
> 
>     2) request_irq() actually allocates a CPU vector from the
>        allocatable vector space which can obviously still fail, which is
>        perfectly fine.
> 
> So the only downside today of allocating more MSI-X vectors than
> necessary is memory consumption for the irq descriptors.

Curious about the IRTE entry when IRQ remapping is enabled. Is it also
allocated at request_irq() time?
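
As a sketch of the two-stage model described above (illustrative,
kernel-style; the example_* names are made up, while pci_alloc_irq_vectors(),
pci_irq_vector() and request_irq() are the real APIs):

    /* Stage 1: allocate all MSI-X entries up front. This consumes no
     * CPU vectors; unrequested entries point at the spurious handler. */
    nvec = pci_alloc_irq_vectors(pdev, 1, 64, PCI_IRQ_MSIX);
    if (nvec < 0)
        return nvec;

    /* Stage 2: only request_irq() claims a real CPU vector from the
     * allocatable space, and this is the step that may still fail. */
    for (i = 0; i < nvec; i++) {
        ret = request_irq(pci_irq_vector(pdev, i),
                          example_handler, 0, "example", pdev);
        if (ret)
            return ret;    /* host ran out of CPU vectors */
    }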

> 
> Though for virtualization there is still another problem:
> 
>   Even if all possible MSI-X vectors for a passthrough PCI device would
>   be allocated upfront independent of the actual usage in the guest,
>   then there is still the issue of request_irq() failing on the host
>   once the guest decides to use any of those interrupts.
> 
> It's a halfways reasonable argumentation by some definition of
> reasonable, that this case would be a host system configuration problem
> and the admin who overcommitted is responsible for the consequence.
> 
> Where the only reasonable consequence is to kill the guest right there
> because there is no mechanism to actually tell it that the host ran out
> of resources.
> 
> Not at all a pretty solution, but it is contrary to the status quo well
> defined. The most important aspect is that it is well defined for the
> case of success:
> 
>   If it succeeds then there is no way that already requested interrupts
>   can be lost or end up being redirected to the legacy PCI irq due to
>   clearing the MSIX enable bit, which is a gazillion times better than
>   the "let's hope it works" based tinkerware we have now.

Fair enough.

> 
> So, aside of the existing VFIO/PCI/MSIX thing being just half thought
> out, even thinking about proliferating this with IMS is bonkers.
> 
> IMS is meant to avoid the problem of MSI-X which needs to disable MSI-X
> in order to expand the number of vectors. The use cases are going to be
> even more dynamic than the usual device drivers, so the lost interrupt
> issue will be much more likely to trigger.
> 
> So no, we are not going to proliferate this complete ignorance of how
> MSI-X actually works and just cram another "feature" into code which is
> known to be incorrect.
> 

So the correct flow would be as below:

    guest::enable_msix()
      trapped_by_host()
        pci_alloc_irq_vectors(); // for all possible vMSI-X entries
          pci_enable_msix();      

    guest::unmask()
      trapped_by_host()
        request_irqs();

The first trap calls a new VFIO ioctl, e.g. VFIO_DEVICE_ALLOC_IRQS.

The second trap can reuse the existing VFIO_DEVICE_SET_IRQS, which just
does request_irq() if the specified IRQs have been allocated.
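
A host-side sketch of those two traps (VFIO_DEVICE_ALLOC_IRQS is the
proposed new ioctl, so its argument layout below is made up; the
VFIO_DEVICE_SET_IRQS part uses the existing uAPI):

    /* trap 1: guest sets MSI-X enable -> allocate all vectors upfront */
    alloc.argsz = sizeof(alloc);
    alloc.count = num_vmsix_entries;                  /* all possible entries */
    ioctl(device_fd, VFIO_DEVICE_ALLOC_IRQS, &alloc); /* new ioctl */

    /* trap 2: guest unmasks entry i -> request_irq() on the host */
    irq_set.argsz = sizeof(irq_set) + sizeof(int);    /* + eventfd payload */
    irq_set.flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER;
    irq_set.index = VFIO_PCI_MSIX_IRQ_INDEX;
    irq_set.start = i;
    irq_set.count = 1;
    ioctl(device_fd, VFIO_DEVICE_SET_IRQS, &irq_set); /* existing ioctl */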

Then map IMS onto this flow:

    guest::enable_msix()
      trapped_by_host()
        msi_domain_alloc_irqs(); // for all possible vMSI-X entries
        for_all_allocated_irqs(i)
          pci_update_msi_desc_id(i, default_pasid); // a new helper func

    guest::unmask(entry#0) 
      trapped_by_host()
        request_irqs();
          ims_array_irq_startup(); // write msi_desc.id (default_pasid) to ims entry

    guest::set_msix_perm(entry#1, guest_sva_pasid)
      trapped_by_host()
        pci_update_msi_desc_id(1, host_sva_pasid);

    guest::unmask(entry#1)
      trapped_by_host()
        request_irqs();
          ims_array_irq_startup(); // write msi_desc.id (host_sva_pasid) to ims entry

Does the above match your thoughts?

Thanks
Kevin
