Message-ID: <76c02ea6-f1c5-2772-419f-5ceb197fe904@intel.com>
Date:   Tue, 22 Jun 2021 08:50:52 -0700
From:   Dave Jiang <dave.jiang@...el.com>
To:     "Tian, Kevin" <kevin.tian@...el.com>,
        "Alex Williamson (alex.williamson@...hat.com)" 
        <alex.williamson@...hat.com>
Cc:     Thomas Gleixner <tglx@...utronix.de>,
        Jason Gunthorpe <jgg@...dia.com>,
        "Dey, Megha" <megha.dey@...el.com>,
        "Raj, Ashok" <ashok.raj@...el.com>,
        "Pan, Jacob jun" <jacob.jun.pan@...el.com>,
        "Liu, Yi L" <yi.l.liu@...el.com>, "Lu, Baolu" <baolu.lu@...el.com>,
        "Williams, Dan J" <dan.j.williams@...el.com>,
        "Luck, Tony" <tony.luck@...el.com>,
        "Kumar, Sanjay K" <sanjay.k.kumar@...el.com>,
        LKML <linux-kernel@...r.kernel.org>,
        "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
        Kirti Wankhede <kwankhede@...dia.com>
Subject: Re: Virtualizing MSI-X on IMS via VFIO


On 6/22/2021 3:16 AM, Tian, Kevin wrote:
> Hi, Alex,
>
> Need your help to understand the current MSI-X virtualization flow in
> VFIO. Some background info first.
>
> We have recently been discussing how to virtualize MSI-X with Interrupt
> Message Storage (IMS) on mdev:
>          https://lore.kernel.org/kvm/87im2lyiv6.ffs@nanos.tec.linutronix.de/
>
> IMS is a device specific interrupt storage, allowing an optimized and
> scalable manner for generating interrupts. idxd mdev exposes virtual
> MSI-X capability to guest but uses IMS entries physically for generating
> interrupts.
>
> Thomas has helped implement a generic ims irqchip driver:
>          https://lore.kernel.org/linux-hyperv/20200826112335.202234502@linutronix.de/
>
> idxd device allows software to specify an IMS entry (for triggering
> completion interrupt) when submitting a descriptor. To prevent one
> mdev triggering malicious interrupt into another mdev (by specifying
> an arbitrary entry), idxd ims entry includes a PASID field for validation -
> only a matching PASID in the executed descriptor can trigger interrupt
> via this entry. idxd driver is expected to program ims entries with
> PASIDs that are allocated to the mdev which owns those entries.
>
> Other devices may use a different ID and format to isolate ims entries.
> But we need to abstract a generic means for programming a vendor-specific
> ID into a vendor-specific ims entry, without violating the layering model.
>
> Thomas suggested that the vendor driver first register ID information (possibly
> plus the location to write the ID to) in msi_desc when allocating irqs
> (by extending the existing alloc function or via a new helper function), and that
> the generic ims irqchip driver then write the ID to the ims entry when it's
> started up by request_irq().
>
> Then there are two questions to be answered:
>
>      1) How does vendor driver decide the ID to be registered to msi_desc?
>      2) How is Thomas's model mapped to the MSI-X virtualization flow in VFIO?
>
> For the 1st open, there are two types of PASIDs on idxd mdev:
>
>      1) default PASID: one per mdev and allocated when mdev is created;
>      2) sva PASIDs: multiple per mdev and allocated on-demand (via vIOMMU);
>
> If vIOMMU is not exposed, all ims entries of this mdev should be
> programmed with default PASID which is always available in mdev's
> lifespan.
>
> If vIOMMU is exposed and guest sva is enabled, entries used for sva
> should be tagged with sva PASIDs, leaving others tagged with default
> PASID. To help achieve intra-guest interrupt isolation, guest idxd driver
> needs to program guest sva PASIDs into virtual MSIX_PERM register (one
> per MSI-X entry) for validation. Access to MSIX_PERM is trap-and-emulated
> by host idxd driver, which then figures out which PASID to register to
> msi_desc (this requires PASID translation info via the new /dev/iommu proposal).
>
> The guest driver is expected to update MSIX_PERM before request_irq().
>
> Now the 2nd open requires your help. Below is what I learned from
> current vfio/qemu code (for vfio-pci device):
>
>      0) Qemu doesn't attempt to allocate all irqs as reported by msix->
>          table_size. It is done in a dynamic and incremental way.
>
>      1) VFIO provides just one command (VFIO_DEVICE_SET_IRQS) for
>           allocating/enabling irqs given a set of vMSIX vectors [start, count]:
>
>          a) if irqs not allocated, allocate irqs [start+count]. Enable irqs for
>              specified vectors [start, count] via request_irq();
>          b) if irqs already allocated, enable irqs for specified vectors;
>          c) if irq already enabled, disable and re-enable irqs for specified
>               vectors because user may specify a different eventfd;
>
>      2) When guest enables virtual MSI-X capability, Qemu calls VFIO_
>          DEVICE_SET_IRQS to enable vector#0, even though it's currently
>          masked by the guest. Interrupts are received by Qemu but blocked
>          from guest via mask/pending bit emulation. The main intention is
>          to enable physical MSI-X;
>
>      3) When guest unmasks vector#0 via request_irq(), Qemu calls VFIO_
>          DEVICE_SET_IRQS to enable vector#0 again, with an eventfd different
>          from the one provided in 2);
>
>      4) When guest unmasks vector#1, Qemu finds it's outside of allocated
>          vectors (only vector#0 now):
>
>          a) Qemu first calls VFIO_DEVICE_SET_IRQS to disable and free
>              irq for vector#0;
>
>          b) Qemu then calls VFIO_DEVICE_SET_IRQS to allocate and enable
>              irqs for both vector#0 and vector#1;
>
>       5) When guest unmasks vector#2, same flow in 4) continues.
>
>       ....
>
> If above understanding is correct, how are lost interrupts avoided between
> 4.a) and 4.b), given that the irq for vector#0 has been torn down in the middle
> while from the guest p.o.v. this vector is actually unmasked? There must be
> a mechanism in place, but I just didn't figure it out...
>
> Assuming the above flow is robust, mapping Thomas's model to this flow is
> straightforward. Assume idxd mdev has two vectors: vector#0 for
> misc/error interrupt and vector#1 as completion interrupt for guest
> sva. VFIO_DEVICE_SET_IRQS is handled by idxd mdev driver:
>
>      2) When guest enables virtual MSI-X capability, Qemu calls VFIO_
>          DEVICE_SET_IRQS to enable vector#0. Because vector#0 is not
>          used for sva, MSIX_PERM#0 has PASID disabled. Host idxd driver
>          knows to register default PASID to msi_desc#0 when allocating irqs.
>          Then .startup() callback of ims irqchip is called to program default
>          PASID saved in msi_desc#0 to the target ims entry when request_irq().
>
>      3) When guest unmasks vector#0 via request_irq(), Qemu calls VFIO_
>          DEVICE_SET_IRQS to enable vector#0 again. Following same logic
>          as vfio-pci, idxd driver first disables irq#0 via free_irq() and then
>          re-enable irq#0 via request_irq(). It's still default PASID being used
>          according to msi_desc#0.

Hi Kevin, slight correction here. Because vector#0 is emulated for idxd 
vdev, it has no IMS backing. So there is no msi_desc#0 for that vector. 
msi_desc#0 actually starts at vector#1 where IMS is allocated to back 
it. vector#0 does not go through request_irq(). It only has eventfd 
part. Everything you say is correct but starts at vector#1.
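
To make the index shift concrete, here is a minimal Python sketch. It is a toy model, not idxd driver code: the helper name `msi_desc_index` and the constant `EMULATED_VECTORS` are hypothetical, and it assumes that only vector#0 is emulated and that the IMS-backed descriptors are numbered consecutively from vector#1.

```python
from typing import Optional

# Assumption for illustration: only vector#0 is emulated (eventfd only, no IMS).
EMULATED_VECTORS = 1


def msi_desc_index(vector: int) -> Optional[int]:
    """Map a guest-visible vMSI-X vector to the index of its IMS-backed msi_desc.

    Returns None for an emulated vector, which has no msi_desc and
    never goes through request_irq().
    """
    if vector < 0:
        raise ValueError("vector must be non-negative")
    if vector < EMULATED_VECTORS:
        return None  # emulated: eventfd part only
    return vector - EMULATED_VECTORS
```

Under these assumptions, vector#0 maps to no descriptor, vector#1 maps to msi_desc#0, vector#2 to msi_desc#1, and so on.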


>
>      4) When guest unmasks vector#1, Qemu finds it's outside of allocated
>          vectors (only vector#0 now):
>
>          a) Qemu first calls VFIO_DEVICE_SET_IRQS to disable and free
>              irq for vector#0. msi_desc#0 is also freed.
>
>          b) Qemu then calls VFIO_DEVICE_SET_IRQS to allocate and enable
>              irqs for both vector#0 and vector#1. At this point, MSIX_PERM#0
>             has PASID disabled while MSIX_PERM#1 has a valid guest PASID1
>             for sva. idxd driver registers default PASID to msi_desc#0 and
>             host PASID2 (translated from guest PASID1) to msi_desc#1 when
>             allocating irqs. Later when both irqs are enabled via request_irq(),
>             ims irqchip driver updates the target ims entries according to
>             msi_desc#0 and msi_desc#1 respectively.
>
> But this is specific to how Qemu virtualizes MSI-X today. What if it
> changes (or another device model is used) to allocate all table_size irqs
> when guest enables MSI-X capability? At that point we don't have valid
> MSIX_PERM content to register PASID info to msi_desc. Possibly what
> we really require is a separate helper function allowing the driver to update
> msi_desc after irq allocation, e.g. when guest unmasks a vector...
>
> And do you see any other facets which are overlooked here?
>
> Thanks
> Kevin
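
The quoted flow can be sketched as a small Python toy model, which may help when reasoning about where the PASID gets fixed. Nothing here is kernel, VFIO, or idxd code: `ToyMdev`, `MsixPerm`, `pick_pasid`, `set_irqs`, `free_irqs` and `update_desc` are all hypothetical names, the MSIX_PERM layout is invented, and the guest-to-host PASID map stands in for the translation info expected from the /dev/iommu proposal. It follows the vector numbering of the quoted text; per the correction above, on idxd the IMS-backed descriptors actually start at vector#1.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class MsixPerm:
    """Toy view of one virtual MSIX_PERM register (hypothetical layout)."""
    pasid_enabled: bool = False
    guest_pasid: int = 0


@dataclass
class ToyMdev:
    """Toy model of an mdev; not the idxd driver's data structures."""
    default_pasid: int
    # guest PASID -> host PASID (assumed to come from PASID translation info)
    pasid_map: Dict[int, int] = field(default_factory=dict)
    # vector -> emulated MSIX_PERM content written by the guest
    perm: Dict[int, MsixPerm] = field(default_factory=dict)
    # vector -> PASID registered at irq allocation (the "msi_desc" data)
    desc_pasid: Dict[int, int] = field(default_factory=dict)
    # vector -> PASID programmed into the ims entry at irq startup
    ims_pasid: Dict[int, int] = field(default_factory=dict)

    def pick_pasid(self, vector: int) -> int:
        """Translated sva PASID if MSIX_PERM enables one, else the default PASID."""
        perm = self.perm.get(vector, MsixPerm())
        if perm.pasid_enabled:
            return self.pasid_map[perm.guest_pasid]
        return self.default_pasid

    def set_irqs(self, start: int, count: int) -> None:
        """Model of one enable call for vectors [start, start+count)."""
        # Allocate any missing irqs up to start+count, registering the PASID.
        for vector in range(start + count):
            if vector not in self.desc_pasid:
                self.desc_pasid[vector] = self.pick_pasid(vector)
        # Disable and re-enable the requested vectors (free_irq/request_irq);
        # startup copies the registered PASID into the ims entry.
        for vector in range(start, start + count):
            self.ims_pasid.pop(vector, None)
            self.ims_pasid[vector] = self.desc_pasid[vector]

    def free_irqs(self) -> None:
        """Model of the disable-and-free step 4.a)."""
        self.ims_pasid.clear()
        self.desc_pasid.clear()

    def update_desc(self, vector: int) -> None:
        """The proposed helper: refresh the registered PASID after allocation."""
        self.desc_pasid[vector] = self.pick_pasid(vector)
```

In this model the incremental flow picks up the sva PASID for vector#1 because MSIX_PERM#1 is already valid when step 4.b) reallocates. If all vectors are allocated up front, every descriptor is registered with the default PASID, and something like `update_desc()` is needed once the guest has programmed MSIX_PERM.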
