linux-kernel - Re: [patch RFC 38/38] irqchip: Add IMS array driver

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20200821201705.GA2811871@nvidia.com>
Date:   Fri, 21 Aug 2020 17:17:05 -0300
From:   Jason Gunthorpe <jgg@...dia.com>
To:     Thomas Gleixner <tglx@...utronix.de>
CC:     LKML <linux-kernel@...r.kernel.org>, <x86@...nel.org>,
        Marc Zyngier <maz@...nel.org>, Megha Dey <megha.dey@...el.com>,
        Dave Jiang <dave.jiang@...el.com>,
        Alex Williamson <alex.williamson@...hat.com>,
        "Jacob Pan" <jacob.jun.pan@...el.com>,
        Baolu Lu <baolu.lu@...el.com>,
        Kevin Tian <kevin.tian@...el.com>,
        Dan Williams <dan.j.williams@...el.com>,
        Joerg Roedel <joro@...tes.org>,
        <iommu@...ts.linux-foundation.org>, <linux-hyperv@...r.kernel.org>,
        Haiyang Zhang <haiyangz@...rosoft.com>,
        "Jon Derrick" <jonathan.derrick@...el.com>,
        Lu Baolu <baolu.lu@...ux.intel.com>,
        Wei Liu <wei.liu@...nel.org>,
        "K. Y. Srinivasan" <kys@...rosoft.com>,
        "Stephen Hemminger" <sthemmin@...rosoft.com>,
        Steve Wahl <steve.wahl@....com>,
        "Dimitri Sivanich" <sivanich@....com>, Russ Anderson <rja@....com>,
        <linux-pci@...r.kernel.org>, Bjorn Helgaas <bhelgaas@...gle.com>,
        "Lorenzo Pieralisi" <lorenzo.pieralisi@....com>,
        Konrad Rzeszutek Wilk <konrad.wilk@...cle.com>,
        <xen-devel@...ts.xenproject.org>, Juergen Gross <jgross@...e.com>,
        Boris Ostrovsky <boris.ostrovsky@...cle.com>,
        "Stefano Stabellini" <sstabellini@...nel.org>,
        Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
        "Rafael J. Wysocki" <rafael@...nel.org>
Subject: Re: [patch RFC 38/38] irqchip: Add IMS array driver - NOT FOR MERGING

On Fri, Aug 21, 2020 at 09:47:43PM +0200, Thomas Gleixner wrote:
> On Fri, Aug 21 2020 at 09:45, Jason Gunthorpe wrote:
> > On Fri, Aug 21, 2020 at 02:25:02AM +0200, Thomas Gleixner wrote:
> >> +static void ims_mask_irq(struct irq_data *data)
> >> +{
> >> +	struct msi_desc *desc = irq_data_get_msi_desc(data);
> >> +	struct ims_array_slot __iomem *slot = desc->device_msi.priv_iomem;
> >> +	u32 __iomem *ctrl = &slot->ctrl;
> >> +
> >> +	iowrite32(ioread32(ctrl) & ~IMS_VECTOR_CTRL_UNMASK, ctrl);
> >
> > Just to be clear, this is exactly the sort of operation we can't do
> > with non-MSI interrupts. For a real PCI device to execute this it
> > would have to keep the data on die.
> 
> We means NVIDIA and your new device, right?

We'd like to use this in the current Mellanox NIC HW, eg the mlx5
driver. (NVIDIA acquired Mellanox recently)

> So if I understand correctly then the queue memory where the MSI
> descriptor sits is in RAM.

Yes, IMHO that is the whole point of this 'IMS' stuff. If devices
could have enough on-die memory then they could just use really big
MSI-X tables. Currently due to on-die memory constraints mlx5 is
limited to a few hundred MSI-X vectors.

Since MSI-X tables are exposed via MMIO they can't be 'swapped' to
RAM.

Moving away from MSI-X's MMIO access model allows them to be swapped
to RAM. The cost is that accessing them for update is a
command/response operation not a MMIO operation.

The HW is already swapping the queues causing the interrupts to RAM,
so adding a bit of additional data to store the MSI addr/data is
reasonable.

To give some sense, a 'working set' for the NIC device in some cases
can be hundreds of megabytes of data. System RAM is used to store
this, and precious on-die memory holds some dynamic active set, much
like a processor cache.

> How is that supposed to work if interrupt remapping is disabled?

The best we can do is issue a command to the device and spin/sleep
until completion. The device will serialize everything internally.

If the device has died the driver has code to detect and trigger a
PCI function reset which will definitely stop the interrupt.

So, the implementation of these functions would be to push any change
onto a command queue, trigger the device to DMA the command, spin/sleep
until the device returns a response and then continue on. If the
device doesn't return a response in a time window then trigger a WQ to
do a full device reset.

The spin/sleep is only needed if the update has to be synchronous, so
things like rebalancing could just push the rebalancing work and
immediately return.

> If interrupt remapping is enabled then both are trivial because then the
> irq chip can delegate everything to the parent chip, i.e. the remapping
> unit.

I did like this notion that IRQ remapping could avoid the overhead of
spin/spleep. Most of the use cases we have for this will require the
IOMMU anyhow.

> > I saw the idxd driver was doing something like this, I assume it
> > avoids trouble because it is a fake PCI device integrated with the
> > CPU, not on a real PCI bus?
> 
> That's how it is implemented as far as I understood the patches. It's
> device memory therefore iowrite32().

I don't know anything about idxd.. Given the scale of interrupt need I
assumed the idxd HW had some hidden swapping to RAM. 

Since it is on-die with the CPU there are a bunch of ways I could
imagine Intel could make MMIO triggered swapping work that are not
available to a true PCI-E device.

Jason