[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20200821201705.GA2811871@nvidia.com>
Date: Fri, 21 Aug 2020 17:17:05 -0300
From: Jason Gunthorpe <jgg@...dia.com>
To: Thomas Gleixner <tglx@...utronix.de>
CC: LKML <linux-kernel@...r.kernel.org>, <x86@...nel.org>,
Marc Zyngier <maz@...nel.org>, Megha Dey <megha.dey@...el.com>,
Dave Jiang <dave.jiang@...el.com>,
Alex Williamson <alex.williamson@...hat.com>,
"Jacob Pan" <jacob.jun.pan@...el.com>,
Baolu Lu <baolu.lu@...el.com>,
Kevin Tian <kevin.tian@...el.com>,
Dan Williams <dan.j.williams@...el.com>,
Joerg Roedel <joro@...tes.org>,
<iommu@...ts.linux-foundation.org>, <linux-hyperv@...r.kernel.org>,
Haiyang Zhang <haiyangz@...rosoft.com>,
"Jon Derrick" <jonathan.derrick@...el.com>,
Lu Baolu <baolu.lu@...ux.intel.com>,
Wei Liu <wei.liu@...nel.org>,
"K. Y. Srinivasan" <kys@...rosoft.com>,
"Stephen Hemminger" <sthemmin@...rosoft.com>,
Steve Wahl <steve.wahl@....com>,
"Dimitri Sivanich" <sivanich@....com>, Russ Anderson <rja@....com>,
<linux-pci@...r.kernel.org>, Bjorn Helgaas <bhelgaas@...gle.com>,
"Lorenzo Pieralisi" <lorenzo.pieralisi@....com>,
Konrad Rzeszutek Wilk <konrad.wilk@...cle.com>,
<xen-devel@...ts.xenproject.org>, Juergen Gross <jgross@...e.com>,
Boris Ostrovsky <boris.ostrovsky@...cle.com>,
"Stefano Stabellini" <sstabellini@...nel.org>,
Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
"Rafael J. Wysocki" <rafael@...nel.org>
Subject: Re: [patch RFC 38/38] irqchip: Add IMS array driver - NOT FOR MERGING
On Fri, Aug 21, 2020 at 09:47:43PM +0200, Thomas Gleixner wrote:
> On Fri, Aug 21 2020 at 09:45, Jason Gunthorpe wrote:
> > On Fri, Aug 21, 2020 at 02:25:02AM +0200, Thomas Gleixner wrote:
> >> +static void ims_mask_irq(struct irq_data *data)
> >> +{
> >> + struct msi_desc *desc = irq_data_get_msi_desc(data);
> >> + struct ims_array_slot __iomem *slot = desc->device_msi.priv_iomem;
> >> + u32 __iomem *ctrl = &slot->ctrl;
> >> +
> >> + iowrite32(ioread32(ctrl) & ~IMS_VECTOR_CTRL_UNMASK, ctrl);
> >
> > Just to be clear, this is exactly the sort of operation we can't do
> > with non-MSI interrupts. For a real PCI device to execute this it
> > would have to keep the data on die.
>
> We means NVIDIA and your new device, right?
We'd like to use this in the current Mellanox NIC HW, eg the mlx5
driver. (NVIDIA acquired Mellanox recently)
> So if I understand correctly then the queue memory where the MSI
> descriptor sits is in RAM.
Yes, IMHO that is the whole point of this 'IMS' stuff. If devices
could have enough on-die memory then they could just use really big
MSI-X tables. Currently due to on-die memory constraints mlx5 is
limited to a few hundred MSI-X vectors.
Since MSI-X tables are exposed via MMIO they can't be 'swapped' to
RAM.
Moving away from MSI-X's MMIO access model allows them to be swapped
to RAM. The cost is that accessing them for update is a
command/response operation not a MMIO operation.
The HW is already swapping the queues causing the interrupts to RAM,
so adding a bit of additional data to store the MSI addr/data is
reasonable.
To give some sense, a 'working set' for the NIC device in some cases
can be hundreds of megabytes of data. System RAM is used to store
this, and precious on-die memory holds some dynamic active set, much
like a processor cache.
> How is that supposed to work if interrupt remapping is disabled?
The best we can do is issue a command to the device and spin/sleep
until completion. The device will serialize everything internally.
If the device has died the driver has code to detect and trigger a
PCI function reset which will definitely stop the interrupt.
So, the implementation of these functions would be to push any change
onto a command queue, trigger the device to DMA the command, spin/sleep
until the device returns a response and then continue on. If the
device doesn't return a response in a time window then trigger a WQ to
do a full device reset.
The spin/sleep is only needed if the update has to be synchronous, so
things like rebalancing could just push the rebalancing work and
immediately return.
> If interrupt remapping is enabled then both are trivial because then the
> irq chip can delegate everything to the parent chip, i.e. the remapping
> unit.
I did like this notion that IRQ remapping could avoid the overhead of
spin/spleep. Most of the use cases we have for this will require the
IOMMU anyhow.
> > I saw the idxd driver was doing something like this, I assume it
> > avoids trouble because it is a fake PCI device integrated with the
> > CPU, not on a real PCI bus?
>
> That's how it is implemented as far as I understood the patches. It's
> device memory therefore iowrite32().
I don't know anything about idxd.. Given the scale of interrupt need I
assumed the idxd HW had some hidden swapping to RAM.
Since it is on-die with the CPU there are a bunch of ways I could
imagine Intel could make MMIO triggered swapping work that are not
available to a true PCI-E device.
Jason
Powered by blists - more mailing lists