Message-ID: <alpine.DEB.2.11.1511031001130.4032@nanos>
Date: Tue, 3 Nov 2015 12:42:02 +0100 (CET)
From: Thomas Gleixner <tglx@...utronix.de>
To: Keith Busch <keith.busch@...el.com>
cc: LKML <linux-kernel@...r.kernel.org>, x86@...nel.org,
linux-pci@...r.kernel.org, Jiang Liu <jiang.liu@...ux.intel.com>,
Dan Williams <dan.j.williams@...el.com>,
Bjorn Helgaas <bhelgaas@...gle.com>,
Bryan Veal <bryan.e.veal@...el.com>,
Ingo Molnar <mingo@...hat.com>,
"H. Peter Anvin" <hpa@...or.com>, Martin Mares <mj@....cz>,
Jon Derrick <jonathan.derrick@...el.com>,
Peter Zijlstra <peterz@...radead.org>
Subject: Re: [RFC PATCHv3 3/4] x86/pci: Initial commit for new VMD device driver

Keith,
On Tue, 3 Nov 2015, Keith Busch wrote:
> On Mon, Nov 02, 2015 at 07:35:15PM +0100, Thomas Gleixner wrote:
> > On Tue, 27 Oct 2015, Keith Busch wrote:
> > > +config HAVE_VMDDEV
> > > + bool
> >
> > Why do you need a separate config symbol here?
>
> This is to split the kernel vs. module parts. "vmd_create_irq_domain"
> needs to be in-kernel, but VMD may be compiled as a module, so can't use
> "CONFIG_VMD_DEV".
#if IS_ENABLED(CONFIG_VMD_DEV)

is true for both "m" and "y".
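For reference, a minimal sketch of how that guard could look in core code
(the vmd_create_irq_domain prototype is only assumed here for illustration):

    /* IS_ENABLED(CONFIG_VMD_DEV) evaluates to 1 for both =y and =m, so the
     * in-kernel helper can be exposed without an extra HAVE_VMDDEV symbol. */
    #if IS_ENABLED(CONFIG_VMD_DEV)
    struct irq_domain *vmd_create_irq_domain(struct pci_dev *dev,
                                             const struct irq_domain_ops *ops);
    #else
    static inline struct irq_domain *
    vmd_create_irq_domain(struct pci_dev *dev, const struct irq_domain_ops *ops)
    {
            return NULL;
    }
    #endif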
> Alternatively we could export pci_msi_domain_ops, and no additional
> in-kernel dependencies are required.
That might not be the worst choice.
> > > + int count = 0;
> > > + struct msi_desc *msi_desc;
> > > + struct irq_domain *vmd_irqdomain, *msi_irqdomain;
> > > +
> > > + if (!dev->msix_enabled)
> > > + return NULL;
> > > +
> > > + for_each_pci_msi_entry(msi_desc, dev)
> > > + count += msi_desc->nvec_used;
> >
> > Is this the number of vectors which are available for the VMD device
> > itself?
>
> Correct, this counts the VMD device's own vectors. It is not the number
> of vectors that may be allocated in this domain, though they will all
> need to multiplex into one of these.
But you actually limit the number of vectors in that domain:
> > > + for_each_pci_msi_entry(msi_desc, dev)
> > > + count += msi_desc->nvec_used;
> > > + vmd_irqdomain = irq_domain_add_linear(NULL, count,
> > > + domain_ops, dev);
So vmd_irqdomain can only hand out "count" vectors.
The vmd_irqdomain is the parent of the msi_irqdomain:
> > > + msi_irqdomain = pci_msi_create_irq_domain(NULL, &pci_chained_msi_domain_info,
> > > + vmd_irqdomain);
But that parent limitation does not matter simply because your
msi_irqdomain does not follow down the hierarchy in the allocation
path.
So we can avoid the vmd_irqdomain creation completely. It's just
wasting memory and has no value at all. Creating the msi domain with a
NULL parent is possible.
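IOW, something along these lines (a sketch only, reusing the names from
your patch):

    /* No intermediate linear domain: the MSI domain can take a NULL
     * parent because the allocation path never recurses into it. */
    msi_irqdomain = pci_msi_create_irq_domain(NULL,
                                              &pci_chained_msi_domain_info,
                                              NULL);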
> > > + vmdirq->irq = &vmd->irqs[virq % vmd->msix_count];
> >
> > So that points to the underlying VMD device irq, right? And you spread
> > the device interrupts across the available VMD irqs. So in the worst
> > case you might end up with empty and crowded VMD irqs, right?
> > Shouldn't you try to fill that space in a more intelligent way?
>
> The domain is enabled and enumerated by PCI, so I thought devices would
> be bound to their drivers serially, creating more-or-less incrementing
> virtual irqs that should wrap around evenly.
>
> But yes, we can be smarter about this, and the above is probably flawed
> rationale, especially if considering hot-plug.
Not only hotplug. It also depends on the init ordering and the
population of the interrupt descriptor space.
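For illustration, one smarter placement would be to pick the least loaded
VMD vector at activation time. A sketch only; the "count" member is
hypothetical and not in your patch:

    /* Hypothetical: track how many child interrupts each VMD vector
     * carries and pick the emptiest one instead of virq % msix_count. */
    static struct vmd_irq_list *vmd_next_irq_list(struct vmd_dev *vmd)
    {
            struct vmd_irq_list *best = &vmd->irqs[0];
            int i;

            for (i = 1; i < vmd->msix_count; i++)
                    if (vmd->irqs[i].count < best->count)
                            best = &vmd->irqs[i];
            best->count++;
            return best;
    }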
> > > +static void vmd_flow_handler(struct irq_desc *desc)
> > > +{
> > > + struct irq_chip *chip = irq_desc_get_chip(desc);
> > > + struct vmd_irq_list *irqs = irq_desc_get_handler_data(desc);
> > > + struct vmd_irq *vmdirq;
> > > +
> > > + chained_irq_enter(chip, desc);
> > > + rcu_read_lock();
> > > + list_for_each_entry_rcu(vmdirq, &irqs->irq_list, node)
> > > + generic_handle_irq(vmdirq->virq);
> > > + rcu_read_unlock();
> >
> > I'm not entirely sure that the rcu protection here covers the whole
> > virq thing when an interrupt is torn down. Can you please explain the
> > whole protection scheme in detail?
>
> This protects against device drivers manipulating the list through
> vmd_msi_activate/deactivate. We can't spinlock the flow handler access
> due to desc->lock taken through generic_handle_irq. The same lock is
> held prior to vmd's msi activate/deactivate, so the lock order would be
> opposite in those two paths.
>
> I don't believe we're in danger of missing a necessary event or triggering
> an unexpected event here. Is the concern that we need to synchronize rcu
> instead of directly freeing the "vmdirq"? I think I see that possibility.
synchronize_rcu() does not help. You surely need to kfree_rcu()
"vmdirq", but that alone is not sufficient. You also need to think
about the underlying irq descriptor, which is not freed via RCU. Let's
look at it:
CPU0                                    CPU1

driver_exit()
  free_irq(D)
    lock(irqdesc(D))
    irq_shutdown(D)
      irq_domain_deactivate_irq(D)      vmd_flow_handler()
        vmd_msi_deactivate()              rcu_read_lock()
          lock()                          vmdirq = list_entry(...)
          list_del_rcu()                  ---> NMI
          unlock()
    unlock(irqdesc(D))
  synchronize_irq(D)
  pci_disable_msix(device)
  irq_domain_free_irqs()
    vmd_msi_free_irqs()
      kfree_rcu(vmdirq)                   <--- NMI done
                                          D = vmdirq->virq;  <- Protected by RCU
                                          generic_handle_irq(D)
                                            desc = irqdesc(D)
  irq_free_descs()
    kfree(irqdesc(D))
                                              desc->handle_irq(desc)  <-- KABOOOOM!
Subtle, isn't it? It's extremely unlikely, but when it happens you have
no chance to ever debug that.
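Just to illustrate the first half of the fix, a sketch (assumes struct
vmd_irq grows a struct rcu_head member, here called "rcu"):

    struct vmd_irq {
            struct list_head        node;
            struct vmd_irq_list     *irq;
            unsigned int            virq;
            struct rcu_head         rcu;    /* hypothetical addition */
    };

    /* In the free path, instead of kfree(vmdirq): wait for concurrent
     * vmd_flow_handler() readers to leave their RCU read side first. */
    kfree_rcu(vmdirq, rcu);

But as the trace above shows, that still does not cover the irq
descriptor itself.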
The good news for you is that this is a general issue and you can
ignore it for now. The bad news for me is that I have another urgent
issue on my ever-growing todo list.
> > > +static void vmd_remove(struct pci_dev *dev)
> > > +{
> > > + struct vmd_dev *vmd = pci_get_drvdata(dev);
> > > +
> > > + pci_set_drvdata(dev, NULL);
> > > + sysfs_remove_link(&vmd->dev->dev.kobj, "domain");
> > > + pci_stop_root_bus(vmd->bus);
> > > + pci_remove_root_bus(vmd->bus);
> > > + vmd_teardown_dma_ops(vmd);
> > > + irq_domain_remove(vmd->irq_domain);
> >
> > What tears down the chained handlers and the MSIx entries?
>
> Thanks, I will fix the chained handler in both cases mentioned.
>
> The entries themselves should be freed through devres_release after this
> exits, and after MSI-x is disabled via pcim_release.
Ok.
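For the chained handler part, a teardown sketch (assumes the MSI-X vector
numbers are still reachable, e.g. via the msix_entries array handed to
pci_enable_msix_range()):

    static void vmd_teardown_irqs(struct vmd_dev *vmd)
    {
            int i;

            /* Undo the chained flow handler setup before the MSI-X
             * vectors are released. msix_entries[] is assumed here. */
            for (i = 0; i < vmd->msix_count; i++)
                    irq_set_chained_handler_and_data(
                                    vmd->msix_entries[i].vector,
                                    NULL, NULL);
    }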
Thanks,
tglx