Message-ID: <20110922135428.GC16740@parisc-linux.org>
Date:	Thu, 22 Sep 2011 07:54:28 -0600
From:	Matthew Wilcox <matthew@....cx>
To:	Neil Horman <nhorman@...driver.com>
Cc:	linux-kernel@...r.kernel.org, Greg Kroah-Hartman <gregkh@...e.de>,
	Jesse Barnes <jbarnes@...tuousgeek.org>,
	linux-pci@...r.kernel.org
Subject: Re: [PATCH] sysfs: add per pci device msi[x] irq listing (v3)
On Mon, Sep 19, 2011 at 11:47:15AM -0400, Neil Horman wrote:
> So a while back, I wanted to provide a way for irqbalance (and other apps) to
> definitively map irqs to devices, which, for msi[x] irqs is currently not really
> possible in user space.  My first attempt did not go so well:
> https://lkml.org/lkml/2011/4/21/308
> 
> It was plagued by the same issues that prior attempts were, namely that it
> violated the one-file-one-value sysfs rule.  I wandered off but have recently
> come back to this.  I've got a new implementation here that exports a new
> subdirectory for every pci device, called msi_irqs.  This subdirectory contains
> a variable number of numbered subdirectories, in which the number represents an
> msi irq.  Each numbered subdirectory contains attributes for that irq, which
> currently is only the mode it is operating in (msi vs. msix).  I think this fits
> within the constraints sysfs requires, and will allow irqbalance to properly map
> msi irqs to devices without having to rely on rickety, best guess methods like
> interface name matching.
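For reference, a user-space consumer of the layout described above might look roughly like the sketch below. It is only an illustration: the attribute file name `mode`, the struct `msi_irq_info` and the function `read_msi_irqs` are assumptions made for this example, not names taken from the patch.

```c
#include <assert.h>
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* One vector as exported under <device>/msi_irqs/<irq>/ (hypothetical type). */
struct msi_irq_info {
	int irq;
	char mode[16];	/* expected to hold "msi" or "msix" */
};

/* Return 1 if name is a non-empty string of decimal digits. */
static int is_irq_name(const char *name)
{
	if (*name == '\0')
		return 0;
	for (; *name; name++)
		if (!isdigit((unsigned char)*name))
			return 0;
	return 1;
}

/*
 * Scan msi_dir, e.g. "/sys/bus/pci/devices/<bdf>/msi_irqs", and fill
 * out[] with at most max entries.  Non-numeric entries are skipped.
 * Returns the number of entries stored, or -1 if the directory cannot
 * be opened.  The per-irq attribute name "mode" is an assumption.
 */
static int read_msi_irqs(const char *msi_dir, struct msi_irq_info *out, int max)
{
	DIR *dir;
	struct dirent *de;
	int n = 0;

	assert(msi_dir != NULL && out != NULL);

	dir = opendir(msi_dir);
	if (!dir)
		return -1;

	while (n < max && (de = readdir(dir)) != NULL) {
		char path[1024];
		FILE *f;

		if (!is_irq_name(de->d_name))
			continue;

		out[n].irq = atoi(de->d_name);
		out[n].mode[0] = '\0';

		snprintf(path, sizeof(path), "%s/%s/mode", msi_dir, de->d_name);
		f = fopen(path, "r");
		if (f) {
			/* Strip the trailing newline sysfs attributes end with. */
			if (fgets(out[n].mode, sizeof(out[n].mode), f))
				out[n].mode[strcspn(out[n].mode, "\n")] = '\0';
			fclose(f);
		}
		n++;
	}
	closedir(dir);
	return n;
}
```

The entries come back in whatever order readdir() yields them, so a caller that wants them sorted by irq number has to sort the array itself.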
This approach feels like building bigger rockets instead of a space
elevator :-)
What we need is to allow device drivers to ask for per-CPU interrupts,
and implement them in terms of MSI-X.  I've made a couple of stabs at
implementing this, but haven't got anything working yet.  It would solve
a number of problems:
1. NUMA cacheline fetch.  At the moment, desc->istate gets modified by
handle_edge_irq.  handle_percpu_irq doesn't need to worry about any
of that stuff, so doesn't touch desc->istate.  I've heard this is a
significant problem for the high-speed networking people.
2. /proc/interrupts is unmanageable on large machines.  There are hundreds
of interrupts and dozens of CPUs.  This would go a long way towards reducing
the number of rows in the table (it doesn't do anything about the columns).
i.e. instead of this:
 79:          0          0          0          0          0          0          0          0   PCI-MSI-edge      eth1
 80:          0          0    9275611          0          0          0          0          0   PCI-MSI-edge      eth1-TxRx-0
 81:          0          0    9275611          0          0          0          0          0   PCI-MSI-edge      eth1-TxRx-1
 82:          0          0          0          0    9275611          0          0          0   PCI-MSI-edge      eth1-TxRx-2
 83:          0          0          0          0    9275611          0          0          0   PCI-MSI-edge      eth1-TxRx-3
 84:          0          0          0          0          0    9275611          0          0   PCI-MSI-edge      eth1-TxRx-4
 85:          0          0          0          0          0    9275611          0          0   PCI-MSI-edge      eth1-TxRx-5
 86:          0          0          0          0          0          0    9275611          0   PCI-MSI-edge      eth1-TxRx-6
 87:          0          0          0          0          0          0    9275611          0   PCI-MSI-edge      eth1-TxRx-7
We'd get this:
 79:          0          0          0          0          0          0          0          0   PCI-MSI-edge      eth1
 80:    9275611    9275611    9275611    9275611    9275611    9275611    9275611    9275611   PCI-MSI-edge      eth1-TxRx
3. /proc/irq/x/smp_affinity actually makes sense again.  It can be a
mask of which interrupts are active instead of being a degenerate case
in which only the lowest set bit is actually honoured.
4. Easier to manage for the device driver.  All it needs is to call
request_percpu_irq(...) instead of trying to figure out how many
threads/cores/numa nodes/... there are in the machine, and how many
other multi-interrupt devices there are; and thus how many interrupts
it should allocate.  That can be left to the interrupt core which at
least has a chance of getting it right.
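
To make point 3 concrete, here is a small user-space sketch that decodes an smp_affinity style hex mask into the list of CPUs it names.  The function name `affinity_mask_to_cpus` is made up for this example, and the code assumes the usual format of comma-separated 32-bit hex groups with the most significant group first.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/*
 * Decode a hex CPU mask such as "ff" or "00000001,00000000" into cpus[].
 * CPUs are stored in ascending order.  Returns the number of bits set
 * (only the first max are stored), or -1 on an invalid character.
 * Assumes every group after the first is a full 8 hex digits.
 */
static int affinity_mask_to_cpus(const char *mask, int *cpus, int max)
{
	size_t len = 0;
	size_t i;
	int count = 0;
	int bit_base = 0;	/* CPU number of bit 0 of the current digit */

	assert(mask != NULL);

	/* Ignore a trailing newline, as found when reading from /proc. */
	while (mask[len] && mask[len] != '\n')
		len++;

	/* Walk from the least significant (rightmost) hex digit. */
	for (i = len; i-- > 0; ) {
		char c = mask[i];
		int nibble, b;

		if (c == ',')
			continue;
		if (!isxdigit((unsigned char)c))
			return -1;

		nibble = isdigit((unsigned char)c) ?
			c - '0' : tolower((unsigned char)c) - 'a' + 10;

		for (b = 0; b < 4; b++) {
			if (nibble & (1 << b)) {
				if (count < max)
					cpus[count] = bit_base + b;
				count++;
			}
		}
		bit_base += 4;
	}
	return count;
}
```

With today's behaviour a mask of "ff" on the eight-CPU box above still ends up delivering to a single CPU; with per-CPU interrupts the eight CPUs this function returns for "ff" would all be live targets.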
-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."