[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20110419005237.GA2040@neilslaptop.think-freely.org>
Date: Mon, 18 Apr 2011 20:52:37 -0400
From: Neil Horman <nhorman@...driver.com>
To: Ben Hutchings <bhutchings@...arflare.com>
Cc: Stephen Hemminger <shemminger@...tta.com>, netdev@...r.kernel.org,
davem@...emloft.net, Thomas Gleixner <tglx@...utronix.de>,
Alexander Duyck <alexander.h.duyck@...el.com>,
Jeff Kirsher <jeffrey.t.kirsher@...el.com>
Subject: Re: net: Automatic IRQ siloing for network devices
On Mon, Apr 18, 2011 at 10:51:34PM +0100, Ben Hutchings wrote:
> On Sun, 2011-04-17 at 21:08 -0400, Neil Horman wrote:
> > On Sun, Apr 17, 2011 at 07:38:59PM +0100, Ben Hutchings wrote:
> > > On Sun, 2011-04-17 at 13:20 -0400, Neil Horman wrote:
> > > > On Sat, Apr 16, 2011 at 09:17:04AM -0700, Stephen Hemminger wrote:
> > > [...]
> > > > > My gut feeling is that:
> > > > > * kernel should default to a simple static sane irq policy without user
> > > > > space. This is especially true for multi-queue devices where the default
> > > > > puts all IRQ's on one cpu.
> > > > >
> > > > Thats not how it currently works, AFAICS. The default kernel policy is
> > > > currently that cpu affinity for any newly requested irq is all cpus. Any
> > > > restriction beyond that is the purview and doing of userspace (irqbalance or
> > > > manual affinity setting).
> > >
> > > Right. Though it may be reasonable for the kernel to use the hint as
> > > the initial affinity for a newly allocated IRQ (not sure quite how we
> > > determine that).
> > >
> > So I understand what your saying here, but I'm having a hard time reconciling
> > the two notions. Currently as it stands, affinity_hint gets set by a single
> > function call in the kernel (irq_set_affinity_hint), and is called by drivers
> > wishing to guide irqbalances behavior (currently only ixgbe does this). The
> > behavior a driver is capable of guiding however are either overly simple (ixgbe
> > just tells irqbalance to place each irq on a separate cpu, which irqbalance
> > would do anyway)
>
> It's a bit more subtle than that.
>
> ixgbe is trying to set up hardware flow steering. Some versions of the
> hardware can steer packets to RX queues based on the TX queue that was
> last used for the same flow. The TX queue selection based on CPU in
> ixgbe_select_queue() should be the inverse of the IRQ affinity mapping
> of RX queues, and the affinity hints are supposed to ensure that this is
> true.
>
Ah, ok, that makes a bit more sense then. Thank you for that.
> I think it should be possible to replace those hints with use of
> irq_cpu_rmap for TX queue selection.
>
> > or overly complex (forcing policy into the kernel, which I
> > tried to do with this patch series, but based on the responses I've gotten here,
> > that seems non-desireable).
>
> The trouble is that irqbalance has been so bad for multiqueue net
> devices in the past that many vendors (including Solarflare) recommended
> that it be disabled. I think irqbalance does sensible things now but
> many systems will be running without it for some time to come.
>
> I was thinking that if the drivers could set sane hints to start with
> then it would improve matters for those systems without irqbalance. But
> maybe it would be better still for some part of the networking core or
> IRQ core to set up a default spreading of multiqueue IRQs.
>
But doesn't this force policy for irqbalancing into the kernel, as Thomas and
Eric alluded to? It seems to me that, if we can export just a bit more
information regarding irqs and their associations to devices (which has been a
major achilles heel of irqblance in the past), then I think we can create a sane
default balancing policy with some simple udev rules. I've been messing with
this a bit today.
> [...]
> > > > Actually, as I read back to myself, that acutally sounds kind of good to me. It
> > > > keeps all the policy for this in user space, and minimizes what we have to add
> > > > to the kernel to make it happen (some process information in /proc and another
> > > > udev event). I'd like to get some feedback before I start implementing this,
> > > > but I think this could be done. What do you think?
> > >
> > > I don't think it's a good idea to override the scheduler dynamically
> > > like this.
> > >
> > Why not? Not disagreeing here, but I'm curious as to why you think this is bad.
> > We already have several interfaces for doing this in user space (cgroups and
> > taskset come to mind). Nominally they are used directly by sysadmins, and used
> > sparingly for specific configurations.
>
> Yes, that is why I think this is different.
>
Ok, fair enough.
> > All I'm suggesting is that we create a
> > daemon to identify processes that would benefit from running closer to the nics
> > they are getting data from, and restricting them to cpus that fit that benefit.
> > If a sysadmin doesn't want that behavior, they can stop the daemon, or change
> > its configuration to avoid including processes they don't want to move/restrict.
>
> I think this could improve latency under low CPU load and throughput
> under high CPU load for small numbers of relatively long-lived flows.
> But for large numbers of flows or high turnover of flows the affinity
> will just be noise.
>
> You're welcome to do your own experiments, obviously!
>
I will, but I'll start with the low hanging fruit. I'm going to try exporting
the msi table for a device. With that I can use the netdev_registration uevent
to properly identify network based irqs without the need for 1/2 assed regex
searches and volume counts and do one shot rebalancing of them.
Thanks for your time & thoughts!
Neil
> Ben.
>
> --
> Ben Hutchings, Senior Software Engineer, Solarflare
> Not speaking for my employer; that's the marketing department's job.
> They asked us to note that Solarflare product names are trademarked.
>
>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Powered by blists - more mailing lists