Message-ID: <20080424200100.GA2900@ami.dom.local>
Date: Thu, 24 Apr 2008 22:01:00 +0200
From: Jarek Poplawski <jarkao2@...il.com>
To: Tuomas Jormola <tj@...itudo.net>
Cc: netdev@...r.kernel.org
Subject: Re: PROBLEM: A set of networking related oopses
On Thu, Apr 24, 2008 at 05:25:59PM +0300, Tuomas Jormola wrote:
> Hi again,
>
> On Sun, Mar 09, 2008 at 06:31:22PM +0100, Jarek Poplawski wrote:
> > On Sun, Mar 09, 2008 at 06:58:47PM +0200, Tuomas Jormola wrote:
> > ...
> > > there be new oopses, I will replace the old card with a newer Intel
> > > gigabit card that I have lying around, and put it in a different PCI
> > > slot.
> >
> > The link I gave you described a similar problem, just with e1000.
> > The next message after this thread looks alike (e1000 driver).
> > So, you shouldn't hurry with this change. Just set this affinity
> > for both cards and check if it's respected.
> I've now run my system for about a month with the following configuration.
> I replaced the very old e100 card with a newer e1000 PCI card and set
> affinity so that interrupts for the IRQs of both e1000e and e1000 cards
> are handled by a single CPU, and this is working very well.
>
> (17:15:13)(tj@...kti)(~)$ grep eth /proc/interrupts
>  18:   88113407       3780   IO-APIC-fasteoi   uhci_hcd:usb1, uhci_hcd:usb6, eth0
> 217:    9710797       4297   PCI-MSI-edge      eth1
>
> (This is after about 8 days of uptime; the affinity was set
> automatically in a local init script.)
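
A minimal sketch for checking whether the affinity is respected, assuming
the usual /proc/interrupts layout (a header row with one "CPUn" column per
CPU, then one row per IRQ). The helper name irq_cpu_counts is made up for
this illustration; it only reads the file and prints per-CPU counts for
the matching rows.

```shell
# irq_cpu_counts [FILE] [PATTERN]
# Hypothetical helper: print the per-CPU interrupt counts for every row of
# FILE (default /proc/interrupts) that matches PATTERN (default "eth").
irq_cpu_counts() {
    file=${1:-/proc/interrupts}
    pattern=${2:-eth}
    awk -v pat="$pattern" '
        NR == 1 { ncpu = NF; next }      # header row: one field per CPU
        $0 ~ pat {
            irq = $1
            sub(/:$/, "", irq)           # "18:" -> "18"
            line = "irq " irq ":"
            for (i = 1; i <= ncpu; i++)  # count columns follow the IRQ field
                line = line " cpu" (i - 1) "=" $(i + 1)
            print line
        }' "$file"
}
```

Running irq_cpu_counts twice a few minutes apart and comparing the numbers
shows which CPU column is still growing. The mask currently in effect for
an IRQ can be read from /proc/irq/<irq>/smp_affinity.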

BTW, you could also check whether setting affinity to different processors
works for you, i.e. irq 18 to cpu 1 and irq 217 to cpu 2 (as described
in the link mentioned earlier).

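
A rough sketch of that split, assuming the standard
/proc/irq/<irq>/smp_affinity interface, which takes a hexadecimal CPU
bitmask. The function names and the IRQ_ROOT variable are invented for
this example, and it assumes the two processors are CPU0 and CPU1 in
kernel numbering (which starts at 0).

```shell
# IRQ_ROOT is this sketch's own knob, so the functions can be tried against
# a scratch directory before touching /proc (which needs root).
IRQ_ROOT=${IRQ_ROOT:-/proc/irq}

# cpu_mask N: print the hex bitmask with only bit N set.
cpu_mask() {
    printf '%x\n' $((1 << $1))
}

# set_irq_affinity IRQ CPU: route IRQ to a single CPU.
set_irq_affinity() {
    cpu_mask "$2" > "$IRQ_ROOT/$1/smp_affinity"
}

# Example, as root, with the IRQ numbers from the output above:
#   set_irq_affinity 18 0     # eth0 -> CPU0 (mask 1)
#   set_irq_affinity 217 1    # eth1 -> CPU1 (mask 2)
```

If an irqbalance daemon is running, it may rewrite these masks, so it is
worth checking smp_affinity again after a while.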
> And with this, I've gotten rid of the OOPSes I had earlier. But is this
> really a feasible long term solution to the problem? I.e. if you're
> getting networking related OOPSes with an SMP kernel on a box with two or
> more CPUs, the first thing you should do is to switch off the interrupt
> handling load balancing between the CPUs by issuing some obscure statement
> on the command line? I don't think that's very friendly advice for
> so-called regular users... There's no way to work around it on the kernel
> side?

It looks like there are still attempts to fix this issue. Here is a
link to an interesting thread on this subject:
http://groups.google.com/group/linux.kernel/browse_thread/thread/6079876757758daa/43d38042acd9fb73?lnk=raot
Probably regular users shouldn't have such problems if they use
friendly distros.

> Also after installing the e1000 card, I've gotten a few of these dumps
> (see attachments) from the e1000 driver (a dozen incidents over about a
> month; sometimes there might be 3 incidents a day, sometimes a week
> passes with everything normal).

Alas, I'm not an e1000 expert (this balancing advice is rather a general
issue). I've seen similar Tx hang reports, but it seems there could be
various causes. Some of these may be fixed in current kernels; did you
try 2.6.25, BTW? Here is a case where turning off TSO helped with
something similar:
http://bugzilla.kernel.org/show_bug.cgi?id=9808
So, if you still have these problems with current kernels and you are
willing to help debug this, you should probably report it in bugzilla
too.

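
For the TSO experiment, a small sketch assuming ethtool is installed and
the e1000 card is eth1 as in the /proc/interrupts output above. The
wrapper names and the ETHTOOL override are this example's own; the
underlying calls are ethtool -K (change offload settings) and ethtool -k
(show them).

```shell
# ETHTOOL can be overridden (e.g. ETHTOOL=echo) for a dry run; this
# variable is an assumption of the sketch, not something ethtool defines.
ETHTOOL=${ETHTOOL:-ethtool}

# tso_off IFACE: turn TCP segmentation offload off (needs root).
tso_off() {
    "$ETHTOOL" -K "$1" tso off
}

# tso_state IFACE: show the current TSO line. The label's spelling differs
# between ethtool versions, hence the loose pattern.
tso_state() {
    "$ETHTOOL" -k "$1" | grep -i 'tcp.segmentation.offload'
}

# Example, as root:
#   tso_off eth1
#   tso_state eth1
```

The setting does not survive a reboot or a driver reload, so for a longer
test it would have to be reapplied, e.g. from the same local init script
that sets the affinity.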
Regards,
Jarek P.