Message-ID: <20090324220111.GC29509@elte.hu>
Date: Tue, 24 Mar 2009 23:01:11 +0100
From: Ingo Molnar <mingo@...e.hu>
To: David Miller <davem@...emloft.net>
Cc: herbert@...dor.apana.org.au, r.schwebel@...gutronix.de,
torvalds@...ux-foundation.org, blaschka@...ux.vnet.ibm.com,
tglx@...utronix.de, a.p.zijlstra@...llo.nl,
linux-kernel@...r.kernel.org, kernel@...gutronix.de
Subject: Re: Revert "gro: Fix legacy path napi_complete crash",
* David Miller <davem@...emloft.net> wrote:
> From: Ingo Molnar <mingo@...e.hu>
> Date: Tue, 24 Mar 2009 21:54:44 +0100
>
> > * Ingo Molnar <mingo@...e.hu> wrote:
> >
> > > > Same forcedeth box i reported before. Config below. (note: if
> > > > you want to use it you need to run it through 'make oldconfig',
> > > > with all defaults accepted)
> > >
> > > Hm, i just had a test failure (hung interface) with this too.
> > >
> > > I'll go back to the original straight revert of "303c6a0: gro: Fix
> > > legacy path napi_complete crash", and will test it overnight - to
> > > establish a baseline of stability again. (to make sure there are
> > > no other bugs interacting)
> >
> > FYI, this plain revert is holding up fine in my tests so far - 50
> > random iterations - the previous one failed after 5 iterations.
>
> Something must be up with respect to letting interrupts in during
> certain windows of time, or similar.
>
> I'll take a look at this and hopefully Herbert or myself will be
> able to figure it out.
It definitely did not show the usual patterns of bug behavior - i'd
have found it yesterday morning if it did.

I spent most of the time trying to find a reliable reproducer
.config and system. Sometimes the bug went away with a minor change
in the .config. Until today i didn't even suspect that a mainline
change was causing this.

Also, note that i have artificially reduced the probability of UP
kernels in my randconfigs to about 12.5% (it is 50% upstream). Still,
despite that measure, the 'best' .config i found was a UP config -
i don't think that's an accident. Also, i had to fully saturate the
target CPU over gigabit to hit the bug most reliably.

Which suggests to me (empirically) that it's indeed a race and that
it needs a saturated system with lots of IRQs to trigger, and
perhaps that it needs saturated/overloaded network device queues and
complex userspace/softirq/hardirq interactions.

	Ingo