Message-ID: <64bb37e0801050652t7568e438uf93208601df84ef6@mail.gmail.com>
Date: Sat, 5 Jan 2008 15:52:32 +0100
From: "Torsten Kaiser" <just.for.lkml@...glemail.com>
To: "Jarek Poplawski" <jarkao2@...il.com>
Cc: "Herbert Xu" <herbert@...dor.apana.org.au>,
"Andrew Morton" <akpm@...ux-foundation.org>,
linux-kernel@...r.kernel.org, "Neil Brown" <neilb@...e.de>,
"J. Bruce Fields" <bfields@...ldses.org>, netdev@...r.kernel.org,
"Tom Tucker" <tom@...ngridcomputing.com>
Subject: Re: 2.6.24-rc6-mm1
On Jan 5, 2008 11:13 AM, Jarek Poplawski <jarkao2@...il.com> wrote:
> On Sat, Jan 05, 2008 at 09:01:02AM +0100, Torsten Kaiser wrote:
> > On Jan 5, 2008 1:07 AM, Jarek Poplawski <jarkao2@...il.com> wrote:
> > > I think it would be easier just to start with this working -rc6 and
> > > simply check if we have 'right' suspects, so: git-net.patch and
> > > git-nfsd.patch from -mm1-broken-out, as suggested by Herbert (I hope,
> > > can compile - otherwise you could try the other way: add the whole -mm
> > > and revert these two). Using current gits could complicate this
> > > "investigation".
> >
> > OK, I will try this...
This is still on my to-do list; I have not had time to try it yet...
> It seems that this last report gives the third one: ieee1394 to the pack,
> so probably, you can hold on a "minute" - this all needs some rethinking.
> (But, if you've begun with this already, let it be clear at last too.)
I don't think ieee1394 is to blame here. See http://lkml.org/lkml/2007/11/29/372
That mail was my first report of these crashes.
The first crash in it is a similar oops in the ieee1394 code, and my
first reaction was to blame ieee1394. But switching to a real network
card did not solve the problem, as the second crash in that mail shows.
Also, Stefan Richter said in http://lkml.org/lkml/2007/11/29/419:
"FWIW, eth1394 and the entire rest of the 1394 stack beneath eth1394
are identical between -mm and Linus' tree."
I'm still using the old ieee1394-stack and not the new firewire one,
as eth1394 had not been ported at that time.
It might be possible that these are two different bugs, but two bugs
with the same symptoms of corrupted lists showing up at the same time
seem unlikely.
In particular, this last report of the oops in 1394 looks rather
strange. Things can only go onto hpsbpkt_queue if they have a non-NULL
complete_routine (see queue_packet_complete() in
drivers/ieee1394/ieee1394_core.c). But a call to a NULL
complete_routine seems to be the cause of one of the two oopses. So it
looks like the hpsbpkt_queue list got mangled. But this list is only
used in this file, and all three places that access it are
protected by the spinlock pending_packets_lock.
So my personal conclusion would be that someone is writing to memory
that he no longer owns, most probably zero bytes (the complete_routine
got NULLed, and there was the warning about dst->__refcnt being 0).
Use-after-free or something else?
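To make the reasoning above concrete, here is a minimal user-space
sketch of the invariant. It is not the kernel code: struct packet,
queue_packet_complete_model(), drain_queue_model(), count_completion()
and queue_lock are names made up for illustration, and a pthread mutex
stands in for the spinlock pending_packets_lock. It only shows that if
entries are queued solely with a non-NULL complete_routine and the list
is touched only under the lock, then a NULL callback found while
draining must come from a write outside these functions.

```c
/* User-space model only; all names here are illustrative assumptions. */
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

struct packet {
    struct packet *next;
    void (*complete_routine)(void *);
    void *complete_data;
};

static struct packet *queue_head;
static struct packet **queue_tail = &queue_head;
/* Stands in for the spinlock that protects the real list. */
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;

/* Queue a packet only if it has a completion callback. Returns 1 if queued. */
static int queue_packet_complete_model(struct packet *p)
{
    if (p->complete_routine == NULL)
        return 0;
    pthread_mutex_lock(&queue_lock);
    p->next = NULL;
    *queue_tail = p;
    queue_tail = &p->next;
    pthread_mutex_unlock(&queue_lock);
    return 1;
}

/*
 * Detach the whole list under the lock, then run the callbacks.
 * Returns the number of entries whose callback was NULL, which the
 * enqueue path above can never produce by itself.
 */
static int drain_queue_model(void)
{
    struct packet *p;
    int violations = 0;

    pthread_mutex_lock(&queue_lock);
    p = queue_head;
    queue_head = NULL;
    queue_tail = &queue_head;
    pthread_mutex_unlock(&queue_lock);

    while (p != NULL) {
        struct packet *next = p->next;

        if (p->complete_routine != NULL)
            p->complete_routine(p->complete_data);
        else
            violations++; /* the real code would oops here instead */
        p = next;
    }
    return violations;
}

/* Example callback: counts how many completions ran. */
static void count_completion(void *data)
{
    ++*(int *)data;
}
```

For example, queuing a packet with count_completion as its callback and
then zeroing its complete_routine field with memset() before draining
makes drain_queue_model() report one violation. That is the pattern the
oops suggests: the entry was valid when queued and was overwritten
afterwards by something that does not take the lock.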
[snip]
> > If you think some other slub_debug might catch it, I would try this...
>
> OK! But, in the meantime could you send your current .config?
Attached. (This is the last one I used with 2.6.24-rc6-mm1. For all
other tests I copied it and ran make oldconfig.)
> I wonder
> e.g. if there could be used this new ieee1394 code from
> init_ohci1394_dma.c?
Interesting. I didn't even know about this file / option.
But four things make an involvement rather doubtful:
a) I do not find a single line like "init_ohci1394_dma: initializing
OHCI-1394" in any of the syslogs.
b) I do not have the parameter ohci1394_dma=early set.
c) # CONFIG_PROVIDE_OHCI1394_DMA_INIT is not set
d) I have seen the crash in svc_xprt_enqueue() without eth1394, and
during that test not a single FireWire device was attached.
I will now try broken-out-patches...
Torsten
View attachment "2.6.24-rc6-mm1-config.txt" of type "text/plain" (51491 bytes)