Message-Id: <201203152208.04168.petr@tesarici.cz>
Date: Thu, 15 Mar 2012 22:08:03 +0800
From: Petr Tesařík <petr@...arici.cz>
To: Dave Jones <davej@...hat.com>
Cc: Yang Bai <hamo.by@...il.com>,
Fengguang Wu <fengguang.wu@...el.com>,
Linux Kernel <linux-kernel@...r.kernel.org>,
Fedora Kernel Team <kernel-team@...oraproject.org>,
kernel@...arici.cz
Subject: Re: inode->i_wb_list corruption.
On Saturday, 10 March 2012 02:00:15, Dave Jones wrote:
> (trimmed cc)
>
> On Sat, Mar 10, 2012 at 12:14:37AM +0800, Yang Bai wrote:
> > On Fri, Mar 9, 2012 at 11:19 PM, Dave Jones <davej@...hat.com> wrote:
> > > And with that, this arrived..
> > > https://bugzilla.redhat.com/show_bug.cgi?id=788433#c3
> > >
> > > I'm leaning strongly towards believing this is yet another case of
> > > i915 corrupting memory on resume.
> >
> > Nice catch. I am wondering
> > 1) why all lists being affected and
> > 2) why all list_head's prev being set to NULL.
> >
> > Any ideas?
>
> This is probably the same bug:
> https://bugzilla.kernel.org/show_bug.cgi?id=37142
> Petr noticed that the corruption is 32 bytes getting zeroed at the
> beginning of a page.
>
> I think this may be responsible for a lot of different bugs that we've
> had reported.
>
> i915_drm_thaw is a deep nest of functions though, so this is going to be
> hard to track down where that write is coming from. Because the corruption
> seems to happen to pages that are already allocated, we probably can't
> even rely on DEBUG_PAGEALLOC, though it might be worth trying.
If you believe it could be written by the CPU, I can try to catch the
instruction that writes to this memory. My plan is as follows:
Set up all the hardware debug registers to trap writes to the pages that are
likely to get corrupted. Remember, I've always seen the corruption happen in
roughly the same physical memory area.
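As a rough illustration of what arming the debug registers involves, here is a
minimal sketch that builds an x86 DR7 control value with all four slots
(DR0..DR3) set up as write watchpoints. The bit layout follows the Intel SDM;
the function name and the choice of 8-byte regions are assumptions made for
this example, not something taken from the thread. Each slot covers at most
8 bytes at a linear (virtual) address, so one slot placed at the start of a
page is enough to notice a CPU write that zeroes the first 32 bytes of it.

```python
# Sketch only: compute a DR7 value, do not try to load it from user space.
# Field encodings are from the Intel SDM debug-register description.

RW_WRITE = 0b01   # R/W field: break on data writes only
LEN_8 = 0b10      # LEN field: 8-byte region (64-bit mode), 8-byte aligned


def dr7_for_write_watchpoints(slots=4, rw=RW_WRITE, length=LEN_8):
    """Return a DR7 value that globally enables `slots` watchpoints.

    Hypothetical helper for illustration; reserved bits and the
    LE/GE/GD control bits are left at zero here.
    """
    if not 1 <= slots <= 4:
        raise ValueError("x86 has exactly four address debug registers")
    dr7 = 0
    for i in range(slots):
        dr7 |= 1 << (2 * i + 1)          # G<i>: global enable for slot i
        dr7 |= rw << (16 + 4 * i)        # R/W<i>: access type to trap
        dr7 |= length << (18 + 4 * i)    # LEN<i>: size of watched region
    return dr7


print(hex(dr7_for_write_watchpoints()))  # 0x999900aa
```

In a real kernel one would normally go through the hw_breakpoint layer rather
than writing DR7 by hand; the sketch only shows why there are exactly four
slots to spend.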
I know, there are only 4 registers I can use, and the potential corruption
area is much larger than 4 pages, but with enough reboots, the chance is quite
high that I'll be lucky.
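To put a number on "enough reboots", here is a back-of-the-envelope sketch.
It assumes, purely for illustration, that each resume corrupts one page chosen
uniformly at random from a fixed set of candidate pages, independently of
earlier attempts, and that each debug register watches one of those pages. The
256-page area in the usage line is a made-up figure, not a measurement.

```python
import math


def chance_of_catch(candidate_pages, reboots, watched=4):
    """Probability that at least one attempt hits a watched page.

    Assumes a uniform, independent choice of the corrupted page on
    every attempt (an assumption, not an observed property of the bug).
    """
    p_single = min(1.0, watched / candidate_pages)
    return 1.0 - (1.0 - p_single) ** reboots


def reboots_needed(candidate_pages, target, watched=4):
    """Smallest number of attempts whose catch probability reaches `target`."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    p_single = min(1.0, watched / candidate_pages)
    if p_single >= 1.0:
        return 1
    n = max(1, math.ceil(math.log(1.0 - target) / math.log(1.0 - p_single)))
    # Guard against floating-point rounding at the boundary.
    while chance_of_catch(candidate_pages, n, watched) < target:
        n += 1
    return n


# Hypothetical example: a 256-page (1 MiB) suspect area, 90% confidence.
print(reboots_needed(256, 0.90))
```

If the corrupted page is not uniformly distributed, placing the watchpoints on
the pages hit most often in earlier runs would improve on this estimate.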
I haven't gone for that plan yet, because I thought the area was in fact
written to by someone else on the PCI bus, not the CPU. If nothing else, I can
verify that. ;-)
Dave, do you think the result of such testing would help you resolve the bug?
Petr