Message-ID: <5d6222a80805090828j1a1af054y547640c3408bca1b@mail.gmail.com>
Date: Fri, 9 May 2008 12:28:32 -0300
From: "Glauber Costa" <glommer@...il.com>
To: "Hugh Dickins" <hugh@...itas.com>
Cc: "Theodore Ts'o" <tytso@....edu>,
"Glauber Costa" <gcosta@...hat.com>, "Ingo Molnar" <mingo@...e.hu>,
linux-kernel@...r.kernel.org
Subject: Re: Possible regression? 2.6.26-rc1: T61s failure after suspend/resume
On Thu, May 8, 2008 at 6:48 PM, Hugh Dickins <hugh@...itas.com> wrote:
> On Thu, 8 May 2008, Theodore Ts'o wrote:
> >
> > I'm running a kernel based off of commit afa26be8 (just six commits
> > after 2.6.26-rc1), and very shortly after I suspend/resume my X61s (with
> > the Intel video chipset), the X server will lock up. I can ssh into
> > the machine remotely, and restart the X server, but the newly restarted
> > X server will shortly lock up again, and the only way to solve the
> > problem is to reboot. If I drop back to a 2.6.25 based kernel, the
> > problem goes away.
> >
> > I've tried bisecting it, but the bisection points picked by git don't
> > boot at all, and given that I'm travelling I haven't had much time to try
> > doing more bisecting; since I know a number of kernel developers have
> > Lenovo X61 laptops, I thought before I wasted more time trying to get
> > the git bisection to work, I'd check to see if anyone has seen this
> > problem and if the fix is known. I'll also try the latest bleeding edge
> > kernel and hope it's fixed there....
>
> I don't have a Lenovo X61, and I've no problem on my uniprocessor T43p.
> But I also have a Fujitsu Siemens Esprimo Mobile, Core2 Duo and Intel
> graphics like yours, and that's been behaving strangely after resume
> from RAM since somewhere between 2.6.25 and 2.6.26-rc1.
>
> Sounds like it might be the same problem, though I quickly moved away
> from trying it with X, and have been trying to investigate just from the
> console for some days now. Weird memory corruption after resume.
>
> Like you, little success with bisection: probably-other bugs get in
> the way. Some bisection points don't boot, some don't come back from
> resume at all, some hang before getting to test. When, as a working
> hypothesis, I assumed that not coming back from resume might be the
> same problem manifesting in the return from resume itself, and shifted
> around bisection points a bit to avoid non-booting, then it arrived at
>
> commit 4fe29a85642544503cf81e9cf251ef0f4e65b162
> Author: Glauber de Oliveira Costa <gcosta@...hat.com>
> Date: Wed Mar 19 14:25:23 2008 -0300
> x86: use specialized routine for setup per-cpu area
>
> as the suspect commit. But I couldn't see anything obviously wrong
> with that; and it could well be no more guilty than shifting around
> the kernel address space somewhat. I've rather given up on the
> bisection angle; and indeed, since found that how the problem
> manifests varies somewhat from one day's git to another,
> from one config to another.
>
> It does not happen with maxcpus=1. Yesterday it occurred to me
> to try without CONFIG_PREEMPT=y; but reached no conclusion on that,
> it turns out preemption has been somehow essential to resume from
> RAM on this machine since before 2.6.25: clearly a separate issue.
> And resume from RAM running 64-bit on it is also long problematic.
>
> To reproduce the problem, I start off by building a kernel with
> make -j3 (from habit, perhaps with priming the pagecache in mind),
> then interrupt that around the time it gets to filemap.o, bootmem.o.
> I pm-suspend, close the lid, wait a few seconds, open the lid;
> make mrproper and start a make -j3 build again. (Though the very
> first time I noticed the problem, it was a segfault in a git pull
> after resume.)
>
> How quickly it goes bad varies a lot: often hangs right at the
> start while sedding stuff before getting down to the build itself.
> Often gets well into the build before gcc reports Real-time signal
> (most commonly 14 but others seen) killed cc1. But my favourite,
> the most distinctive failure, is segfault (usually in sh or make)
> at 20295564 ip .....2f2 error 6 in ld-2.6.1.so (openSUSE 10.3).
>
> Always 20295564; and objdumping ld-2.6.1.so shows 0x14 of that is
> just the offset from %edi, so the crucial address is 0x20295550.
> Which is "PU) ", though I've not found that string anywhere in
> the running vmlinux (but of course it does appear in kernel source).
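
For anyone following along, the arithmetic in the paragraph above can be
checked with a small sketch. It assumes the 0x14 displacement is exactly as
read from the objdump output, and that the 32-bit value is laid out
little-endian in memory (as on x86); the constant and function names are
only illustrative.

```python
import struct

FAULT_ADDR = 0x20295564  # faulting address from the segfault message
EDI_OFFSET = 0x14        # displacement from %edi, per the objdump reading


def base_as_ascii(fault_addr, offset):
    """Return the base pointer value and its bytes decoded as ASCII.

    The value is packed as a little-endian 32-bit word, which is how it
    would sit in memory on x86 if a string had overwritten a pointer.
    """
    base = fault_addr - offset
    text = struct.pack("<I", base).decode("ascii")
    return base, text


base, text = base_as_ascii(FAULT_ADDR, EDI_OFFSET)
print(hex(base), repr(text))  # prints: 0x20295550 'PU) '
```

So the bad pointer does look like four bytes of text rather than an
address, which fits the memory corruption theory.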
>
> Yesterday morning's git looked promising: because of the libata
> 70sec delay, I got diverted after the resume from RAM, left that
> laptop idle, and found hald-something-or-other had come in every
> few minutes and got that segfault at 20295564 (but with increasing
> ip addresses: some address-space randomization effect, I suppose).
> Well, I suppose it probably got run more often, but I'd only notice
> the segfaulting ones. So it can happen when close to idle; but
> I've not been able to reproduce that since.
>
> It's such a good signature, but I've failed to make progress with it.
> Ted, please try doing the same (and check your logs for existing
> segfault messages): let's see if you get the same number ;)
> though I've no idea what it'd tell us.
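
A rough way to do that log check, in case it helps: the sketch below
tallies faulting addresses from kernel log lines. It matches only the
"segfault at <hex>" fragment quoted above and assumes nothing about the
rest of the line; the function name is made up, and the sample line is just
the text from this thread, elided ip included.

```python
import re
from collections import Counter

# Only the "segfault at <hex address>" part is matched; the remainder of
# the kernel's log line format is deliberately not assumed here.
SEGFAULT_RE = re.compile(r"segfault at ([0-9a-fA-F]+)")


def count_fault_addresses(lines):
    """Count how often each faulting address appears in the given lines."""
    counts = Counter()
    for line in lines:
        match = SEGFAULT_RE.search(line)
        if match:
            counts[int(match.group(1), 16)] += 1
    return counts


# Sample taken verbatim from the report in this thread.
sample = ["segfault at 20295564 ip .....2f2 error 6 in ld-2.6.1.so"]
for addr, n in count_fault_addresses(sample).most_common():
    print(f"{addr:#x} seen {n} time(s)")
```

Feeding it the lines of a saved dmesg or syslog file instead of the sample
should show whether 20295564 turns up on other machines too.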
>
> Hugh
>
I can't reproduce it either, and after looking at the code over and over
again I see no obvious point of breakage. I'll try to reproduce it
myself to see if I can spot something. But correct me if I'm wrong: these
are all 64-bit machines, right?
I'm stuck with mostly 32-bit hardware, but will give it a try anyway.
--
Glauber Costa.
"Free as in Freedom"
http://glommer.net
"The less confident you are, the more serious you have to act."