lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Tue, 17 Oct 2006 00:45:39 +0200
From:	"Jesper Juhl" <jesper.juhl@...il.com>
To:	"Linus Torvalds" <torvalds@...l.org>
Cc:	"Linux Kernel Mailing List" <linux-kernel@...r.kernel.org>,
	"Andrew Morton" <akpm@...l.org>
Subject: Re: Simple script that locks up my box with recent kernels

Ok, finally got to the end of the bisection (see below; quoting all of
my previous email since my concerns from that one are still valid).


On 09/10/06, Jesper Juhl <jesper.juhl@...il.com> wrote:
> Ok, some preliminary results on this before I go get some sleep + a
> working day tomorrow...
>
>
> On 07/10/06, Linus Torvalds <torvalds@...l.org> wrote:
> >
> >
> > On Sat, 7 Oct 2006, Jesper Juhl wrote:
> > >
> > > > Can I bother you to just bisect it?
> > >
> > > Sure, but it will take a little while since building + booting +
> > > starting the test + waiting for the lockup takes a fair bit of time
> > > for each kernel
> >
> > Sure. That said, we've tried to narrow down things that took hours or days
> > (under real loads, not some nice test-script) to reproduce, and while it
> > doesn't always work, the real problem tends to be if the problem case
> > isn't really reproducible. It sounds like yours is pretty clear-cut, and
> > that will make things much easier.
> >
>
> Yeah, it seems pretty clear-cut, but I'm a bit nervous that it may
> sometimes take longer than my observed 60min to reproduce, rendering
> my git-bisection less than perfect (more on that below).
>
>
> > > and also due to the fact that my git skills are pretty
> > > limited, but I'll figure it out (need to improve those git skills
> > > anyway) :-)
> >
> > "git bisect" in particular isn't that hard to use, and it will really do
> > a lot of heavy lifting for you.
> >
> (...)
> Thanks a lot for the tutorial, that really helped.
>
> For some reason I couldn't get git to accept 2.6.17.13 as a "good"
> starting point, so I used 2.6.17 instead, and the sha1 you gave me for
> 2.6.18-git15 as the "bad" starting point.
>
> Here's where I am right now (a log of what I've done) :
>
> [bisection start]
>
> Bisecting: 5188 revisions left to test after this
> [92164c5dd1ade33f4e90b72e407910de6694de49] USB: OHCI hub code unaligned access
>
> [git bisect good]
>
> Bisecting: 2567 revisions left to test after this
> [e41542f5167d6b506607f8dd111fa0a3e468ccb8] [DCCP]: Introduce dccp_probe
>
> [git bisect good]
>
> Bisecting: 1351 revisions left to test after this
> [b98adfccdf5f8dd34ae56a2d5adbe2c030bd4674] Merge
> master.kernel.org:/pub/scm/linux/kernel/git/lethal/sh-2.6
>
> [git bisect good]
>
> Bisecting: 635 revisions left to test after this
> [538d9d532b0e0320c9dd326a560b5a72d73f910d] irq: remove a extra line
>
> [git bisect good]
>
> Bisecting: 292 revisions left to test after this
> [db1a19b38f3a85f475b4ad716c71be133d8ca48e] Merge branch
> 'intelfb-patches' of
> master.kernel.org:/pub/scm/linux/kernel/git/airlied/intelfb-2.6
>
> [git bisect bad]
>
> Bisecting: 146 revisions left to test after this
> [1db27c11e9a0c6d659040ac0b7c64a339e248fa1] istallion: Remove private
> baud rate decoding, which is also broken in this case on some
> platforms
>
> [git bisect bad]
>
> Bisecting: 73 revisions left to test after this
> [3171a0305d62e6627a24bff35af4f997e4988a80] simplify update_times
> (avoid jiffies/jiffies_64 aliasing problem)
>
> [git bisect good]
>
> Bisecting: 37 revisions left to test after this
> [29b884921634e1e01cbd276e1c9b8fc07a7e4a90] set EXIT_DEAD state in
> do_exit(), not in schedule()
>
> [currently testing this kernel]
>
>
> Looking at "git bisect visualize" the current status is this :
>
> bisect/good: 3171a0305d62e6627a24bff35af4f997e4988a80
> bisect/bad: 1db27c11e9a0c6d659040ac0b7c64a339e248fa1
> Current bisect marker at: 29b884921634e1e01cbd276e1c9b8fc07a7e4a90
>
>
> I'm a little worried though that my results may not be completely reliable.
>
> There's no doubt that you can trust the kernels that I told git were
> "bad" since those resultet in a hang and there's just no getting
> around that. So we know for a fact that the bad commit is somewhere
> between my last found bad kernel and 2.6.17, what we don't know with
> the same amount of certainty is if the bad commit is between my last
> found good kernel and the last found bad one.
>
> What I'm worried about is the kernels I've marked as "good".  Before
> starting this run I had never experienced a hang if the kernel
> survived past the one hour mark, so I concluded that testing each
> kernel for 80min would be enough to prove it good or bad. This now
> seems to be not completely reliable since my second bad kernel
> happened to hang after ~2hrs. This happened since I forgot to check my
> computer after 80min and only came back to it some 3hrs later (I know
> the time it hung since I had a xterm doing   while true;do sleep
> 10;uptime;done  running, so I could check.
>
> This all means that my testing and concluding kernels were "good"
> after 80min of test runtime may not be 100% reliable.
>
> Is it useful for me to continue bisecting from the point I'm at, or
> should I reset from good==2.6.17 and bad==the_last_bad_commit_I_found
> ?   Or do you have a likely culprit I should try revoking?
>
> Whatever your answer it'll have to wait until tomorrow evening since
> I'm going to go get some sleep now, but please let me know what you'd
> like me to do ...
>

In the end, this is what git told me :

1db27c11e9a0c6d659040ac0b7c64a339e248fa1 is first bad commit
commit 1db27c11e9a0c6d659040ac0b7c64a339e248fa1
Author: Alan Cox <alan@...rguk.ukuu.org.uk>
Date:   Fri Sep 29 02:01:38 2006 -0700

    [PATCH] istallion: Remove private baud rate decoding, which is
also broken in this case on some platforms

    Signed-off-by: Alan Cox <alan@...hat.com>
    Signed-off-by: Andrew Morton <akpm@...l.org>
    Signed-off-by: Linus Torvalds <torvalds@...l.org>

:040000 040000 0fc700de5e78b39acc130d529cf59437e9242b68
884b27574b6c38a5fa952d09ca945b167e36db84 M  drivers


But, that doesn't make much sense, so I very strongly suspect that my
test case was not as reliable as I thought.
We can trust the commits I marked as 'bad' though since there's no
getting around a complete lockup of the box. So we know for sure now
that things broke between 2.6.17 and the commit above. But since that
commit makes no sense as the cause of the breakage it must be a case
of me having marked a kernel as 'good' that would eventually have
turned out bad if I'd run it longer :-(

Where do I go from here?   The problem is still there...    I'll test
2.6.19-rc2 tomorrow, but apart from that I don't know how to proceed
apart from trying to capture a sysrq+t dump when the box locks up...
any ideas?


-- 
Jesper Juhl <jesper.juhl@...il.com>
Don't top-post  http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please      http://www.expita.com/nomime.html
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists