linux-kernel - Re: [BUG] Linux 2.6.28.3 freezing on a 32-bits x86 Thinkpad T43p

Message-ID: <20090204211759.GK22608@elte.hu>
Date:	Wed, 4 Feb 2009 22:17:59 +0100
From:	Ingo Molnar <mingo@...e.hu>
To:	Mathieu Desnoyers <mathieu.desnoyers@...ymtl.ca>
Cc:	Andrew Morton <akpm@...ux-foundation.org>,
	Thomas Gleixner <tglx@...utronix.de>, Greg KH <greg@...ah.com>,
	ltt-dev@...ts.casi.polymtl.ca, linux-kernel@...r.kernel.org
Subject: Re: [BUG] Linux 2.6.28.3 freezing on a 32-bits x86 Thinkpad T43p


* Mathieu Desnoyers <mathieu.desnoyers@...ymtl.ca> wrote:

> Hi,
> 
> I've started experiencing freezes on my uniprocessor laptop with a
> 2.6.28.2/2.6.28.3 kernel with the LTTng patchset applied
> (http://git.kernel.org/?p=linux/kernel/git/compudj/linux-2.6-lttng.git;a=shortlog;h=2.6.28.3-lttng-0.88).
> Instrumentation is dynamically disabled when this happens, so it's
> unlikely that the LTTng patches would be causing this problem.
> 
> It happens when I work in X. The keyboard and mouse stop responding, and
> the machine stops responding on the network. It may take a few days to
> reproduce, and happens randomly when I actively use the computer (e.g.
> surfing with firefox).
> 
> I managed to install a 50' serial cable through my apartment to capture
> the following OOPS. It points to a NULL pointer dereference in
> kernel/timer.c:cascade(). My config has hrtimers and no_hz activated.
> I suspect a race involving the timer base lock or the interrupt
> disabling that protects the timer base.
> 
> Any idea what is going on with the timers here? In the meantime, I'll
> try to enable more debugging options to get more information when the
> problem reappears.

Hm, it would be nice to know which timer got corrupted. It could possibly
have gotten kfreed, reallocated and overwritten - which would crash things
like this.

There are a few ways to debug such things more directly:

1) enable CONFIG_DEBUG_PAGEALLOC=y. These days it's plenty fast and its
   overhead is not noticeable.

2) enable CONFIG_DEBUG_OBJECTS=y - you also need 'debug_objects' on the
   boot line for this to be activated. This will report such corruptions
   sooner and in a more specific way.

3) any particular reason why you have:

    # CONFIG_DEBUG_KERNEL is not set

   There's a number of goodies in that menu. CONFIG_DEBUG_LIST=y for
   example.
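
A minimal sketch of the .config fragment this adds up to, assuming the
mainline Kconfig symbol names (CONFIG_DEBUG_PAGEALLOC, CONFIG_DEBUG_OBJECTS,
CONFIG_DEBUG_LIST). CONFIG_DEBUG_OBJECTS_TIMERS is not mentioned above; it
is included on the assumption that timer tracking is what is wanted from
debugobjects here:

    CONFIG_DEBUG_KERNEL=y
    CONFIG_DEBUG_PAGEALLOC=y
    CONFIG_DEBUG_OBJECTS=y
    CONFIG_DEBUG_OBJECTS_TIMERS=y
    CONFIG_DEBUG_LIST=y

plus, appended to the kernel boot line:

    debug_objects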

It is highly unlikely that the timer list code is the culprit here - it has 
not changed in ages and it is very intensively used by all subsystems so 
breakages in it get found and reported very, very quickly.

btw., your stacktrace also has this:

>  [<c1010000>] kvm_mmu_pte_write+0xb0/0xa60

So in theory there could be some kvm-induced memory corruption as well.

Hope this helps,

	Ingo