linux-kernel - Re: RFC: starting a kernel-testers group for newbies

Message-ID: <alpine.LFD.1.10.0805020859080.5994@woody.linux-foundation.org>
Date:	Fri, 2 May 2008 09:28:08 -0700 (PDT)
From:	Linus Torvalds <torvalds@...ux-foundation.org>
To:	"Carlos R. Mafra" <crmafra2@...il.com>
cc:	Adrian Bunk <bunk@...nel.org>, Paul Mackerras <paulus@...ba.org>,
	Josh Boyer <jwboyer@...il.com>,
	Arjan van de Ven <arjan@...radead.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	"Rafael J. Wysocki" <rjw@...k.pl>, davem@...emloft.net,
	linux-kernel@...r.kernel.org, jirislaby@...il.com,
	Steven Rostedt <rostedt@...dmis.org>
Subject: Re: RFC: starting a kernel-testers group for newbies

On Fri, 2 May 2008, Carlos R. Mafra wrote:
> 
> So I would like to ask you what an user should do when facing what is
> probably a timing-related bug, as it appears I have the bad luck
> of hitting one.

Quite frankly, it will depend on the bug.

If it's *reliably* timing-related (which sounds crazy, but is not at all 
unheard of), it can be reliably bisected down to some totally unrelated 
commit that doesn't actually introduce the problem at all, but that 
reliably turns it on or off.

That can be very misleading, and can cause us to basically revert a good 
commit, only to not actually fix the bug (and possibly re-introduce the 
bug that the reverted commit tried to fix).

But sometimes it gives us a clue where the timing problem is. But quite 
frankly, that seems to be the exception rather than the rule.

There have been issues that literally seemed to depend on things like 
cacheline placement etc, where changing config options for code that was 
never actually even *run* would change timing just enough to show a bug 
pseudo-reliably or not at all.

The good news is that those timing issues are really quite rare. 

The bad news is that when they happen, they are almost totally 
undebuggable. 

> This same problem is still present with yesterday's git, and sometimes
> it hangs without hpet=disable and sometimes it doesn't. (And never
> with hpet=disable in the boot command line)

Hey, it may well be a HPET+NOHZ issue. But it could also be that HPET is 
the thing that just allows you to see the hang.
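
A rough sketch of why an intermittent hang like this is awkward to bisect:
a single clean boot does not prove a kernel is good. The helper below
estimates how many consecutive clean boots are needed before marking a
kernel "good". The name boots_needed, the per-boot hang probability p_hang
and the assumption that boots hang independently of each other are all
illustrative; none of them come from the report.

```python
import math


def boots_needed(p_hang, confidence=0.95):
    """Return how many consecutive clean boots are needed before a kernel
    can be called "good" with the given confidence.

    Assumption (hypothetical model, not measured): a bad kernel hangs on
    each boot independently with probability p_hang.
    """
    if not 0.0 < p_hang < 1.0:
        raise ValueError("p_hang must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    # A bad kernel survives n boots with probability (1 - p_hang) ** n.
    # Pick the smallest n that pushes this below 1 - confidence.
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_hang))


if __name__ == "__main__":
    # Example hang rates; these are made-up values for illustration.
    for p in (0.5, 0.3, 0.1):
        print(f"hang rate {p:.0%}: {boots_needed(p)} clean boots")
```

Under this model, a hang that shows up on one boot in ten needs 29 clean
boots per bisection step for 95% confidence, and anything that shifts the
hang rate (such as the vga= setting) changes that number.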

> And using vga=6 or vga=0x0364 makes a difference in the probability
> of hanging.

.. and yeah, these kinds of really odd and obviously totally unrelated 
issues are a sign of a bug that is either simply hardware instability or 
very subtly timing-related.

The reason I mention hardware instability is that there really are bugs 
that happen due to (for example) power supply instabilities. Brownouts 
under heavy load have been causes of problems, but perhaps surprisingly, 
so has _idle_ time thanks to sleep-states!

The latter is probably due to bad power conditioning on the CPU power 
lines, where the huge current swings (going from high CPU power to low, 
and back again) not only have made some motherboards "sing" (or "hum", 
depending on frequency) but also cause voltage instability, and then 
the CPU crashes.

Am I saying that's the reason you see problems? Probably not. Most 
instabilities really are due to kernel bugs. But hardware instabilities do 
happen, and they can have these kinds of odd effects.

> I am just waiting -rc1 to be released to send an email with my
> problem again, as I am unable to debug this myself.
> I think this is ok from my part, right?

Yes. You've been a good bug reporter, and kept at it. It's not your fault 
that the bug is hard to pin down. 

Quite frankly, it does sound like the hang happens somewhere around the 

	hpet_init
	hpet_acpi_add
	hpet_resources
	hpet_resources: 0xfed00000 is busy

printk's you added (correct?) and we've had tons of issues with NO_HZ, so 
at a guess it is timer-related.

(And I assume it's stable if/once it gets past that boot hang issue? That 
tends to mean that it's not some hardware instability, it's literally our 
init code).

			Linus