Message-ID: <87zkc1w6wp.fsf@xmission.com>
Date: Wed, 29 Feb 2012 15:34:30 -0800
From: ebiederm@...ssion.com (Eric W. Biederman)
To: Keith Chew <keith.chew@...il.com>
Cc: linux-kernel <linux-kernel@...r.kernel.org>
Subject: Re: Hang on "echo b > /proc/sysrq-trigger"
Keith Chew <keith.chew@...il.com> writes:
> Hi Eric
>
>>
>> Historically a lot of issues have had to do with which cpu you are
>> entering the bios from. So you might try pinning your process
>> to different cpus and see if you can make the failure more deterministic.
>>
>
> We are using a Celeron 575 uniprocessor, so we do not have the option
> to pin on another cpu. I have tried compiling the kernel in both UP
> and SMP configuration, but sadly both cause the hang.
Ok. That rules out a bunch of things, and emergency_restart may not
be much different in practice.
>> Ugh. The other possibility is that there is an intermittent failure in
>> the hardware, that prevents the boot/reboot. Wrong values on pull-up
>> resistors have been known to cause that kind of thing.
>>
>
> Thank you very much for this pointer, will feed that back to the
> manufacturer and see if it will give them some clues. The original
> purpose for this reboot exercise was to ensure the software will
> handle a power failure without any OS/data corruptions. With this new
> discovery of unreliable reboot, the next worry is "If reboot is not
> reliable, is the boot process also susceptible to the same issue?". I
> have not rigged up any hardware to simulate a periodic full shutdown
> and boot up process, but will be planning to set this up next.
>
> Thanks again, if you have any other suggestions for us to try, I am
> all ears!
I would check with your BIOS folks and perhaps play with the kernel
option. The most reliable way to perform a reset is to trigger a board
reset by writing to 0xcf9 or a similar register. I expect your BIOS
does that and you can probably get the kernel to do that. I would
definitely test to see if you can write to the mostly standard
0xcf9 register directly from the kernel and trigger a reset directly.
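
For illustration only, here is a rough sketch of the two-step write that
a 0xcf9 reset usually involves. The bit meanings (bit 1 selects a hard
reset, a 0 -> 1 transition on bit 2 starts it) are what I understand
Intel-style chipsets to use; check your chipset datasheet before relying
on them. The struct port_io hooks and the cf9_* names are hypothetical,
introduced here so the logic can be exercised without touching real
hardware. In the kernel they would map onto inb(), outb() and udelay().

```c
#include <stdint.h>

#define CF9_PORT     0x0cf9u /* reset control register (assumed Intel-style layout) */
#define CF9_SYS_RST  0x02u   /* bit 1: request a hard (system) reset */
#define CF9_RST_CPU  0x04u   /* bit 2: a 0 -> 1 transition starts the reset */

/* Hypothetical port I/O hooks; not an existing kernel or libc interface. */
struct port_io {
    uint8_t (*read8)(uint16_t port);
    void (*write8)(uint8_t value, uint16_t port);
    void (*delay_us)(unsigned int usec);
};

/* First write: clear both reset bits, then select a hard reset. */
uint8_t cf9_request_value(uint8_t current)
{
    return (uint8_t)((current & ~(CF9_SYS_RST | CF9_RST_CPU)) | CF9_SYS_RST);
}

/* Second write: keep the hard reset selected and raise the trigger bit. */
uint8_t cf9_trigger_value(uint8_t current)
{
    return (uint8_t)(cf9_request_value(current) | CF9_RST_CPU);
}

/* Read-modify-write sequence that preserves the other bits in the register. */
void cf9_hard_reset(const struct port_io *io)
{
    uint8_t current = io->read8(CF9_PORT);

    io->write8(cf9_request_value(current), CF9_PORT);
    io->delay_us(50); /* short settle time between the two writes */
    io->write8(cf9_trigger_value(current), CF9_PORT);
    io->delay_us(50);
}
```

If the register reads back as 0, the two writes are 0x02 followed by
0x06. Splitting the value computation from the port access is only a
convenience for testing on a machine you do not want to reset.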

Once past a reset and with a single cpu all of the failures will be
happening in the boot path. So the only possible points of failure
are in devices that are different between a soft reset and a power-on
reset.

I would check to see if your board perhaps supports POST codes or any
other debugging that will let you see where you are hanging.

It sounds like there is some very rare failure that is going to be
a challenge to track down. I would definitely test more than one
motherboard to ensure that you can reproduce the problem on more
than one piece of hardware. Sometimes hardware is just broken.

Eric