linux-kernel - Re: 2.6.18-mm2 boot failure on x86-64

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Pine.LNX.4.64.0610052128570.29014@skynet.skynet.ie>
Date:	Thu, 5 Oct 2006 21:39:15 +0100 (IST)
From:	Mel Gorman <mel@....ul.ie>
To:	Andi Kleen <ak@...e.de>
Cc:	vgoyal@...ibm.com, Steve Fox <drfickle@...ibm.com>,
	Badari Pulavarty <pbadari@...ibm.com>,
	Martin Bligh <mbligh@...igh.org>,
	Andrew Morton <akpm@...l.org>,
	lkml <linux-kernel@...r.kernel.org>, netdev@...r.kernel.org,
	kmannth@...ibm.com, Andy Whitcroft <apw@...dowen.org>
Subject: Re: 2.6.18-mm2 boot failure on x86-64

On Thu, 5 Oct 2006, Andi Kleen wrote:

> On Thursday 05 October 2006 20:52, Vivek Goyal wrote:
>> On Thu, Oct 05, 2006 at 08:27:02PM +0200, Andi Kleen wrote:
>>> On Thursday 05 October 2006 19:57, Steve Fox wrote:
>>>> On Thu, 2006-10-05 at 17:40 +0200, Andi Kleen wrote:
>>>>
>>>>> Please don't snip the Code: line. It is fairly important.
>>>>
>>>> Sorry about that. The remote console I was using appears to overwrite
>>>> some text after I force the reboot. Here's a clean one.
>>>>
>>>> global ffffffffffffffff
>>>
>>> Ok that definitely shouldn't be in there.
>>>
>>> I guess we need to track when it gets corrupted. Can you send the full
>>> boot log with this patch applied?
>>>
>>
>> Just recalled one more observation about the problem when keith had
>> reported it last. If I just move .bss before .data_nosave instead
>> of it being at the end, keith's problem had disappeared.
>
> Yes, that could well be that it's something in the new bootmap
> management.  Steve's box failed at
>
> Using ACPI (MADT) for SMP configuration information
> Nosave address range: 000000000009a000 - 000000000009b000
> Nosave address range: 000000000009b000 - 00000000000a0000
> Nosave address range: 00000000000a0000 - 00000000000e0000
> Nosave address range: 00000000000e0000 - 0000000000100000
> Nosave address range: 00000000bff76000 - 00000000bff77000
> Nosave address range: 00000000bff77000 - 00000000bff98000
> Nosave address range: 00000000bff98000 - 00000000bff99000
> Nosave address range: 00000000bff99000 - 00000000c0000000
> Nosave address range: 00000000c0000000 - 00000000fec00000
> Nosave address range: 00000000fec00000 - 0000000100000000
> Allocating PCI resources starting at c4000000 (gap: c0000000:3ec00000)
> afinfo corrupted at init/main.c:512
>
> which is directly after that code does lots of stuff.
>
> Mel might want to take a look (and perhaps
> also cut down a little on the ugly printks ...)
>

Steve tested a patch with arch-independent zone-sizing backed out for 
x86_64 and things looked ok but that is no guarantee it is not a 
contributary factor. The "Nosave address range:" printks are related to a 
suspend problem that was reported .... end of June I believe.

I'll pick this up in the morning because I should have access to the same 
machine Steve does and see what I can come up with.

> BTW I found one of my test systems too now which does a lot of:
> I'm about to leave for vacation so i won't have time to track it down
> any time soon. But here is it for reference.
>

hmm, rather than bugging you with patches now, I'll see what I can find 
with the x86_64 machines I have access to and see can I reproduce it.

> -Andi
>
> Please enable the IOMMU option in the BIOS setup
> This costs you 64 MB of RAM
> Mapping aperture over 65536 KB of RAM @ 8000000
> Bad page state in process 'swapper'
> page:ffff810003ee5480 flags:0x0000000000000000 mapping:0000000000000000 mapcount:1 count:0
> Trying to fix it up, but a reboot is needed
> Backtrace:
>
> Call Trace:
> [<ffffffff8020ac84>] show_trace+0x34/0x47
> [<ffffffff8020aca9>] dump_stack+0x12/0x17
> [<ffffffff802586a7>] bad_page+0x57/0x81
> [<ffffffff80258791>] __free_pages_ok+0x64/0x247
> [<ffffffff807cca72>] free_all_bootmem_core+0xcc/0x1a9
> [<ffffffff807ca08b>] numa_free_all_bootmem+0x3b/0x77
> [<ffffffff807c915e>] mem_init+0x44/0x186
> [<ffffffff807bc5f0>] start_kernel+0x17b/0x207
> [<ffffffff807bc168>] _sinittext+0x168/0x16c
>
> Bad page state in process 'swapper'
> page:ffff810003ee54b8 flags:0x0000000000000000 mapping:0000000000000000 mapcount:1 count:0
> Trying to fix it up, but a reboot is needed
> Backtrace:
>
> Call Trace:
> [<ffffffff8020ac84>] show_trace+0x34/0x47
> [<ffffffff8020aca9>] dump_stack+0x12/0x17
> [<ffffffff802586a7>] bad_page+0x57/0x81
> [<ffffffff80258791>] __free_pages_ok+0x64/0x247
> [<ffffffff807cca72>] free_all_bootmem_core+0xcc/0x1a9
> [<ffffffff807ca08b>] numa_free_all_bootmem+0x3b/0x77
> [<ffffffff807c915e>] mem_init+0x44/0x186
> [<ffffffff807bc5f0>] start_kernel+0x17b/0x207
> [<ffffffff807bc168>] _sinittext+0x168/0x16c
>
>
> ... lots more of those ...
>

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/