linux-kernel - [BUG] x86: bootmem broken on SGI UV

Message-ID: <20101008213429.GB7223@sgi.com>
Date:	Fri, 8 Oct 2010 16:34:29 -0500
From:	Russ Anderson <rja@....com>
To:	linux-kernel <linux-kernel@...r.kernel.org>,
	Yinghai Lu <yinghai@...nel.org>, tglx@...utronix.de,
	"H. Peter Anvin" <h.peter.anvin@...el.com>
Cc:	Russ Anderson <rja@....com>, Jack Steiner <steiner@....com>
Subject: [BUG] x86: bootmem broken on SGI UV

	[BUG] x86: bootmem broken on SGI UV

Recent community kernels do not boot on SGI UV x86 hardware with
more than one socket.  I suspect the problem is due to recent 
bootmem/e820 changes.

What is happening is that the e820 table defines a memory range.

 BIOS-e820: 0000000100000000 - 0000001080000000 (usable)

The SRAT table shows that memory range is spread over two nodes.

 SRAT: Node 0 PXM 0 100000000-800000000
 SRAT: Node 1 PXM 1 800000000-1000000000
 SRAT: Node 0 PXM 0 1000000000-1080000000

Previously, the kernel's early_node_map[] would show three entries,
each with the proper node.
                                                                                
[    0.000000]     0: 0x00100000 -> 0x00800000
[    0.000000]     1: 0x00800000 -> 0x01000000
[    0.000000]     0: 0x01000000 -> 0x01080000
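
The correspondence between the SRAT byte addresses and the
early_node_map[] entries above can be checked by hand. The following is
a minimal sketch, assuming 4 KiB pages (a page shift of 12, the usual
x86_64 value). The helper name srat_to_pfn_ranges is made up for
illustration and is not a kernel function.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages

# (node id, start address, end address), copied from the SRAT lines above
SRAT = [
    (0, 0x100000000, 0x800000000),
    (1, 0x800000000, 0x1000000000),
    (0, 0x1000000000, 0x1080000000),
]


def srat_to_pfn_ranges(entries, page_shift=PAGE_SHIFT):
    """Convert (node, start_addr, end_addr) to (node, start_pfn, end_pfn)."""
    return [(nid, start >> page_shift, end >> page_shift)
            for nid, start, end in entries]


# Print in a layout similar to the early_node_map[] dump
for nid, start_pfn, end_pfn in srat_to_pfn_ranges(SRAT):
    print(f"{nid:5d}: 0x{start_pfn:08x} -> 0x{end_pfn:08x}")
```

Each SRAT range maps to exactly one PFN range on one node, which is what
the older kernels reported.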

The problem is that early_node_map[] in recent community kernels
shows only two entries, with the node 0 entry overlapping the node 1
entry.

    0: 0x00100000 -> 0x01080000
    1: 0x00800000 -> 0x01000000

As a result, the range 0x800000 -> 0x1000000 gets freed twice
(by free_all_memory_core_early()), which produces nasty warnings.
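
The overlap can be shown with a small check over the two versions of
the map. This is only an illustrative sketch, not kernel code: the
function find_overlaps and the names GOOD_MAP and BAD_MAP are
hypothetical, and the entries are the (node, start PFN, end PFN) values
quoted above, with the end treated as exclusive.

```python
def find_overlaps(node_map):
    """Return (node_a, node_b, start_pfn, end_pfn) for each overlapping pair.

    Each entry is (node id, start PFN, end PFN) with an exclusive end.
    """
    overlaps = []
    for i, (nid_a, start_a, end_a) in enumerate(node_map):
        for nid_b, start_b, end_b in node_map[i + 1:]:
            lo = max(start_a, start_b)
            hi = min(end_a, end_b)
            if lo < hi:  # non-empty intersection
                overlaps.append((nid_a, nid_b, lo, hi))
    return overlaps


# early_node_map[] as shown by older kernels
GOOD_MAP = [
    (0, 0x100000, 0x800000),
    (1, 0x800000, 0x1000000),
    (0, 0x1000000, 0x1080000),
]

# early_node_map[] as shown by the failing kernel
BAD_MAP = [
    (0, 0x100000, 0x1080000),
    (1, 0x800000, 0x1000000),
]

for name, node_map in (("good", GOOD_MAP), ("bad", BAD_MAP)):
    for nid_a, nid_b, lo, hi in find_overlaps(node_map):
        print(f"{name}: node {nid_a} and node {nid_b} "
              f"both cover 0x{lo:x} -> 0x{hi:x}")
```

The good map yields no overlaps. The bad map yields one, and it begins
at PFN 0x800000, the same pfn reported in the "Bad page state" message
below.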

 Queued invalidation will be enabled to support x2apic and Intr-remapping.
 BUG: Bad page state in process swapper  pfn:800000
 page:ffffea001c000000 count:0 mapcount:0 mapping:(null) index:0x0
 page flags: 0x60000000080000(buddy)
 Pid: 0, comm: swapper Not tainted 2.6.36-rc6-next-20101006-medusa #23
 Call Trace:
  [<ffffffff810a590f>] ? dump_page+0xc7/0xcc
  [<ffffffff810a61d8>] bad_page+0xeb/0xfd
  [<ffffffff810a6a27>] free_pages_prepare+0x68/0xa3
  [<ffffffff810a7a39>] __free_pages_ok+0x1d/0xf6
  [<ffffffff810a7cf8>] __free_pages+0x22/0x24
  [<ffffffff816c7871>] __free_pages_bootmem+0x55/0x57
  [<ffffffff816a7b48>] free_all_memory_core_early+0xeb/0x156
  [<ffffffff816a07a7>] numa_free_all_bootmem+0x7f/0x8b
  [<ffffffff81398dbc>] ? _etext+0x0/0x24
  [<ffffffff8169f5ea>] mem_init+0x1e/0xec
  [<ffffffff81689bea>] start_kernel+0x1c8/0x3ea
  [<ffffffff81689140>] ? early_idt_handler+0x0/0x71
  [<ffffffff816892af>] x86_64_start_reservations+0xb6/0xba
  [<ffffffff81689401>] x86_64_start_kernel+0x14e/0x15d
 Disabling lock debugging due to kernel taint

Shortly thereafter the kernel dies trying to get a page off the
freelist.

 Calibrating delay loop (skipped), value calculated using timer frequency.. 5333.99 BogoMIPS (lpj=10667996)
 pid_max: default: 32768 minimum: 301
 general protection fault: 0000 [#1] SMP
 last sysfs file:
 CPU 0
 Modules linked in:
 
 Pid: 0, comm: swapper Tainted: G    B       2.6.36-rc6-next-20101006-medusa #23 /Stoutland Platform
 RIP: 0010:[<ffffffff810a5557>]  [<ffffffff810a5557>] __rmqueue+0x7c/0x311
 RSP: 0000:ffffffff816019b8  EFLAGS: 00010002
 RAX: dead000000200200 RBX: dead000000200200 RCX: ffffea0037fe4e28
 RDX: dead000000100100 RSI: ffff880ffffda010 RDI: 0000000000000006
 RBP: ffffffff816019f8 R08: 0000000000000001 R09: ffff880ffffda078
 R10: 0000000000000000 R11: 0000000000000000 R12: dead000000100100
 R13: ffffea0037fe4e00 R14: 0000000000000286 R15: ffff880ffffd9e00
 FS:  0000000000000000(0000) GS:ffff880075c00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
 CR2: 0000000000000000 CR3: 0000000001604000 CR4: 00000000000006b0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
 Process swapper (pid: 0, threadinfo ffffffff81600000, task ffffffff8160c020)
 Stack:
  0000000000000000 ffff880ffffd9e00 0000000000000000 ffffea0037fe5bf0
 <0> ffff880075c16158 0000000000000003 0000000000000286 ffff880ffffd9e00
 <0> ffffffff81601b28 ffffffff810a81d4 ffff88107ffd8e00 ffffffff81600000
 Call Trace:
  [<ffffffff810a81d4>] get_page_from_freelist+0x2f7/0x787
  [<ffffffff810a6bd7>] ? zone_watermark_ok+0x25/0xb8
  [<ffffffff810a8a3c>] __alloc_pages_nodemask+0x160/0x6ac
  [<ffffffff810a8a3c>] ? __alloc_pages_nodemask+0x160/0x6ac
  [<ffffffff811d7f2b>] ? cpumask_next_and+0x2d/0x3e
  [<ffffffff810d2aa2>] alloc_page_interleave+0x36/0x80
  [<ffffffff810d2bcf>] alloc_pages_current+0x7c/0xd1
  [<ffffffff8102f1f9>] __change_page_attr_set_clr+0x75f/0xc1f
  [<ffffffff810c8635>] ? vm_unmap_aliases+0x179/0x188
  [<ffffffff8102f830>] change_page_attr_set_clr+0x177/0x3ed
  [<ffffffff8102fe40>] ? set_memory_nx+0x3b/0x3d
  [<ffffffff8102fc2a>] set_memory_x+0x3b/0x3d
  [<ffffffff8169931b>] efi_enter_virtual_mode+0x22a/0x269
  [<ffffffff81689d99>] start_kernel+0x377/0x3ea
  [<ffffffff81689140>] ? early_idt_handler+0x0/0x71
  [<ffffffff816892af>] x86_64_start_reservations+0xb6/0xba
  [<ffffffff81689401>] x86_64_start_kernel+0x14e/0x15d
 Code: 0f 84 a9 00 00 00 48 8b 4c 0a 68 49 bc 00 01 10 00 00 00 ad de 48 bb 00 02 20 00 00 00 ad de 4c 8d 69 d8 49 8b 55 28 49 8b 45 30 <48> 89 42 08 48 89 10 4d 89 65 28 49 89 5d 30 0f ba 71 d8 13 b8
 RIP  [<ffffffff810a5557>] __rmqueue+0x7c/0x311
  RSP <ffffffff816019b8>
 ---[ end trace 4eaa2a86a8e2da22 ]---
 Kernel panic - not syncing: Attempted to kill the idle task!
   

I have not tracked down the exact change that caused the regression, but
suspect it involves the recent bootmem/e820 changes.

Attached is the full boot output.

-- 
Russ Anderson, OS RAS/Partitioning Project Lead  
SGI - Silicon Graphics Inc          rja@....com

View attachment "output.2.6.36-rc6.fail" of type "text/plain" (105838 bytes)