Date:	Thu, 24 Feb 2011 14:46:38 -0800 (PST)
From:	David Rientjes <rientjes@...gle.com>
To:	Tejun Heo <tj@...nel.org>, Ingo Molnar <mingo@...e.hu>
cc:	Yinghai Lu <yinghai@...nel.org>, tglx@...utronix.de,
	"H. Peter Anvin" <hpa@...or.com>, linux-kernel@...r.kernel.org
Subject: [patch] x86, mm: Fix size of numa_distance array

On Thu, 24 Feb 2011, Tejun Heo wrote:

> >> DavidR reported that x86/mm broke his numa emulation with 128M etc.
> >
> > That regression needs to be fixed. Tejun, do you know about that bug?
> 
> Nope, David said he was gonna look into what happened but never got
> back.  David?
> 

I merged x86/mm with Linus' tree; it booted fine without numa=fake but 
panicked with numa=fake=128M (the oops could only be captured with 
earlyprintk):

[    0.000000] BUG: unable to handle kernel paging request at ffff88007ff00000
[    0.000000] IP: [<ffffffff818ffc15>] numa_alloc_distance+0x146/0x17a
[    0.000000] PGD 1804063 PUD 7fefd067 PMD 7fefe067 PTE 0
[    0.000000] Oops: 0002 [#1] SMP 
[    0.000000] last sysfs file: 
[    0.000000] CPU 0 
[    0.000000] Modules linked in:
[    0.000000] 
[    0.000000] Pid: 0, comm: swapper Not tainted 2.6.38-x86-mm #1
[    0.000000] RIP: 0010:[<ffffffff818ffc15>]  [<ffffffff818ffc15>] numa_alloc_distance+0x146/0x17a
[    0.000000] RSP: 0000:ffffffff81801d28  EFLAGS: 00010006
[    0.000000] RAX: 0000000000000009 RBX: 00000000000001ff RCX: 0000000000000ff8
[    0.000000] RDX: 0000000000000008 RSI: 000000007feff014 RDI: ffffffff8199ed0a
[    0.000000] RBP: ffffffff81801dc8 R08: 0000000000001000 R09: 000000008199ed0a
[    0.000000] R10: 000000007feff004 R11: 000000007fefd000 R12: 00000000000001ff
[    0.000000] R13: ffff88007feff000 R14: ffffffff81801d28 R15: ffffffff819b7ca0
[    0.000000] FS:  0000000000000000(0000) GS:ffffffff818da000(0000) knlGS:0000000000000000
[    0.000000] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.000000] CR2: ffff88007ff00000 CR3: 0000000001803000 CR4: 00000000000000b0
[    0.000000] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    0.000000] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[    0.000000] Process swapper (pid: 0, threadinfo ffffffff81800000, task ffffffff8180b020)
[    0.000000] Stack:
[    0.000000]  ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
[    0.000000]  ffffffffffffffff ffffffffffffffff ffffffffffffffff 7fffffffffffffff
[    0.000000]  0000000000000000 0000000000000000 0000000000000000 0000000000000000
[    0.000000] Call Trace:
[    0.000000]  [<ffffffff818ffc6d>] numa_set_distance+0x24/0xac
[    0.000000]  [<ffffffff81901581>] numa_emulation+0x236/0x284
[    0.000000]  [<ffffffff81900a0a>] ? x86_acpi_numa_init+0x0/0x1b
[    0.000000]  [<ffffffff8190020a>] initmem_init+0xe8/0x56c
[    0.000000]  [<ffffffff8104fa43>] ? native_apic_mem_read+0x9/0x13
[    0.000000]  [<ffffffff81900a0a>] ? x86_acpi_numa_init+0x0/0x1b
[    0.000000]  [<ffffffff8190068e>] ? amd_numa_init+0x0/0x376
[    0.000000]  [<ffffffff818ffa69>] ? dummy_numa_init+0x0/0x66
[    0.000000]  [<ffffffff818f974f>] ? register_lapic_address+0x75/0x85
[    0.000000]  [<ffffffff818f1b86>] setup_arch+0xa29/0xae9
[    0.000000]  [<ffffffff81456552>] ? printk+0x41/0x47
[    0.000000]  [<ffffffff818eda0d>] start_kernel+0x8a/0x386
[    0.000000]  [<ffffffff818ed2a4>] x86_64_start_reservations+0xb4/0xb8
[    0.000000]  [<ffffffff818ed39a>] x86_64_start_kernel+0xf2/0xf9

That RIP corresponds to this code in numa_alloc_distance():

430		numa_distance_cnt = cnt;
431	
432		/* fill with the default distances */
433		for (i = 0; i < cnt; i++)
434			for (j = 0; j < cnt; j++)
435	===>			numa_distance[i * cnt + j] = i == j ?
436					LOCAL_DISTANCE : REMOTE_DISTANCE;
437		printk(KERN_DEBUG "NUMA: Initialized distance table, cnt=%d\n", cnt);
438	
439		return 0;

We're overflowing the array and it's easy to see why:

        for_each_node_mask(i, nodes_parsed)
                cnt = i;
        size = ++cnt * sizeof(numa_distance[0]);

cnt ends up as the highest node id parsed plus one, i.e. the number of 
nodes, so numa_distance[] needs cnt * cnt entries, but only cnt are 
allocated.  The following patch fixes the issue on top of x86/mm.
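
To make the mismatch concrete, here is a standalone sketch of the index
arithmetic (plain userspace C, not kernel code; the 512-node figure assumes
the numa=fake=128M on 64GB setup described below):

/*
 * Illustration only, not kernel code: with numa=fake=128M on a 64GB
 * machine the highest parsed node id is 511, so cnt becomes 512.
 */
#include <stdio.h>

int main(void)
{
	unsigned long cnt = 512;	/* highest node id + 1 */
	unsigned long allocated = cnt;	/* old code sizes the array as cnt elements */
	unsigned long highest_index = (cnt - 1) * cnt + (cnt - 1);
	unsigned long needed = cnt * cnt;

	printf("elements allocated:    %lu\n", allocated);	/* 512 */
	printf("highest index written: %lu\n", highest_index);	/* 262143 */
	printf("elements needed:       %lu\n", needed);		/* 262144 */
	return 0;
}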

I'm running on a 64GB machine with CONFIG_NODES_SHIFT == 10, so 
numa=fake=128M would result in 512 nodes.  That's going to require 2MB for 
numa_distance (and that's not __initdata).  Before these changes we 
calculated numa_distance() from the PXMs without this additional mapping; 
is there any way to reduce this?  (Admittedly, real NUMA machines with 512 
nodes wouldn't mind sacrificing 2MB, but we didn't need this before.)
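
For reference, a rough sketch of where those numbers come from (again just
illustrative userspace C; the byte total depends on
sizeof(numa_distance[0]), so only the entry count is computed):

/*
 * Back-of-the-envelope node and table-entry counts for numa=fake=128M
 * on a 64GB machine; illustrative only.
 */
#include <stdio.h>

int main(void)
{
	unsigned long long mem_bytes  = 64ULL << 30;		/* 64GB of RAM */
	unsigned long long node_bytes = 128ULL << 20;		/* numa=fake=128M */
	unsigned long long nr_nodes   = mem_bytes / node_bytes;	/* 512 */
	unsigned long long entries    = nr_nodes * nr_nodes;	/* 262144 */

	printf("fake nodes: %llu\n", nr_nodes);
	printf("distance table entries: %llu (x sizeof(numa_distance[0]) bytes)\n",
	       entries);
	return 0;
}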



x86, mm: Fix size of numa_distance array

numa_distance should be sized like the SLIT, an N x N matrix where N is
the highest node id + 1.  This patch fixes the calculation to avoid
overflowing the array on the subsequent iteration.

Signed-off-by: David Rientjes <rientjes@...gle.com>
---
 arch/x86/mm/numa_64.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/arch/x86/mm/numa_64.c b/arch/x86/mm/numa_64.c
index cccc01d..abf0131 100644
--- a/arch/x86/mm/numa_64.c
+++ b/arch/x86/mm/numa_64.c
@@ -414,7 +414,8 @@ static int __init numa_alloc_distance(void)
 
 	for_each_node_mask(i, nodes_parsed)
 		cnt = i;
-	size = ++cnt * sizeof(numa_distance[0]);
+	cnt++;
+	size = cnt * cnt * sizeof(numa_distance[0]);
 
 	phys = memblock_find_in_range(0, (u64)max_pfn_mapped << PAGE_SHIFT,
 				      size, PAGE_SIZE);
--
