Message-ID: <4D66EA0A.1050405@kernel.org>
Date:	Thu, 24 Feb 2011 15:30:18 -0800
From:	Yinghai Lu <yinghai@...nel.org>
To:	David Rientjes <rientjes@...gle.com>
CC:	Tejun Heo <tj@...nel.org>, Ingo Molnar <mingo@...e.hu>,
	tglx@...utronix.de, "H. Peter Anvin" <hpa@...or.com>,
	linux-kernel@...r.kernel.org
Subject: Re: [patch] x86, mm: Fix size of numa_distance array
On 02/24/2011 02:46 PM, David Rientjes wrote:
> On Thu, 24 Feb 2011, Tejun Heo wrote:
> 
>>>> DavidR reported that x86/mm broke his numa emulation with 128M etc.
>>>
>>> That regression needs to be fixed. Tejun, do you know about that bug?
>>
>> Nope, David said he was gonna look into what happened but never got
>> back.  David?
>>
> 
> I merged x86/mm with Linus' tree, it booted fine without numa=fake but 
> then panics with numa=fake=128M (and could only be captured by 
> earlyprintk):
> 
> [    0.000000] BUG: unable to handle kernel paging request at ffff88007ff00000
> [    0.000000] IP: [<ffffffff818ffc15>] numa_alloc_distance+0x146/0x17a
> [    0.000000] PGD 1804063 PUD 7fefd067 PMD 7fefe067 PTE 0
> [    0.000000] Oops: 0002 [#1] SMP 
> [    0.000000] last sysfs file: 
> [    0.000000] CPU 0 
> [    0.000000] Modules linked in:
> [    0.000000] 
> [    0.000000] Pid: 0, comm: swapper Not tainted 2.6.38-x86-mm #1
> [    0.000000] RIP: 0010:[<ffffffff818ffc15>]  [<ffffffff818ffc15>] numa_alloc_distance+0x146/0x17a
> [    0.000000] RSP: 0000:ffffffff81801d28  EFLAGS: 00010006
> [    0.000000] RAX: 0000000000000009 RBX: 00000000000001ff RCX: 0000000000000ff8
> [    0.000000] RDX: 0000000000000008 RSI: 000000007feff014 RDI: ffffffff8199ed0a
> [    0.000000] RBP: ffffffff81801dc8 R08: 0000000000001000 R09: 000000008199ed0a
> [    0.000000] R10: 000000007feff004 R11: 000000007fefd000 R12: 00000000000001ff
> [    0.000000] R13: ffff88007feff000 R14: ffffffff81801d28 R15: ffffffff819b7ca0
> [    0.000000] FS:  0000000000000000(0000) GS:ffffffff818da000(0000) knlGS:0000000000000000
> [    0.000000] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    0.000000] CR2: ffff88007ff00000 CR3: 0000000001803000 CR4: 00000000000000b0
> [    0.000000] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [    0.000000] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [    0.000000] Process swapper (pid: 0, threadinfo ffffffff81800000, task ffffffff8180b020)
> [    0.000000] Stack:
> [    0.000000]  ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
> [    0.000000]  ffffffffffffffff ffffffffffffffff ffffffffffffffff 7fffffffffffffff
> [    0.000000]  0000000000000000 0000000000000000 0000000000000000 0000000000000000
> [    0.000000] Call Trace:
> [    0.000000]  [<ffffffff818ffc6d>] numa_set_distance+0x24/0xac
> [    0.000000]  [<ffffffff81901581>] numa_emulation+0x236/0x284
> [    0.000000]  [<ffffffff81900a0a>] ? x86_acpi_numa_init+0x0/0x1b
> [    0.000000]  [<ffffffff8190020a>] initmem_init+0xe8/0x56c
> [    0.000000]  [<ffffffff8104fa43>] ? native_apic_mem_read+0x9/0x13
> [    0.000000]  [<ffffffff81900a0a>] ? x86_acpi_numa_init+0x0/0x1b
> [    0.000000]  [<ffffffff8190068e>] ? amd_numa_init+0x0/0x376
> [    0.000000]  [<ffffffff818ffa69>] ? dummy_numa_init+0x0/0x66
> [    0.000000]  [<ffffffff818f974f>] ? register_lapic_address+0x75/0x85
> [    0.000000]  [<ffffffff818f1b86>] setup_arch+0xa29/0xae9
> [    0.000000]  [<ffffffff81456552>] ? printk+0x41/0x47
> [    0.000000]  [<ffffffff818eda0d>] start_kernel+0x8a/0x386
> [    0.000000]  [<ffffffff818ed2a4>] x86_64_start_reservations+0xb4/0xb8
> [    0.000000]  [<ffffffff818ed39a>] x86_64_start_kernel+0xf2/0xf9
> 
> That's this:
> 
> 	numa_distance_cnt = cnt;
> 
> 	/* fill with the default distances */
> 	for (i = 0; i < cnt; i++)
> 		for (j = 0; j < cnt; j++)
> ===>			numa_distance[i * cnt + j] = i == j ?
> 				LOCAL_DISTANCE : REMOTE_DISTANCE;
> 	printk(KERN_DEBUG "NUMA: Initialized distance table, cnt=%d\n", cnt);
> 
> 	return 0;
> 
> We're overflowing the array and it's easy to see why:
> 
>         for_each_node_mask(i, nodes_parsed)
>                 cnt = i;
>         size = ++cnt * sizeof(numa_distance[0]);
> 
> cnt is the highest node id parsed, so numa_distance[] must be cnt * cnt.  
> The following patch fixes the issue on top of x86/mm.
> 
> I'm running on a 64GB machine with CONFIG_NODES_SHIFT == 10, so 
> numa=fake=128M would result in 512 nodes.  That's going to require 2MB for 
> numa_distance (and that's not __initdata).  Before these changes, we 
> calculated numa_distance() using pxms without this additional mapping, is 
> there any way to reduce this?  (Admittedly real NUMA machines with 512 
> nodes wouldn't mind sacrificing 2MB, but we didn't need this before.)
> 
> 
> 
> x86, mm: Fix size of numa_distance array
> 
> numa_distance should be sized like the SLIT, an NxN matrix where N is the
> highest node id.  This patch fixes the calculation to avoid overflowing
> the array on the subsequent iteration.
> 
> Signed-off-by: David Rientjes <rientjes@...gle.com>
> ---
>  arch/x86/mm/numa_64.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/arch/x86/mm/numa_64.c b/arch/x86/mm/numa_64.c
> index cccc01d..abf0131 100644
> --- a/arch/x86/mm/numa_64.c
> +++ b/arch/x86/mm/numa_64.c
> @@ -414,7 +414,7 @@ static int __init numa_alloc_distance(void)
>  
>  	for_each_node_mask(i, nodes_parsed)
>  		cnt = i;
> -	size = ++cnt * sizeof(numa_distance[0]);
> +	size = cnt * cnt * sizeof(numa_distance[0]);
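That is still one short: cnt holds the highest node id at this point, so
it has to be incremented before it is squared.  It should be: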
+	cnt++;
+	size = cnt * cnt * sizeof(numa_distance[0]);
>  
>  	phys = memblock_find_in_range(0, (u64)max_pfn_mapped << PAGE_SHIFT,
>  				      size, PAGE_SIZE);
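
To make the off-by-one concrete, here is a minimal userspace sketch of the
sizing arithmetic.  It is not the kernel code: node_count(),
distance_table_size() and last_entry_offset() are hypothetical helper names
used only for illustration, and the element size is passed in as a
parameter rather than taken from numa_distance[0].

```c
#include <stddef.h>

/*
 * Hypothetical helpers for illustration only, not kernel code.
 * Node ids run from 0 to highest_id inclusive, so the table needs
 * highest_id + 1 rows and columns.
 */
size_t node_count(size_t highest_id)
{
	return highest_id + 1;
}

/* Bytes needed for a cnt x cnt table of elem_size-byte entries. */
size_t distance_table_size(size_t highest_id, size_t elem_size)
{
	size_t cnt = node_count(highest_id);

	return cnt * cnt * elem_size;
}

/*
 * Byte offset of the last entry written by the fill loop,
 * i.e. numa_distance[(cnt - 1) * cnt + (cnt - 1)].
 */
size_t last_entry_offset(size_t highest_id, size_t elem_size)
{
	size_t cnt = node_count(highest_id);

	return ((cnt - 1) * cnt + (cnt - 1)) * elem_size;
}
```

With 512 nodes the highest id is 511.  Squaring 511 without the increment
gives room for 261121 entries, but the fill loop writes up to index 262143,
so the table has to hold 512 * 512 = 262144 entries.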
