Message-ID: <20110719144451.79bc69ab@kryten>
Date: Tue, 19 Jul 2011 14:44:51 +1000
From: Anton Blanchard <anton@...ba.org>
To: Peter Zijlstra <a.p.zijlstra@...llo.nl>
Cc: mahesh@...ux.vnet.ibm.com, linux-kernel@...r.kernel.org,
linuxppc-dev@...ts.ozlabs.org, mingo@...e.hu,
benh@...nel.crashing.org, torvalds@...ux-foundation.org
Subject: Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982
On Mon, 18 Jul 2011 23:35:56 +0200
Peter Zijlstra <a.p.zijlstra@...llo.nl> wrote:
> Anton, could you test the below two patches on that machine?
>
> It should make things boot again. While I don't have a machine nearly
> big enough to trigger any of this, I tested the new code paths by
> setting FORCE_SD_OVERLAP in /debug/sched_features. Any review of the
> error paths would be much appreciated.
I get an oops in slub code:
NIP [c000000000197d30] .deactivate_slab+0x1b0/0x200
LR [c000000000199d94] .__slab_alloc+0xb4/0x5a0
[c000000000199d94] .__slab_alloc+0xb4/0x5a0
[c00000000019ac98] .kmem_cache_alloc_node_trace+0xa8/0x260
[c00000000007eb70] .build_sched_domains+0xa60/0xb90
[c000000000a16a98] .sched_init_smp+0xa8/0x228
[c000000000a00274] .kernel_init+0x10c/0x1fc
[c00000000002324c] .kernel_thread+0x54/0x70
I'm guessing it's a result of some nodes not having any local memory,
but I'm a bit surprised I'm not seeing it elsewhere.
Investigating.
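If that guess is right, the oops would come from a kmalloc_node()-style
allocation being pointed at a node with no local memory. For illustration
only (this is the generic memoryless-node guard, a hypothetical helper,
not the actual sched/slub fix):

#include <linux/nodemask.h>	/* node_state(), N_HIGH_MEMORY */
#include <linux/slab.h>		/* kmalloc_node() */

/*
 * Hypothetical helper sketching the usual way to cope with memoryless
 * NUMA nodes: if @nid has no memory, drop the node constraint (-1)
 * and let the page allocator pick a nearby node instead.
 */
static void *alloc_on_node(size_t size, gfp_t gfp, int nid)
{
	if (nid != -1 && !node_state(nid, N_HIGH_MEMORY))
		nid = -1;	/* memoryless node: any node will do */

	return kmalloc_node(size, gfp, nid);
}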
> Also, could you send me the node_distance table for that machine? I'm
> curious what the interconnects look like on that thing.
Our node distances are a bit arbitrary (I make them up based on
information given to us in the device tree). In terms of memory we have
a maximum of three levels. To give some rough estimates, on-chip memory
might be 30GB/sec, on-node memory 10-15GB/sec and off-node memory
5GB/sec.
The only thing we tweak with node distances is to make sure we go into
node reclaim before going off node:
/*
* Before going off node we want the VM to try and reclaim from the local
* node. It does this if the remote distance is larger than RECLAIM_DISTANCE.
* With the default REMOTE_DISTANCE of 20 and the default RECLAIM_DISTANCE of
* 20, we never reclaim and go off node straight away.
*
* To fix this we choose a smaller value of RECLAIM_DISTANCE.
*/
#define RECLAIM_DISTANCE 10
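For context, the check RECLAIM_DISTANCE feeds into sits in the page
allocator's zonelist setup. Roughly (a paraphrase of the mm/page_alloc.c
logic, not the verbatim code):

#include <linux/topology.h>	/* node_distance(), RECLAIM_DISTANCE */
#include <linux/swap.h>		/* zone_reclaim_mode */

/*
 * Paraphrase of the test in build_zonelists(): if any remote node is
 * further away than RECLAIM_DISTANCE, enable zone_reclaim_mode so the
 * VM reclaims locally before handing out off-node pages.  With
 * RECLAIM_DISTANCE at 10 and our remote distances of 20/40, this
 * always fires on these machines.
 */
static void maybe_enable_zone_reclaim(int local_node, int node)
{
	if (node_distance(local_node, node) > RECLAIM_DISTANCE)
		zone_reclaim_mode = 1;
}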
Anton
node distances:
node 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
0: 10 20 20 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 0 0 0 0
1: 20 10 20 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 0 0 0 0
2: 20 20 10 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 0 0 0 0
3: 20 20 20 10 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 0 0 0 0
4: 40 40 40 40 10 20 20 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 0 0 0 0
5: 40 40 40 40 20 10 20 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 0 0 0 0
6: 40 40 40 40 20 20 10 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 0 0 0 0
7: 40 40 40 40 20 20 20 10 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 0 0 0 0
8: 40 40 40 40 40 40 40 40 10 20 20 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 0 0 0 0
9: 40 40 40 40 40 40 40 40 20 10 20 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 0 0 0 0
10: 40 40 40 40 40 40 40 40 20 20 10 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 0 0 0 0
11: 40 40 40 40 40 40 40 40 20 20 20 10 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 0 0 0 0
12: 40 40 40 40 40 40 40 40 40 40 40 40 10 20 20 20 40 40 40 40 40 40 40 40 40 40 40 40 0 0 0 0
13: 40 40 40 40 40 40 40 40 40 40 40 40 20 10 20 20 40 40 40 40 40 40 40 40 40 40 40 40 0 0 0 0
14: 40 40 40 40 40 40 40 40 40 40 40 40 20 20 10 20 40 40 40 40 40 40 40 40 40 40 40 40 0 0 0 0
15: 40 40 40 40 40 40 40 40 40 40 40 40 20 20 20 10 40 40 40 40 40 40 40 40 40 40 40 40 0 0 0 0
16: 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 10 20 20 20 40 40 40 40 40 40 40 40 0 0 0 0
17: 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 10 20 20 40 40 40 40 40 40 40 40 0 0 0 0
18: 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 20 10 20 40 40 40 40 40 40 40 40 0 0 0 0
19: 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 20 20 10 40 40 40 40 40 40 40 40 0 0 0 0
20: 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 10 20 20 20 40 40 40 40 0 0 0 0
21: 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 10 20 20 40 40 40 40 0 0 0 0
22: 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 20 10 20 40 40 40 40 0 0 0 0
23: 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 20 20 10 40 40 40 40 0 0 0 0
24: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
25: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
26: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
27: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
28: 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 10 20 20 20 0 0 0 0
29: 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 10 20 20 0 0 0 0
30: 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 20 10 20 0 0 0 0
31: 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 20 20 10 0 0 0 0
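(For reference, each row above is just the contents of
/sys/devices/system/node/nodeN/distance; numactl --hardware prints the
same matrix. A minimal userspace sketch that dumps it, assuming the
standard sysfs layout:)

#include <stdio.h>

/*
 * Print the kernel's node distance matrix by reading
 * /sys/devices/system/node/node<N>/distance for each node that is
 * present; absent nodes have no sysfs directory and are skipped.
 */
int main(void)
{
	char path[64], line[1024];
	int node;

	for (node = 0; node < 1024; node++) {
		FILE *f;

		snprintf(path, sizeof(path),
			 "/sys/devices/system/node/node%d/distance", node);
		f = fopen(path, "r");
		if (!f)
			continue;	/* node not present */
		if (fgets(line, sizeof(line), f))
			printf("%3d: %s", node, line);
		fclose(f);
	}
	return 0;
}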