Message-ID: <55D302CA.9010703@cn.fujitsu.com>
Date: Tue, 18 Aug 2015 18:02:50 +0800
From: Tang Chen <tangchen@...fujitsu.com>
To: Jiang Liu <jiang.liu@...ux.intel.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Mel Gorman <mgorman@...e.de>,
David Rientjes <rientjes@...gle.com>,
Mike Galbraith <umgwanakikbuti@...il.com>,
Peter Zijlstra <peterz@...radead.org>,
"Rafael J . Wysocki" <rafael.j.wysocki@...el.com>,
Tejun Heo <tj@...nel.org>
CC: Tony Luck <tony.luck@...el.com>, <linux-mm@...ck.org>,
<linux-hotplug@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
<x86@...nel.org>, <tangchen@...fujitsu.com>
Subject: Re: [Patch V3 0/9] Enable memoryless node support for x86
On 08/17/2015 11:18 AM, Jiang Liu wrote:
> This is the third version to enable memoryless node support on x86
> platforms. The previous version (https://lkml.org/lkml/2014/7/11/75)
> blindly replaces numa_node_id()/cpu_to_node() with numa_mem_id()/
> cpu_to_mem(). That's not the right solution as pointed out by Tejun
> and Peter due to:
> 1) We shouldn't shift the burden to normal slab users.
> 2) Details of memoryless node should be hidden in arch and mm code
> as much as possible.
>
> After digging into more code and documentation, we found the rules to
> deal with memoryless node should be:
> 1) Arch code should online corresponding NUMA node before onlining any
> CPU or memory, otherwise it may cause invalid memory access when
> accessing NODE_DATA(nid).
> 2) For normal memory allocations without __GFP_THISNODE setting in the
> gfp_flags, we should prefer numa_node_id()/cpu_to_node() instead of
> numa_mem_id()/cpu_to_mem() because the latter loses hardware topology
> information as pointed out by Tejun:
> A - B - X - C - D
> Where X is the memless node. numa_mem_id() on X would return
> either B or C, right? If B or C can't satisfy the allocation,
> the allocator would fallback to A from B and D for C, both of
> which aren't optimal. It should first fall back to C or B
> respectively, which the allocator can't do anymore because the
> information is lost when the caller side performs numa_mem_id().
Hi Liu,
BTW, how is this A - B - X - C - D problem solved?
I don't quite follow this.
I cannot tell the difference between numa_node_id()/cpu_to_node() and
numa_mem_id()/cpu_to_mem() on this point. Even with hardware topology
info, how could it avoid this problem?
Isn't it still possible to fall back to A from B, and to D from C?
Thanks.
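
For reference, the quoted A - B - X - C - D case can be pictured with a
small userspace sketch. This is not kernel code: it assumes the fallback
order is simply "nearest node with memory first", using hop count on the
line as the distance, and the names has_memory, hops() and
build_fallback() are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical linear topology from the example: A - B - X - C - D. */
enum { NODE_A, NODE_B, NODE_X, NODE_C, NODE_D, NR_NODES };

/* X is the memoryless node. */
static const bool has_memory[NR_NODES] = { true, true, false, true, true };

/* Hop count between two nodes on the line; stands in for node distance. */
static int hops(int a, int b)
{
    return abs(a - b);
}

/*
 * Fill order[] with the nodes that have memory, nearest to 'from' first.
 * Ties keep the lower node id first. Returns the number of entries.
 */
static int build_fallback(int from, int order[NR_NODES])
{
    int n = 0;

    assert(from >= 0 && from < NR_NODES);
    for (int i = 0; i < NR_NODES; i++)
        if (has_memory[i])
            order[n++] = i;

    /* Stable insertion sort by distance from 'from'. */
    for (int i = 1; i < n; i++) {
        int node = order[i];
        int j = i - 1;

        while (j >= 0 && hops(from, order[j]) > hops(from, node)) {
            order[j + 1] = order[j];
            j--;
        }
        order[j + 1] = node;
    }
    return n;
}
```

Under these assumptions, a list built from X comes out as B, C, A, D,
while a list built from B comes out as B, A, C, D. If the caller has
already translated X into B, the second choice becomes A instead of C,
which appears to be the loss of information described in the quote.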
> 3) For memory allocation with __GFP_THISNODE setting in gfp_flags,
> numa_node_id()/cpu_to_node() should be used if caller only wants to
> allocate from local memory, otherwise numa_mem_id()/cpu_to_mem()
> should be used if caller wants to allocate from the nearest node
> with memory.
> 4) numa_mem_id()/cpu_to_mem() should be used if caller wants to check
> whether a page is allocated from the nearest node.
>
> Based on the above rules, this patch set does the following:
> 1) Patch 1 is a bugfix to resolve a crash caused by socket hot-addition.
> 2) Patch 2 replaces numa_mem_id() with numa_node_id() when __GFP_THISNODE
> isn't set in gfp_flags.
> 3) Patches 3-6 replace numa_node_id()/cpu_to_node() with numa_mem_id()/
> cpu_to_mem() if caller wants to allocate from local node only.
> 4) Patches 7-9 enable support of memoryless node on x86.
>
> With this patch set applied, on a system with two sockets enabled at boot,
> one with memory and the other without memory, we got the following NUMA
> topology after boot:
> root@...04sdp:~# numactl --hardware
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
> node 0 size: 15940 MB
> node 0 free: 15397 MB
> node 1 cpus: 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59
> node 1 size: 0 MB
> node 1 free: 0 MB
> node distances:
> node 0 1
> 0: 10 21
> 1: 21 10
>
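
The two-socket output above also gives a concrete case for rules 3) and
4). The following is a minimal userspace sketch, not kernel code: the
distance table, memory sizes and CPU ranges are copied from the numactl
output, while sketch_cpu_to_node(), sketch_cpu_to_mem() and
sketch_alloc_thisnode() are hypothetical stand-ins that only model the
behaviour described in the rules.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_NODES 2

/* Distances and memory sizes taken from the two-socket numactl output. */
static const int node_distance_tbl[NR_NODES][NR_NODES] = {
    { 10, 21 },
    { 21, 10 },
};
static const int node_size_mb[NR_NODES] = { 15940, 0 };

/* Stand-in for cpu_to_node(): CPUs 15-29 and 45-59 sit on node 1. */
static int sketch_cpu_to_node(int cpu)
{
    return ((cpu >= 15 && cpu <= 29) || (cpu >= 45 && cpu <= 59)) ? 1 : 0;
}

/* Stand-in for cpu_to_mem(): nearest node with memory, or -1 if none. */
static int sketch_cpu_to_mem(int cpu)
{
    int home = sketch_cpu_to_node(cpu);
    int best = -1;

    for (int nid = 0; nid < NR_NODES; nid++) {
        if (node_size_mb[nid] == 0)
            continue; /* skip memoryless nodes */
        if (best < 0 ||
            node_distance_tbl[home][nid] < node_distance_tbl[home][best])
            best = nid;
    }
    return best;
}

/* Models a __GFP_THISNODE-style request: only the given node, no fallback. */
static bool sketch_alloc_thisnode(int nid)
{
    assert(nid >= 0 && nid < NR_NODES);
    return node_size_mb[nid] > 0;
}
```

In this model CPU 15 has node id 1 but memory id 0, so a strict
allocation against its node id fails while one against its memory id
succeeds. For a CPU on node 0 the two ids are the same.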
> After hot-adding the third socket without memory, we got:
> root@...04sdp:~# numactl --hardware
> available: 3 nodes (0-2)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
> node 0 size: 15940 MB
> node 0 free: 15142 MB
> node 1 cpus: 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59
> node 1 size: 0 MB
> node 1 free: 0 MB
> node 2 cpus:
> node 2 size: 0 MB
> node 2 free: 0 MB
> node distances:
> node 0 1 2
> 0: 10 21 21
> 1: 21 10 21
> 2: 21 21 10
>
> Jiang Liu (9):
> x86, NUMA, ACPI: Online node earlier when doing CPU hot-addition
> kernel/profile.c: Replace cpu_to_mem() with cpu_to_node()
> sgi-xp: Replace cpu_to_node() with cpu_to_mem() to support memoryless
> node
> openvswitch: Replace cpu_to_node() with cpu_to_mem() to support
> memoryless node
> i40e: Use numa_mem_id() to better support memoryless node
> i40evf: Use numa_mem_id() to better support memoryless node
> x86, numa: Kill useless code to improve code readability
> mm: Update _mem_id_[] for every possible CPU when memory
> configuration changes
> mm, x86: Enable memoryless node support to better support CPU/memory
> hotplug
>
> arch/x86/Kconfig | 3 ++
> arch/x86/kernel/acpi/boot.c | 9 +++-
> arch/x86/kernel/smpboot.c | 2 +
> arch/x86/mm/numa.c | 59 +++++++++++++++----------
> drivers/misc/sgi-xp/xpc_uv.c | 2 +-
> drivers/net/ethernet/intel/i40e/i40e_txrx.c | 2 +-
> drivers/net/ethernet/intel/i40evf/i40e_txrx.c | 2 +-
> kernel/profile.c | 2 +-
> mm/page_alloc.c | 10 ++---
> net/openvswitch/flow.c | 2 +-
> 10 files changed, 59 insertions(+), 34 deletions(-)
>