Message-Id: <1562300143-11671-2-git-send-email-kernelfans@gmail.com>
Date: Fri, 5 Jul 2019 12:15:43 +0800
From: Pingfan Liu <kernelfans@...il.com>
To: x86@...nel.org
Cc: Pingfan Liu <kernelfans@...il.com>, Michal Hocko <mhocko@...e.com>,
Dave Hansen <dave.hansen@...ux.intel.com>,
Mike Rapoport <rppt@...ux.ibm.com>,
Tony Luck <tony.luck@...el.com>,
Andy Lutomirski <luto@...nel.org>,
Peter Zijlstra <peterz@...radead.org>,
Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
"H. Peter Anvin" <hpa@...or.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Vlastimil Babka <vbabka@...e.cz>,
Oscar Salvador <osalvador@...e.de>,
Pavel Tatashin <pavel.tatashin@...rosoft.com>,
Mel Gorman <mgorman@...hsingularity.net>,
Benjamin Herrenschmidt <benh@...nel.crashing.org>,
Michael Ellerman <mpe@...erman.id.au>,
Stephen Rothwell <sfr@...b.auug.org.au>, Qian Cai <cai@....pw>,
Barret Rhoden <brho@...gle.com>,
Bjorn Helgaas <bhelgaas@...gle.com>,
David Rientjes <rientjes@...gle.com>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org
Subject: [PATCH 2/2] x86/numa: instance all parsed numa node
I hit a bug on an AMD machine when loading a kexec kernel (kexec -l) with the
nr_cpus=4 option. nr_cpus is commonly used to speed up the kdump process, so
this is not a rare case.
It turns out that some pgdats are not instantiated when nr_cpus is specified;
e.g. on x86 they are never initialized by
init_cpu_to_node()->init_memory_less_node(). However, device->numa_node is
still passed as the preferred_nid parameter of __alloc_pages_nodemask(), which
then dereferences the missing pgdat in
ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
Although this bug was detected on x86, it should affect any arch. It triggers
on a machine that has a memory-less NUMA node when nr_cpus prevents that node
from being instantiated and a device on the node then tries to allocate memory
using device->numa_node.
This patch fixes it on x86 by instantiating all parsed NUMA nodes. (For more
detail, please refer to sections I and II.)
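The failure mode and the fix described above can be modelled in userspace. The
following is only a sketch under stated assumptions: struct pgdat, node_data[],
instance_node() and model_node_zonelist() are hypothetical stand-ins for
pg_data_t, NODE_DATA(), init_memory_less_node() and node_zonelist(), and the
CPU-to-node map used with it is illustrative, not taken from the kernel.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 8

/* Hypothetical stand-ins for the kernel's struct zonelist and pg_data_t. */
struct zonelist { int nr_zones; };
struct pgdat { int node_id; struct zonelist zonelist; };

static struct pgdat node_storage[MAX_NODES];
static struct pgdat *node_data[MAX_NODES];	/* NULL: node not instantiated */

/* Forget every node, as at the start of boot. */
static void reset_nodes(void)
{
	for (int nid = 0; nid < MAX_NODES; nid++)
		node_data[nid] = NULL;
}

/* Instantiate one node (models what init_memory_less_node() provides). */
static void instance_node(int nid)
{
	assert(nid >= 0 && nid < MAX_NODES);
	node_storage[nid].node_id = nid;
	node_data[nid] = &node_storage[nid];
}

/* Old behaviour: only nodes owning one of the first nr_cpus CPUs exist. */
static void init_nodes_by_cpu(const int *cpu_to_node, int nr_cpus)
{
	for (int cpu = 0; cpu < nr_cpus; cpu++)
		instance_node(cpu_to_node[cpu]);
}

/* Patched behaviour: instantiate every parsed node that is still missing. */
static void init_all_parsed_nodes(int nr_nodes)
{
	for (int nid = 0; nid < nr_nodes; nid++)
		if (!node_data[nid])
			instance_node(nid);
}

/*
 * Checked lookup: returns NULL for a missing node. The kernel's
 * node_zonelist() has no such check and computes an address from the
 * missing pgdat instead.
 */
static struct zonelist *model_node_zonelist(int nid)
{
	assert(nid >= 0 && nid < MAX_NODES);
	return node_data[nid] ? &node_data[nid]->zonelist : NULL;
}
```

Assuming the four CPUs kept by nr_cpus=4 sit on nodes 0, 4, 1 and 5,
init_nodes_by_cpu() leaves nodes 2, 3, 6 and 7 without a pgdat, so a device
attached to, say, node 2 gets no zonelist. After init_all_parsed_nodes(8) the
same lookup succeeds.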
I. Notes about the crash:
-1 kexec -l with nr_cpus=4
-2 system info
NUMA node0 CPU(s): 0,8,16,24
NUMA node1 CPU(s): 2,10,18,26
NUMA node2 CPU(s): 4,12,20,28
NUMA node3 CPU(s): 6,14,22,30
NUMA node4 CPU(s): 1,9,17,25
NUMA node5 CPU(s): 3,11,19,27
NUMA node6 CPU(s): 5,13,21,29
NUMA node7 CPU(s): 7,15,23,31
-3 panic stack
[...]
[ 5.721547] atomic64_test: passed for x86-64 platform with CX8 and with SSE
[ 5.729187] pcieport 0000:00:01.1: Signaling PME with IRQ 34
[ 5.735187] pcieport 0000:00:01.2: Signaling PME with IRQ 35
[ 5.741168] pcieport 0000:00:01.3: Signaling PME with IRQ 36
[ 5.747189] pcieport 0000:00:07.1: Signaling PME with IRQ 37
[ 5.754061] pcieport 0000:00:08.1: Signaling PME with IRQ 39
[ 5.760727] pcieport 0000:20:07.1: Signaling PME with IRQ 40
[ 5.766955] pcieport 0000:20:08.1: Signaling PME with IRQ 42
[ 5.772742] BUG: unable to handle kernel paging request at 0000000000002088
[ 5.773618] PGD 0 P4D 0
[ 5.773618] Oops: 0000 [#1] SMP NOPTI
[ 5.773618] CPU: 2 PID: 1 Comm: swapper/0 Not tainted 4.20.0-rc1+ #3
[ 5.773618] Hardware name: Dell Inc. PowerEdge R7425/02MJ3T, BIOS 1.4.3 06/29/2018
[ 5.773618] RIP: 0010:__alloc_pages_nodemask+0xe2/0x2a0
[ 5.773618] Code: 00 00 44 89 ea 80 ca 80 41 83 f8 01 44 0f 44 ea 89 da c1 ea 08 83 e2 01 88 54 24 20 48 8b 54 24 08 48 85 d2 0f 85 46 01 00 00 <3b> 77 08 0f 82 3d 01 00 00 48 89 f8 44 89 ea 48 89 e1 44 89 e6 89
[ 5.773618] RSP: 0018:ffffaa600005fb20 EFLAGS: 00010246
[ 5.773618] RAX: 0000000000000000 RBX: 00000000006012c0 RCX: 0000000000000000
[ 5.773618] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000002080
[ 5.773618] RBP: 00000000006012c0 R08: 0000000000000000 R09: 0000000000000002
[ 5.773618] R10: 00000000006080c0 R11: 0000000000000002 R12: 0000000000000000
[ 5.773618] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000002
[ 5.773618] FS: 0000000000000000(0000) GS:ffff8c69afe00000(0000) knlGS:0000000000000000
[ 5.773618] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5.773618] CR2: 0000000000002088 CR3: 000000087e00a000 CR4: 00000000003406e0
[ 5.773618] Call Trace:
[ 5.773618] new_slab+0xa9/0x570
[ 5.773618] ___slab_alloc+0x375/0x540
[ 5.773618] ? pinctrl_bind_pins+0x2b/0x2a0
[ 5.773618] __slab_alloc+0x1c/0x38
[ 5.773618] __kmalloc_node_track_caller+0xc8/0x270
[ 5.773618] ? pinctrl_bind_pins+0x2b/0x2a0
[ 5.773618] devm_kmalloc+0x28/0x60
[ 5.773618] pinctrl_bind_pins+0x2b/0x2a0
[ 5.773618] really_probe+0x73/0x420
[ 5.773618] driver_probe_device+0x115/0x130
[ 5.773618] __driver_attach+0x103/0x110
[ 5.773618] ? driver_probe_device+0x130/0x130
[ 5.773618] bus_for_each_dev+0x67/0xc0
[ 5.773618] ? klist_add_tail+0x3b/0x70
[ 5.773618] bus_add_driver+0x41/0x260
[ 5.773618] ? pcie_port_setup+0x4d/0x4d
[ 5.773618] driver_register+0x5b/0xe0
[ 5.773618] ? pcie_port_setup+0x4d/0x4d
[ 5.773618] do_one_initcall+0x4e/0x1d4
[ 5.773618] ? init_setup+0x25/0x28
[ 5.773618] kernel_init_freeable+0x1c1/0x26e
[ 5.773618] ? loglevel+0x5b/0x5b
[ 5.773618] ? rest_init+0xb0/0xb0
[ 5.773618] kernel_init+0xa/0x110
[ 5.773618] ret_from_fork+0x22/0x40
[ 5.773618] Modules linked in:
[ 5.773618] CR2: 0000000000002088
[ 5.773618] ---[ end trace 1030c9120a03d081 ]---
[...]
-4 other notes about the reproduction of this bug:
On my test machine, this bug is masked by commit 0d76bcc960e6 ("Revert
"ACPI/PCI: Pay attention to device-specific _PXM node values""), but the
hole caused by dev->numa_node can still be reached through other paths.
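The faulting address in the oops can be cross-checked against the marked
instruction bytes <3b> 77 08 and the RDI value in the register dump. The sketch
below decodes only this one encoding (CMP r32, r/m32 with an 8-bit
displacement); the helper names are hypothetical. Reading RDI as a zonelist
address derived from a NULL pgdat is an inference from the call chain, not
something the log states.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of a decoded "cmp r32, [base + disp8]" instruction. */
struct cmp_disp8 {
	unsigned reg;	/* register operand: ModRM.reg */
	unsigned base;	/* base register:    ModRM.rm  */
	int8_t disp;	/* signed 8-bit displacement   */
};

/*
 * Decode opcode 0x3b (CMP r32, r/m32) followed by a ModRM byte with
 * mod == 01 (base register plus disp8). Other encodings are not handled.
 */
static struct cmp_disp8 decode_cmp_disp8(const uint8_t insn[3])
{
	struct cmp_disp8 out;

	assert(insn[0] == 0x3b);		/* CMP r32, r/m32 */
	assert((insn[1] >> 6) == 0x1);		/* mod == 01: disp8 follows */
	assert((insn[1] & 0x7) != 0x4);		/* rm == 100 would need SIB */

	out.reg = (insn[1] >> 3) & 0x7;
	out.base = insn[1] & 0x7;
	out.disp = (int8_t)insn[2];
	return out;
}

/* Effective address of the memory operand for a given base register value. */
static uint64_t effective_address(uint64_t base_value, int8_t disp)
{
	return base_value + (uint64_t)(int64_t)disp;
}
```

Without a REX prefix, reg 6 is ESI and base 7 is RDI, so the instruction is
cmp 0x8(%rdi),%esi. With RDI = 0x2080 the operand address is 0x2088, which
matches CR2.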
II. History
My original attempt [1] deferred the instantiation of offline nodes.
Later, Michal suggested a fix [2] that considers only nodes with memory as
online. Besides fixing this bug, that patch also aimed to exclude memory-less
nodes as candidates when iterating over zones. Unfortunately, that approach
conflicts with the scheduler code, which assumes that a node with CPUs is
online too. You can find the affected code with
"git grep for_each_online_node | grep sched" or in the discussion at the tail
of [3].
Since Michal has no time to continue working on this issue, I am picking it up
again. This patch drops the change to the "node online" definition from [2],
i.e. a node is still considered online if it has either CPUs or memory, and
keeps the remaining main idea of [2]: initializing all parsed nodes on x86.
Other archs will need dedicated work of their own.
[1]: https://patchwork.kernel.org/patch/10738733/
[2]: https://lkml.org/lkml/2019/2/13/253
[3]: https://lore.kernel.org/lkml/20190528182011.GG1658@dhcp22.suse.cz/T/
Signed-off-by: Pingfan Liu <kernelfans@...il.com>
Cc: Michal Hocko <mhocko@...e.com>
Cc: Dave Hansen <dave.hansen@...ux.intel.com>
Cc: Mike Rapoport <rppt@...ux.ibm.com>
Cc: Tony Luck <tony.luck@...el.com>
Cc: Andy Lutomirski <luto@...nel.org>
Cc: Peter Zijlstra <peterz@...radead.org>
Cc: Thomas Gleixner <tglx@...utronix.de>
Cc: Ingo Molnar <mingo@...hat.com>
Cc: Borislav Petkov <bp@...en8.de>
Cc: "H. Peter Anvin" <hpa@...or.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>
Cc: Vlastimil Babka <vbabka@...e.cz>
Cc: Oscar Salvador <osalvador@...e.de>
Cc: Pavel Tatashin <pavel.tatashin@...rosoft.com>
Cc: Mel Gorman <mgorman@...hsingularity.net>
Cc: Benjamin Herrenschmidt <benh@...nel.crashing.org>
Cc: Michael Ellerman <mpe@...erman.id.au>
Cc: Stephen Rothwell <sfr@...b.auug.org.au>
Cc: Qian Cai <cai@....pw>
Cc: Barret Rhoden <brho@...gle.com>
Cc: Bjorn Helgaas <bhelgaas@...gle.com>
Cc: David Rientjes <rientjes@...gle.com>
Cc: linux-mm@...ck.org
Cc: linux-kernel@...r.kernel.org
---
arch/x86/mm/numa.c | 17 ++++++++++++-----
mm/page_alloc.c | 11 ++++++++---
2 files changed, 20 insertions(+), 8 deletions(-)
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index b48d507..5f5b558 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -732,6 +732,15 @@ static void __init init_memory_less_node(int nid)
*/
}
+static void __init init_parsed_rest_node(void)
+{
+ int node;
+
+ for_each_node_mask(node, node_possible_map)
+ if (!node_online(node))
+ init_memory_less_node(node);
+}
+
/*
* Setup early cpu_to_node.
*
@@ -752,6 +761,7 @@ void __init init_cpu_to_node(void)
u16 *cpu_to_apicid = early_per_cpu_ptr(x86_cpu_to_apicid);
BUG_ON(cpu_to_apicid == NULL);
+ init_parsed_rest_node();
for_each_possible_cpu(cpu) {
int node = numa_cpu_node(cpu);
@@ -759,11 +769,8 @@ void __init init_cpu_to_node(void)
if (node == NUMA_NO_NODE)
continue;
- if (!node_online(node)) {
- init_memory_less_node(node);
- node_set_online(nid);
- }
-
+ if (!node_online(node))
+ node_set_online(node);
numa_set_node(cpu, node);
}
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d66bc8a..5d8db00 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5662,10 +5662,15 @@ static void __build_all_zonelists(void *data)
if (self && !node_online(self->node_id)) {
build_zonelists(self);
} else {
- for_each_online_node(nid) {
+ /* In rare cases, node_zonelist() hits an offline node */
+ for_each_node(nid) {
pg_data_t *pgdat = NODE_DATA(nid);
-
- build_zonelists(pgdat);
+ /*
+ * This check can be removed on archs where all
+ * possible nodes are instantiated.
+ */
+ if (pgdat)
+ build_zonelists(pgdat);
}
#ifdef CONFIG_HAVE_MEMORYLESS_NODES
--
2.7.5