Message-Id: <20110415203928.1303.A69D9226@jp.fujitsu.com>
Date: Fri, 15 Apr 2011 20:39:01 +0900 (JST)
From: KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>
To: Tejun Heo <tj@...nel.org>
Cc: kosaki.motohiro@...fujitsu.com,
LKML <linux-kernel@...r.kernel.org>,
Yinghai Lu <yinghai@...nel.org>,
Brian Gerst <brgerst@...il.com>,
Cyrill Gorcunov <gorcunov@...il.com>,
Shaohui Zheng <shaohui.zheng@...el.com>,
David Rientjes <rientjes@...gle.com>,
Ingo Molnar <mingo@...e.hu>,
"H. Peter Anvin" <hpa@...ux.intel.com>
Subject: [PATCH v3] x86-64, NUMA: fix fakenuma boot failure
Hello,
> Hello,
>
> On Thu, Apr 14, 2011 at 09:51:00AM +0900, KOSAKI Motohiro wrote:
> > hmm... My carbon copy is not corrupted. Maybe crappy intermediate
> > server override it ?
>
> Sorry about that. Problem was on my side.
>
> The patch itself looks good to me now, so,
>
> Acked-by: Tejun Heo <tj@...nel.org>
>
> but I have some nitpicky comments and it would be nice if you can
> respin the patch with the suggested updates.
All of them reflected; updated patch below.
From 38f7fa6d48f2025bf620f1b8b27ccc7e0698d653 Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>
Date: Wed, 13 Apr 2011 15:47:12 +0900
Subject: [PATCH] x86-64, NUMA: fix fakenuma boot failure
Currently, the numa=fake boot parameter is broken. If it is used, the
kernel may panic due to a divide by zero error, depending on the CPU
configuration.

Call Trace:
[<ffffffff8104ad4c>] find_busiest_group+0x38c/0xd30
[<ffffffff81086aff>] ? local_clock+0x6f/0x80
[<ffffffff81050533>] load_balance+0xa3/0x600
[<ffffffff81050f53>] idle_balance+0xf3/0x180
[<ffffffff81550092>] schedule+0x722/0x7d0
[<ffffffff81550538>] ? wait_for_common+0x128/0x190
[<ffffffff81550a65>] schedule_timeout+0x265/0x320
[<ffffffff81095815>] ? lock_release_holdtime+0x35/0x1a0
[<ffffffff81550538>] ? wait_for_common+0x128/0x190
[<ffffffff8109bb6c>] ? __lock_release+0x9c/0x1d0
[<ffffffff815534e0>] ? _raw_spin_unlock_irq+0x30/0x40
[<ffffffff815534e0>] ? _raw_spin_unlock_irq+0x30/0x40
[<ffffffff81550540>] wait_for_common+0x130/0x190
[<ffffffff81051920>] ? try_to_wake_up+0x510/0x510
[<ffffffff8155067d>] wait_for_completion+0x1d/0x20
[<ffffffff8107f36c>] kthread_create_on_node+0xac/0x150
[<ffffffff81077bb0>] ? process_scheduled_works+0x40/0x40
[<ffffffff8155045f>] ? wait_for_common+0x4f/0x190
[<ffffffff8107a283>] __alloc_workqueue_key+0x1a3/0x590
[<ffffffff81e0cce2>] cpuset_init_smp+0x6b/0x7b
[<ffffffff81df3d07>] kernel_init+0xc3/0x182
[<ffffffff8155d5e4>] kernel_thread_helper+0x4/0x10
[<ffffffff81553cd4>] ? retint_restore_args+0x13/0x13
[<ffffffff81df3c44>] ? start_kernel+0x400/0x400
[<ffffffff8155d5e0>] ? gs_change+0x13/0x13
The divide by zero is caused by the following line (i.e. group->cpu_power == 0):
kernel/sched_fair.c::update_sg_lb_stats()
/* Adjust by relative CPU power of the group */
sgs->avg_load = (sgs->group_load * SCHED_LOAD_SCALE) / group->cpu_power;
This is a regression introduced by commit e23bba6044 (x86-64, NUMA: Unify
emulated distance mapping), which changed the cpu -> node mapping in the
process of dropping fake_physnodes().
old) all cpus are assigned to node 0
now) cpus are assigned round robin
     (the logic is implemented by numa_init_array(); a simplified sketch follows)
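The following user-space sketch (not kernel code; the 8-cpu / 2-node
topology is a made-up example) shows roughly what the round-robin
mapping does:

#include <stdio.h>

#define NR_CPUS  8	/* e.g. 4 cores x 2 HT siblings (hypothetical) */
#define NR_NODES 2	/* e.g. two emulated nodes from numa=fake */

int main(void)
{
	int cpu_to_node[NR_CPUS];
	int cpu;

	/* Round-robin assignment, as numa_init_array() now effectively does */
	for (cpu = 0; cpu < NR_CPUS; cpu++)
		cpu_to_node[cpu] = cpu % NR_NODES;

	/*
	 * If cpus N and N+4 happen to be HT siblings of one core, they
	 * now land on different nodes, which the scheduler does not expect.
	 */
	for (cpu = 0; cpu < NR_CPUS; cpu++)
		printf("cpu %d -> node %d\n", cpu, cpu_to_node[cpu]);

	return 0;
}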
Note: this change happens only if the system has neither an ACPI SRAT
table nor AMD northbridge NUMA information.
Why doesn't the round-robin assignment work? Because
init_numa_sched_groups_power() assumes that all logical cpus in the same
physical cpu share the same node (and therefore only accounts for
group_first_cpu()), and the simple round robin breaks that assumption.
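For concreteness, here is a hypothetical user-space sketch of the
arithmetic in update_sg_lb_stats(); the numbers are invented, only the
division mirrors the kernel line quoted above:

#include <stdio.h>

#define SCHED_LOAD_SCALE 1024UL		/* scheduler fixed-point scale */

int main(void)
{
	unsigned long group_load = 2048;	/* made-up group load */
	unsigned long cpu_power  = 0;		/* never accumulated for this group */

	/*
	 * Mirrors:
	 *   sgs->avg_load = (sgs->group_load * SCHED_LOAD_SCALE) / group->cpu_power;
	 * With cpu_power == 0 the unguarded divide traps, which is the
	 * panic shown in the call trace above.
	 */
	if (cpu_power == 0)
		printf("group->cpu_power == 0 -> divide by zero in the kernel\n");
	else
		printf("avg_load = %lu\n",
		       (group_load * SCHED_LOAD_SCALE) / cpu_power);

	return 0;
}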
Thus, this patch reassigns the node id when buggy firmware or NUMA
emulation produces a wrong cpu-to-node map, enforcing that all logical
cpus in the same physical cpu share the same node.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>
Acked-by: Tejun Heo <tj@...nel.org>
Cc: Yinghai Lu <yinghai@...nel.org>
Cc: Brian Gerst <brgerst@...il.com>
Cc: Cyrill Gorcunov <gorcunov@...il.com>
Cc: Shaohui Zheng <shaohui.zheng@...el.com>
Cc: David Rientjes <rientjes@...gle.com>
Cc: Ingo Molnar <mingo@...e.hu>
Cc: H. Peter Anvin <hpa@...ux.intel.com>
---
arch/x86/kernel/smpboot.c | 23 +++++++++++++++++++++++
1 files changed, 23 insertions(+), 0 deletions(-)
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index c2871d3..8ed8908 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -312,6 +312,26 @@ void __cpuinit smp_store_cpu_info(int id)
identify_secondary_cpu(c);
}
+static void __cpuinit check_cpu_siblings_on_same_node(int cpu1, int cpu2)
+{
+ int node1 = early_cpu_to_node(cpu1);
+ int node2 = early_cpu_to_node(cpu2);
+
+ /*
+ * Our CPU scheduler assumes all logical cpus in the same physical cpu
+ * share the same node. But, buggy ACPI or NUMA emulation might assign
+ * them to different nodes. Fix it.
+ */
+ if (node1 != node2) {
+ pr_warning("CPU %d in node %d and CPU %d in node %d are in the same physical CPU. forcing same node %d\n",
+ cpu1, node1, cpu2, node2, node2);
+
+ numa_remove_cpu(cpu1);
+ numa_set_node(cpu1, node2);
+ numa_add_cpu(cpu1);
+ }
+}
+
static void __cpuinit link_thread_siblings(int cpu1, int cpu2)
{
cpumask_set_cpu(cpu1, cpu_sibling_mask(cpu2));
@@ -320,6 +340,7 @@ static void __cpuinit link_thread_siblings(int cpu1, int cpu2)
cpumask_set_cpu(cpu2, cpu_core_mask(cpu1));
cpumask_set_cpu(cpu1, cpu_llc_shared_mask(cpu2));
cpumask_set_cpu(cpu2, cpu_llc_shared_mask(cpu1));
+ check_cpu_siblings_on_same_node(cpu1, cpu2);
}
@@ -361,10 +382,12 @@ void __cpuinit set_cpu_sibling_map(int cpu)
per_cpu(cpu_llc_id, cpu) == per_cpu(cpu_llc_id, i)) {
cpumask_set_cpu(i, cpu_llc_shared_mask(cpu));
cpumask_set_cpu(cpu, cpu_llc_shared_mask(i));
+ check_cpu_siblings_on_same_node(cpu, i);
}
if (c->phys_proc_id == cpu_data(i).phys_proc_id) {
cpumask_set_cpu(i, cpu_core_mask(cpu));
cpumask_set_cpu(cpu, cpu_core_mask(i));
+ check_cpu_siblings_on_same_node(cpu, i);
/*
* Does this new cpu bringup a new core?
*/
--
1.7.3.1