[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <m3fv9tegab.fsf@oc8180480414.ibm.com>
Date: Thu, 26 Feb 2015 10:13:16 +1100
From: Stewart Smith <stewart@...ux.vnet.ibm.com>
To: Michael Ellerman <mpe@...erman.id.au>, linuxppc-dev@...abs.org
Cc: mingo@...nel.org, tglx@...utronix.de,
Anton Blanchard <anton@...ba.org>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] powerpc/smp: Wait until secondaries are active & online
Michael Ellerman <mpe@...erman.id.au> writes:
> Anton has a busy ppc64le KVM box where guests sometimes hit the infamous
> "kernel BUG at kernel/smpboot.c:134!" issue during boot:
>
> BUG_ON(td->cpu != smp_processor_id());
>
> Basically a per CPU hotplug thread scheduled on the wrong CPU. The oops
> output confirms it:
>
> CPU: 0
> Comm: watchdog/130
>
> The problem is that we aren't ensuring the CPU active bit is set for the
> secondary before allowing the master to continue on. The master unparks
> the secondary CPU's kthreads and the scheduler looks for a CPU to run
> on. It calls select_task_rq() and realises the suggested CPU is not in
> the cpus_allowed mask. It then ends up in select_fallback_rq(), and
> since the active bit isnt't set we choose some other CPU to run on.
>
> This seems to have been introduced by 6acbfb96976f "sched: Fix hotplug
> vs. set_cpus_allowed_ptr()", which changed from setting active before
> online to setting active after online. However that was in turn fixing a
> bug where other code assumed an active CPU was also online, so we can't
> just revert that fix.
>
> The simplest fix is just to spin waiting for both active & online to be
> set. We already have a barrier prior to set_cpu_online() (which also
> sets active), to ensure all other setup is completed before online &
> active are set.
>
> Fixes: 6acbfb96976f ("sched: Fix hotplug vs. set_cpus_allowed_ptr()")
> Signed-off-by: Michael Ellerman <mpe@...erman.id.au>
> Signed-off-by: Anton Blanchard <anton@...ba.org>
By building a gcov enabled skiboot, which makes OPAL_START_CPU a whole
bunch slower (because gcov), I could really *really* reliably reproduce
this. With this patch, I cannot.
Tested-by: Stewart Smith <stewart@...ux.vnet.ibm.com>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists