Date:	Thu, 26 Feb 2015 10:13:16 +1100
From:	Stewart Smith <stewart@...ux.vnet.ibm.com>
To:	Michael Ellerman <mpe@...erman.id.au>, linuxppc-dev@...abs.org
Cc:	mingo@...nel.org, tglx@...utronix.de,
	Anton Blanchard <anton@...ba.org>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] powerpc/smp: Wait until secondaries are active & online

Michael Ellerman <mpe@...erman.id.au> writes:

> Anton has a busy ppc64le KVM box where guests sometimes hit the infamous
> "kernel BUG at kernel/smpboot.c:134!" issue during boot:
>
>   BUG_ON(td->cpu != smp_processor_id());
>
> Basically a per CPU hotplug thread scheduled on the wrong CPU. The oops
> output confirms it:
>
>   CPU: 0
>   Comm: watchdog/130
>
> The problem is that we aren't ensuring the CPU active bit is set for the
> secondary before allowing the master to continue on. The master unparks
> the secondary CPU's kthreads and the scheduler looks for a CPU to run
> on. It calls select_task_rq() and realises the suggested CPU is not in
> the cpus_allowed mask. It then ends up in select_fallback_rq(), and
> since the active bit isn't set we choose some other CPU to run on.
>
> This seems to have been introduced by 6acbfb96976f "sched: Fix hotplug
> vs. set_cpus_allowed_ptr()", which changed from setting active before
> online to setting active after online. However that was in turn fixing a
> bug where other code assumed an active CPU was also online, so we can't
> just revert that fix.
>
> The simplest fix is just to spin waiting for both active & online to be
> set. We already have a barrier prior to set_cpu_online() (which also
> sets active), to ensure all other setup is completed before online &
> active are set.
>
> Fixes: 6acbfb96976f ("sched: Fix hotplug vs. set_cpus_allowed_ptr()")
> Signed-off-by: Michael Ellerman <mpe@...erman.id.au>
> Signed-off-by: Anton Blanchard <anton@...ba.org>
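
The race and the fix described in the quoted text can be sketched with a toy
userspace model. This is not the kernel code or the patch itself: the 64-bit
masks and the names model_pick_cpu(), model_secondary_ready() and model_demo()
are hypothetical, and the fallback is simplified to "first active CPU". It only
shows why a thread bound to a CPU that is online but not yet active lands
elsewhere, and what condition the master would spin on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model, not kernel code: one bit per CPU, at most 64 CPUs. */
static bool cpu_in_mask(unsigned cpu, uint64_t mask)
{
    return cpu < 64 && ((mask >> cpu) & 1u);
}

/*
 * Hypothetical stand-in for the scheduler's choice: a per-CPU thread runs
 * on `wanted` only if that CPU is active; otherwise fall back to the first
 * active CPU. Returns -1 if no CPU is active.
 */
static int model_pick_cpu(unsigned wanted, uint64_t active_mask)
{
    if (cpu_in_mask(wanted, active_mask))
        return (int)wanted;
    for (unsigned cpu = 0; cpu < 64; cpu++)
        if (cpu_in_mask(cpu, active_mask))
            return (int)cpu;
    return -1;
}

/* Condition the master waits for in this model: online AND active. */
static bool model_secondary_ready(unsigned cpu, uint64_t online_mask,
                                  uint64_t active_mask)
{
    return cpu_in_mask(cpu, online_mask) && cpu_in_mask(cpu, active_mask);
}

/* Walk through the window between "online" and "active" for CPU 3. */
void model_demo(void)
{
    uint64_t online = (1ULL << 0) | (1ULL << 3); /* CPU 3 is online...   */
    uint64_t active = (1ULL << 0);               /* ...but not yet active */

    /* Waiting on online alone would let the master continue too early. */
    assert(cpu_in_mask(3, online));
    assert(!model_secondary_ready(3, online, active));
    assert(model_pick_cpu(3, active) == 0);      /* wrong CPU: the BUG_ON case */

    active |= 1ULL << 3;                         /* secondary sets active */
    assert(model_secondary_ready(3, online, active));
    assert(model_pick_cpu(3, active) == 3);      /* thread stays on its CPU */
}
```

In this model the master would loop until model_secondary_ready() returns true
before unparking the secondary's threads, which closes the window in which the
fallback can pick CPU 0.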

By building a gcov-enabled skiboot, which makes OPAL_START_CPU a whole
bunch slower (because gcov), I could really *really* reliably reproduce
this. With this patch, I cannot.

Tested-by: Stewart Smith <stewart@...ux.vnet.ibm.com>
