Message-ID: <alpine.LFD.2.02.1109071115410.2723@ionos>
Date:	Wed, 7 Sep 2011 11:25:29 +0200 (CEST)
From:	Thomas Gleixner <tglx@...utronix.de>
To:	Frank Rowand <frank.rowand@...sony.com>
cc:	"paulmck@...ux.vnet.ibm.com" <paulmck@...ux.vnet.ibm.com>,
	"Rowand, Frank" <Frank_Rowand@...yusa.com>,
	Peter Zijlstra <peterz@...radead.org>,
	linux-kernel <linux-kernel@...r.kernel.org>,
	linux-rt-users <linux-rt-users@...r.kernel.org>,
	Mike Galbraith <efault@....de>, Ingo Molnar <mingo@...x.de>,
	Venkatesh Pallipadi <venki@...gle.com>,
	Russell King <linux@....linux.org.uk>
Subject: Re: [ANNOUNCE] 3.0.1-rt11

On Tue, 6 Sep 2011, Frank Rowand wrote:

> On 08/26/11 16:55, Paul E. McKenney wrote:
> > On Wed, Aug 24, 2011 at 04:58:49PM -0700, Frank Rowand wrote:
> >> On 08/13/11 03:53, Peter Zijlstra wrote:
> >>>
> >>> Whee, I can skip release announcements too!
> >>>
> >>> So no the subject ain't no mistake its not, 3.0.1-rt11 is there for the
> >>> grabs.
> 
> < snip >
> 
> >> I have a consistent (every boot) hang on boot.  With a few
> >> hacks to get console output, I get the
> >>
> >>   rcu_preempt_state detected stalls on CPUs/tasks
> 
> < snip >
> 
> >> This is an ARM NaviEngine (out of tree, so I also have applied
> >> a series of patches for platform support).
> >>
> >> CONFIG_PREEMPT_RT_FULL is set.  Full config is attached.
> 
> I have also replicated the problem on the ARM RealView (in tree) and
> without the RT patches.
> 
> > 
> > Hmmm...  The last few that I have seen that looked like this were
> > due to my messing up rcutorture so that the RCU-boost testing kthreads
> > ran CPU-bound at real-time priority.
> > 
> > Is it possible that something similar is happening on your system?
> > 
> >                                                         Thanx, Paul
> 
> The problem ended up being caused by the allowed cpus mask being set
> to all possible cpus for the ksoftirqd on the secondary processors.
> So the RCU softirq was never executing on cpu 2.
> 
> I'll test the following patch on 3.1 tomorrow.
> 
> -Frank Rowand
> 
> 
> Symptom: rcu stall
> 
> The problem was that ksoftirqd was woken on the secondary processors before
> the secondary processors were online.  This led to allowed cpus being set
> to all cpus.
> 
>    wake_up_process()
>       try_to_wake_up()
>          select_task_rq()
>             if (... || !cpu_online(cpu))
>                select_fallback_rq(task_cpu(p), p)
>                   ...
>                   /* No more Mr. Nice Guy. */
>                   dest_cpu = cpuset_cpus_allowed_fallback(p)
>                      do_set_cpus_allowed(p, cpu_possible_mask)
>                         #  Thus ksoftirqd can now run on any cpu...

This smells badly like the problem we've seen on x86 before. And
looking at the arm SMP boot code:

asmlinkage void __cpuinit secondary_start_kernel(void)
{
	.....

	/*
	 * Give the platform a chance to do its own initialisation.
	 */
	platform_secondary_init(cpu);

	/*
	 * Enable local interrupts.
	 */
	notify_cpu_starting(cpu);
	local_irq_enable();

Here we enable interrupts, but the CPU is neither online nor active.

	local_fiq_enable();

	/*
	 * Setup the percpu timer for this CPU.
	 */
	percpu_timer_setup();

	calibrate_delay();

	smp_store_cpu_info(cpu);

	/*
	 * OK, now it's safe to let the boot CPU continue.  Wait for
	 * the CPU migration code to notice that the CPU is online
	 * before we continue.
	 */
	set_cpu_online(cpu, true);
	while (!cpu_active(cpu))
		cpu_relax();

That's the same thing x86 does, just with interrupts already enabled,
and therefore it does not help. The softirq is only part of the
problem; the same can happen with worker threads and other CPU-bound
nasties.

	/*
	 * OK, it's off to the idle thread for us
	 */
	cpu_idle();
}

So that wants to be ordered differently. Patch below.

Thanks,

	tglx

Index: linux-2.6/arch/arm/kernel/smp.c
===================================================================
--- linux-2.6.orig/arch/arm/kernel/smp.c
+++ linux-2.6/arch/arm/kernel/smp.c
@@ -305,6 +305,16 @@ asmlinkage void __cpuinit secondary_star
 	 * Enable local interrupts.
 	 */
 	notify_cpu_starting(cpu);
+
+	/*
+	 * OK, now it's safe to let the boot CPU continue.  Wait for
+	 * the CPU migration code to notice that the CPU is online
+	 * before we continue.
+	 */
+	set_cpu_online(cpu, true);
+	while (!cpu_active(cpu))
+		cpu_relax();
+
 	local_irq_enable();
 	local_fiq_enable();
 
@@ -318,15 +328,6 @@ asmlinkage void __cpuinit secondary_star
 	smp_store_cpu_info(cpu);
 
 	/*
-	 * OK, now it's safe to let the boot CPU continue.  Wait for
-	 * the CPU migration code to notice that the CPU is online
-	 * before we continue.
-	 */
-	set_cpu_online(cpu, true);
-	while (!cpu_active(cpu))
-		cpu_relax();
-
-	/*
 	 * OK, it's off to the idle thread for us
 	 */
 	cpu_idle();
