linux-kernel - Re: [PATCH tip/core/rcu 1/2] rcu: Parallelize and economize NOCB kthread wakeups

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <20140822215720.GA21092@linux.vnet.ibm.com>
Date:	Fri, 22 Aug 2014 14:57:20 -0700
From:	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
To:	Amit Shah <amit.shah@...hat.com>
Cc:	linux-kernel@...r.kernel.org, riel@...hat.com, mingo@...nel.org,
	laijs@...fujitsu.com, dipankar@...ibm.com,
	akpm@...ux-foundation.org, mathieu.desnoyers@...icios.com,
	josh@...htriplett.org, tglx@...utronix.de, peterz@...radead.org,
	rostedt@...dmis.org, dhowells@...hat.com, edumazet@...gle.com,
	dvhart@...ux.intel.com, fweisbec@...il.com, oleg@...hat.com,
	sbw@....edu
Subject: Re: [PATCH tip/core/rcu 1/2] rcu: Parallelize and economize NOCB
 kthread wakeups

On Fri, Aug 22, 2014 at 02:53:44PM -0700, Paul E. McKenney wrote:
> On Fri, Aug 22, 2014 at 10:44:05PM +0530, Amit Shah wrote:
> > On (Fri) 22 Aug 2014 [07:48:19], Paul E. McKenney wrote:
> > > On Fri, Aug 22, 2014 at 06:26:49PM +0530, Amit Shah wrote:
> > > > On (Fri) 22 Aug 2014 [18:06:51], Amit Shah wrote:
> > > > > On (Fri) 22 Aug 2014 [17:54:53], Amit Shah wrote:
> > > > > > On (Mon) 18 Aug 2014 [21:01:49], Paul E. McKenney wrote:
> > > > > > 
> > > > > > > The odds are low over the next few days.  I am adding nastier rcutorture
> > > > > > > testing, however.  It would still be very good to get debug information
> > > > > > > from your setup.  One approach would be to convert the trace function
> > > > > > > calls into printk(), if that would help.
> > > > > > 
> > > > > > I added a few printks on the lines of the traces in cases where
> > > > > > rcu_nocb_poll was checked -- since that reproduces the hang.  Are the
> > > > > > following traces sufficient, or should I keep adding more printks?
> > > > > > 
> > > > > > In the case of rcu-trace-nopoll.txt, the messages stop after a while
> > > > > > (when the guest locks up hard).  That's when I kill the qemu process.
> > > > > 
> > > > > And this is bt from gdb when the endless 
> > > > > 
> > > > >   RCUDEBUG __call_rcu_nocb_enqueue 2146 rcu_preempt 0 WakeNot
> > > > > 
> > > > > messages are being spewed.
> > > > > 
> > > > > I can't time it, but hope it gives some indication along with the printks.
> > > > 
> > > > ... and after the system 'locks up', this is the state it's in:
> > > > 
> > > > ^C
> > > > Program received signal SIGINT, Interrupt.
> > > > native_safe_halt () at ./arch/x86/include/asm/irqflags.h:50
> > > > 50		 }
> > > > (gdb) bt
> > > > #0  native_safe_halt () at ./arch/x86/include/asm/irqflags.h:50
> > > > #1  0xffffffff8100b9c1 in arch_safe_halt () at ./arch/x86/include/asm/paravirt.h:111
> > > > #2  default_idle () at arch/x86/kernel/process.c:311
> > > > #3  0xffffffff8100c107 in arch_cpu_idle () at arch/x86/kernel/process.c:302
> > > > #4  0xffffffff8106a25a in cpuidle_idle_call () at kernel/sched/idle.c:120
> > > > #5  cpu_idle_loop () at kernel/sched/idle.c:220
> > > > #6  cpu_startup_entry (state=<optimized out>) at kernel/sched/idle.c:268
> > > > #7  0xffffffff813e068b in rest_init () at init/main.c:418
> > > > #8  0xffffffff81a8cf5a in start_kernel () at init/main.c:680
> > > > #9  0xffffffff81a8c4ba in x86_64_start_reservations (real_mode_data=<optimized out>) at arch/x86/kernel/head64.c:193
> > > > #10 0xffffffff81a8c607 in x86_64_start_kernel (real_mode_data=0x13f90 <cpu_lock_stats+29184> <error: Cannot access memory at address 0x13f90>)
> > > >     at arch/x86/kernel/head64.c:182
> > > > #11 0x0000000000000000 in ?? ()
> > > > 
> > > > 
> > > > Wondering why it's doing this.  Am stepping through
> > > > cpu_startup_entry() to see if I get any clues.
> > > 
> > > This looks to me like normal behavior in the x86 ACPI idle loop.
> > > My guess is that the lockup is caused by indefinite blocking, in
> > > which case we would expect all the CPUs to be in the idle loop.
> > 
> > Hm, found it:
> > 
> > The stall happens in do_initcalls().
> > 
> > pm_sysrq_init() is the function that causes the hang.  When I #if 0
> > the line
> > 
> >     register_sysrq_key('o', &sysrq_poweroff_op);
> > 
> > in pm_sysrq_init(), the boot proceeds normally.
> 
> Yow!!!
> 
> > Now what this is, and what relation this has to rcu and that patch in
> > particular is next...
> 
> Hmmm...  Please try replacing the synchronize_rcu() in
> __sysrq_swap_key_ops() with (say) schedule_timeout_interruptible(HZ / 10).
> I bet that gets rid of the hang.  (And also introduces a low-probability
> bug, but should be OK for testing.)
> 
> The other thing to try is to revert your patch that turned my event
> traces into printk()s, then put an ftrace_dump(DUMP_ALL); just after
> the synchronize_rcu() -- that might make it so that the ftrace data
> actually gets dumped out.

And one other thing to try...

Put a printk at the beginning of rcu_spawn_gp_kthread(), which is in
kernel/rcu/tree.c.  If that printk does not appear before the call
to pm_sysrq_init(), that would be an important clue.

							Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/