linux-kernel - Re: rcu self-detected stall messages on OMAP3, 4 boards

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Date:	Wed, 12 Sep 2012 18:12:08 -0700
From:	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
To:	Paul Walmsley <paul@...an.com>
Cc:	"Paul E. McKenney" <paul.mckenney@...aro.org>,
	linux-kernel@...r.kernel.org, bbruce@...com,
	linux-omap@...r.kernel.org, linux-arm-kernel@...ts.infradead.org,
	santosh.shilimkar@...com, jon-hunter@...com, snijsure@...d-net.com
Subject: Re: rcu self-detected stall messages on OMAP3, 4 boards

On Wed, Sep 12, 2012 at 10:51:30PM +0000, Paul Walmsley wrote:
> 
> Hi Paul
> 
> Recently several of us have been seeing "INFO: rcu_sched self-detected
> stall on CPU { 0} (t=20611 jiffies)" stack tracebacks on various OMAP3
> and 4 board.
> 
> I only noticed it during v3.6-rc3, but I suspect it's been happening 
> for users at least since May:
> 
> http://www.mail-archive.com/linux-omap@vger.kernel.org/msg68942.html
> 
> 
> The only quasi-reproducible test case that I've found so far 
> is to boot the board with serial console enabled to a login prompt, then 
> wait for a few minutes, then send a keypress to the board via serial.
> The tracebacks I get look like this:

Interesting.  I am assuming that the interrupt in the stack below came
from idle, if not, please let me know what.

Could you please reproduce with CONFIG_RCU_CPU_STALL_INFO=y?  That would
give me a bit more information about why RCU thought that there was
a stall.  (CCing Becky Bruce, who saw something similar recently.)

Subodh Nijsure (also CCed) reported something that might be similar on
ARM, and also reported that setting the following got rid of the stalls:

	CONFIG_CPU_IDLE=y
	CONFIG_CPU_IDLE_GOV_LADDER=y
	CONFIG_CPU_IDLE_GOV_MENU=y

At which point he was happy, which was good, but which also left the
underlying problem unsolved.  Do these affect your system?  If so,
do they cause a different ARM idle loop to be executed?

							Thanx, Paul

> [  467.480712] INFO: rcu_sched self-detected stall on CPU { 0}  (t=20611 jiffies)
> [  467.484741] [<c001b7cc>] (unwind_backtrace+0x0/0xf0) from [<c00acc94>] (rcu_check_callbacks+0x180/0x630)
> [  467.489929] [<c00acc94>] (rcu_check_callbacks+0x180/0x630) from [<c0052b18>] (update_process_times+0x38/0x68)
> [  467.495361] [<c0052b18>] (update_process_times+0x38/0x68) from [<c008c04c>] (tick_sched_timer+0x80/0xec)
> [  467.500518] [<c008c04c>] (tick_sched_timer+0x80/0xec) from [<c0068544>] (__run_hrtimer+0x7c/0x1e0)
> [  467.505401] [<c0068544>] (__run_hrtimer+0x7c/0x1e0) from [<c0069328>] (hrtimer_interrupt+0x11c/0x2d0)
> [  467.510437] [<c0069328>] (hrtimer_interrupt+0x11c/0x2d0) from [<c001a04c>] (twd_handler+0x30/0x44)
> [  467.515350] [<c001a04c>] (twd_handler+0x30/0x44) from [<c00a71a0>] (handle_percpu_devid_irq+0x90/0x13c)
> [  467.520477] [<c00a71a0>] (handle_percpu_devid_irq+0x90/0x13c) from [<c00a3914>] (generic_handle_irq+0x30/0x48)
> [  467.525939] [<c00a3914>] (generic_handle_irq+0x30/0x48) from [<c0014c58>] (handle_IRQ+0x4c/0xac)
> [  467.530731] [<c0014c58>] (handle_IRQ+0x4c/0xac) from [<c0008478>] (gic_handle_irq+0x28/0x5c)
> [  467.535339] [<c0008478>] (gic_handle_irq+0x28/0x5c) from [<c04f8ce4>] (__irq_svc+0x44/0x5c)
> [  467.539886] Exception stack(0xc0729f58 to 0xc0729fa0)
> [  467.542663] 9f40:                                                       00047f2a 00000001
> [  467.547119] 9f60: 00000000 c074a940 c0728000 c07c4b08 c05045a0 c074be20 00000000 411fc092
> [  467.551574] 9f80: c074c040 00000000 00000001 c0729fa0 00047f2b c0014f50 20000113 ffffffff
> [  467.556030] [<c04f8ce4>] (__irq_svc+0x44/0x5c) from [<c0014f50>] (default_idle+0x20/0x44)
> [  467.560485] [<c0014f50>] (default_idle+0x20/0x44) from [<c001517c>] (cpu_idle+0x9c/0x114)
> [  467.564971] [<c001517c>] (cpu_idle+0x9c/0x114) from [<c06d77b0>] (start_kernel+0x2b4/0x304)
> 
> Looks like this message was added as of commit
> a858af2875fb291d0f4b0a4419fefbf03c2379c0 ("rcu: Print scheduling-clock
> information on RCU CPU stall-warning messages").
> 
> Do you have any suggestions for how we can determine what is causing
> this?
> 
> Here's an example of a kernel config that we use:
> 
> http://www.pwsan.com/omap/testlogs/am33xx_hwmod_clock_devel_3.7/20120912092510/build/omap2plus_defconfig/Kconfig
> 
> A few observations that may or may not be relevant: we use NO_HZ, and 
> we also have a clockevents timer that is relatively slow to program.
> 
> 
> regards,
> 
> - Paul
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/