linux-kernel - Re: Threaded irqs + 100% CPU RT task = RCU stall

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Wed, 13 Mar 2013 17:03:07 -0400
From:	Paul Gortmaker <paul.gortmaker@...driver.com>
To:	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
CC:	Thomas Gleixner <tglx@...utronix.de>,
	<linux-rt-users@...r.kernel.org>, <linux-kernel@...r.kernel.org>
Subject: Re: Threaded irqs + 100% CPU RT task = RCU stall

[Re: Threaded irqs + 100% CPU RT task = RCU stall] On 06/03/2013 (Wed 13:45) Paul E. McKenney wrote:

[...]

> 
> Is this behavior OK?  If so, the following (untested) patch might do
> what you want.  ;-)
> 
> 							Thanx, Paul
> 
> ------------------------------------------------------------------------
> 
> rcu: Add softirq-stall indications to stall-warning messages

[...]

> 
> +The "softirq=" portion of the message tracks the number of RCU softirq
> +handlers that the stalled CPU has executed.  The number before the "/"
> +is the number that had executed since boot at the time that this CPU
> +last noted the beginning of a grace period, which might be the current
> +(stalled) grace period, or it might be some earlier grace period (for
> +example, if the CPU might have been in dyntick-idle mode for an extended
> +time period.  The number after the "/" is the number that have executed
> +since boot until the current time.  If this latter number stays constant
> +across repeated stall-warning messages, it is possible that RCU's softirq
> +handlers are no longer able to execute on this CPU.  This can happen if
> +the stalled CPU is spinning with interrupts are disabled, or, in -rt
> +kernels, if a high-priority process is starving RCU's softirq handler.

Here is the output of two consecutive stalls (triggered exactly as I'd
described before) after applying the commit and enabling the new config
option for RCU_CPU_STALL_INFO (btw, do we need this?  we already have
the RCU_CPU_STALL_VERBOSE option, and the distinction isn't clear.)

Looking at the output, it doesn't necessarily scream out "you are an
idiot" in a way that Joe Average can immediately parse and understand,
but I guess it does at least arm us with more information so that we
can tell Joe Average that he is an idiot (assuming he posts more than
just a single stall instance).

Also note right after the <EOI> below, it looks like two stall
messages got interleavedi, or a carriage return got dropped...
(not suggesting that this patch caused that.).

Paul.
--

INFO: rcu_preempt self-detected stall on CPU
INFO: rcu_preempt detected stalls on CPUs/tasks:
	1: (1 GPs behind) idle=f4f/140000000000001/0 softirq=2256/2257 
	(detected by 5, t=60002 jiffies, g=324, c=323, q=1368)
Task dump for CPU 1:
eatme-simple    R  running task        0  1487   1433 0x00000000
 ffff88042ef47f60 ffffffff81316de1 ffff88042e5f5810 ffff88042ef47fd8
 0000000000010c00 ffff88042ef47fd8 ffff88042f994210 ffff88042e5f5810
 0000000000000000 ffff88043f4fe980 ffffffff810a56e4 0000000000000203
Call Trace:
 [<ffffffff81316de1>] ? __schedule+0x62a/0x75e
 [<ffffffff810a56e4>] ? dput+0x20/0x15c
 [<ffffffff81096070>] ? __fput+0x1a1/0x1c8
 [<ffffffff810ab833>] ? mntput_no_expire+0x13/0x11f
 [<ffffffff8101c55c>] ? do_page_fault+0x1f/0x3b

	1: (1 GPs behind) idle=f4f/140000000000001/0 softirq=2256/2257 
	 (t=60082 jiffies g=324 c=323 q=1368)
Pid: 1487, comm: eatme-simple Not tainted 3.9.0-rc2+ #2
Call Trace:
 <IRQ>  [<ffffffff8105cac6>] ? rcu_check_callbacks+0x215/0x61a
 [<ffffffff8102cffc>] ? update_process_times+0x31/0x5c
 [<ffffffff8104bb5b>] ? tick_handle_periodic+0x18/0x52
 [<ffffffff81016328>] ? smp_apic_timer_interrupt+0x7d/0x8f
 [<ffffffff8131944a>] ? apic_timer_interrupt+0x6a/0x70
 <EOI> 
INFO: rcu_preempt self-detected stall on CPUINFO: rcu_preempt detected stalls on CPUs/tasks:
	1: (1 GPs behind) idle=f4f/140000000000001/0 softirq=2256/2257 
	(detected by 5, t=240007 jiffies, g=324, c=323, q=9386)
Task dump for CPU 1:
eatme-simple    R  running task        0  1487   1433 0x00000000
 ffff88042ef47f60 ffffffff81316dfd ffff88042e5f5810 ffff88042ef47fd8
 0000000000010c00 ffff88042ef47fd8 ffff88042e5f5810 ffff88042e5f5810
 0000000000000000 ffff88043f4fe980 ffffffff810a56e4 0000000000000203
Call Trace:
 [<ffffffff81316dfd>] ? __schedule+0x646/0x75e
 [<ffffffff810a56e4>] ? dput+0x20/0x15c
 [<ffffffff81096070>] ? __fput+0x1a1/0x1c8
 [<ffffffff810ab833>] ? mntput_no_expire+0x13/0x11f
 [<ffffffff8101c55c>] ? do_page_fault+0x1f/0x3b

	1: (1 GPs behind) idle=f4f/140000000000001/0 softirq=2256/2257 
	 (t=240087 jiffies g=324 c=323 q=9386)
Pid: 1487, comm: eatme-simple Not tainted 3.9.0-rc2+ #2
Call Trace:
 <IRQ>  [<ffffffff8105cac6>] ? rcu_check_callbacks+0x215/0x61a
 [<ffffffff8102cffc>] ? update_process_times+0x31/0x5c
 [<ffffffff8104bb5b>] ? tick_handle_periodic+0x18/0x52
 [<ffffffff81016328>] ? smp_apic_timer_interrupt+0x7d/0x8f
 [<ffffffff8131944a>] ? apic_timer_interrupt+0x6a/0x70
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/