Message-ID: <4DB990A1.8090205@sandia.gov>
Date: Thu, 28 Apr 2011 10:06:57 -0600
From: "Jim Schutt" <jaschut@...dia.gov>
To: paulmck@...ux.vnet.ibm.com
cc: linux-kernel@...r.kernel.org
Subject: Re: mcelog stalls on 2.6.39-rc5
Jim Schutt wrote:
> Paul E. McKenney wrote:
>> On Wed, Apr 27, 2011 at 01:59:01PM -0600, Jim Schutt wrote:
>>> Hi,
>>>
>>> Testing 2.6.39-rc5 is giving me the following stall:
>>>
>>> [ 5767.731001] INFO: rcu_sched_state detected stall on CPU 1 (t=60001 jiffies)
>>> [ 5767.732001] INFO: rcu_sched_state detected stalls on CPUs/tasks: { 1} (detected by 0, t=60002 jiffies)
>>> [ 5947.763001] INFO: rcu_sched_state detected stall on CPU 1 (t=240032 jiffies)
>>> [ 5947.764001] INFO: rcu_sched_state detected stalls on CPUs/tasks: { 1} (detected by 0, t=240034 jiffies)
>>> [ 6024.489362] libceph: mon0 172.17.40.34:6789 socket closed
>>> [ 6121.281139] INFO: task mcelog:6513 blocked for more than 120 seconds.
>>> [ 6121.287575] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>> [ 6121.295397] ffff880177aefce8 0000000000000082 ffffffff810339b6 ffff880226d35a40
>>> [ 6121.302840] ffff88018226c3b0 ffff88018226c3b0 0000000000011e80 ffff880226d35a40
>>> [ 6121.310284] ffff88018226c760 ffff880177aefe80 ffff880177aefd18 ffffffff813af047
>>> [ 6121.317725] Call Trace:
>>> [ 6121.320176] [<ffffffff810339b6>] ? calc_load_account_idle+0xe/0x1d
>>> [ 6121.326437] [<ffffffff813af047>] schedule+0x159/0x193
>>> [ 6121.331569] [<ffffffff813af449>] schedule_timeout+0x36/0xe2
>>> [ 6121.337223] [<ffffffff810ad9eb>] ? trace_hardirqs_on+0x9/0x20
>>> [ 6121.343047] [<ffffffff813aee2b>] do_wait_for_common+0x97/0xe3
>>> [ 6121.348967] [<ffffffff8103e8fa>] ? try_to_wake_up+0x200/0x200
>>> [ 6121.354794] [<ffffffff8107a4bf>] ? __raw_spin_lock_irq+0x17/0x2f
>>> [ 6121.360878] [<ffffffff813af2a9>] wait_for_common+0x36/0x4d
>>> [ 6121.366441] [<ffffffff813af378>] wait_for_completion+0x1d/0x1f
>>> [ 6121.372356] [<ffffffff8109ae6d>] synchronize_sched+0x40/0x49
>>> [ 6121.378096] [<ffffffff810635b8>] ? find_get_pid+0x1b/0x1b
>>> [ 6121.383574] [<ffffffff81015fe5>] mce_read+0x17f/0x25d
>>> [ 6121.388707] [<ffffffff81111af5>] ? rw_verify_area+0xac/0xdb
>>> [ 6121.394358] [<ffffffff811121f1>] vfs_read+0xa9/0xe1
>>> [ 6121.399317] [<ffffffff8111248d>] sys_read+0x4c/0x70
>>> [ 6121.404278] [<ffffffff813b6deb>] system_call_fastpath+0x16/0x1b
>>> [ 6127.795001] INFO: rcu_sched_state detected stall on CPU 1 (t=420064 jiffies)
>>> [ 6127.796001] INFO: rcu_sched_state detected stalls on CPUs/tasks: { 1} (detected by 0, t=420066 jiffies)
>>> [ 6241.410171] INFO: task mcelog:6513 blocked for more than 120 seconds.
>>>
>>> Reverting commit a4dd99250dc makes the stalls go away:
>>>
>>> rcu: create new rcu_access_index() and use in mce
>>>
>>> The MCE subsystem needs to sample an RCU-protected index outside of
>>> any protection for that index.  If this was a pointer, we would use
>>> rcu_access_pointer(), but there is no corresponding rcu_access_index().
>>> This commit therefore creates an rcu_access_index() and applies it to MCE.
>>>
>>> Signed-off-by: Paul E. McKenney <paulmck@...ux.vnet.ibm.com>
>>> Tested-by: Zdenek Kabelac <zkabelac@...hat.com>
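[ For context: rcu_access_index(), like rcu_access_pointer(), boils down to
  sampling the value through ACCESS_ONCE() so the compiler can neither reload
  nor tear the load, without claiming any RCU read-side protection.  Below is
  a minimal stand-alone sketch of that idea only -- not the kernel's actual
  implementation -- with invented names (sample_index, log_next) standing in
  for the real mcelog.next access: ]

    /* Sketch of an ACCESS_ONCE()-based sampled read; illustration only. */
    #include <stdio.h>

    /* Same shape as the kernel's ACCESS_ONCE(): a volatile-qualified read. */
    #define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

    static unsigned int log_next;   /* stands in for an RCU-protected index */

    /* Sample the index once, without dereferencing anything it guards. */
    static unsigned int sample_index(void)
    {
            return ACCESS_ONCE(log_next);
    }

    int main(void)
    {
            log_next = 3;
            printf("sampled index: %u\n", sample_index());
            return 0;
    }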
>>
>> Wow!  This is just a few instructions, just wrapping the access in
>> ACCESS_ONCE().
>>
>> Was the original RCU CPU stall repeatable?
>
> Yes. I saw it on two different machines.
> Both were running as Ceph clients, doing sustained
> streaming writes, if that helps any.
>
> I can attempt to repeat with any extra debugging
> you'd like me to try.
Evidently I cannot reproduce this after all, so my revert
test was incorrect.
If I can figure out how to reproduce this I'll follow up with details;
in the meantime, please ignore this, and sorry for the noise.
-- Jim
>
> -- Jim
>
>>
>> Thanx, Paul
>>