linux-kernel - Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20160330125837.GC14352@fixme-laptop.cn.ibm.com>
Date:	Wed, 30 Mar 2016 20:58:38 +0800
From:	Boqun Feng <boqun.feng@...il.com>
To:	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
Cc:	Peter Zijlstra <peterz@...radead.org>,
	Jacob Pan <jacob.jun.pan@...ux.intel.com>,
	Josh Triplett <josh@...htriplett.org>,
	Ross Green <rgkernel@...il.com>,
	Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
	John Stultz <john.stultz@...aro.org>,
	Thomas Gleixner <tglx@...utronix.de>,
	lkml <linux-kernel@...r.kernel.org>,
	Ingo Molnar <mingo@...nel.org>,
	Lai Jiangshan <jiangshanlai@...il.com>, dipankar@...ibm.com,
	Andrew Morton <akpm@...ux-foundation.org>,
	rostedt <rostedt@...dmis.org>,
	David Howells <dhowells@...hat.com>,
	Eric Dumazet <edumazet@...gle.com>,
	Darren Hart <dvhart@...ux.intel.com>,
	Frédéric Weisbecker <fweisbec@...il.com>,
	Oleg Nesterov <oleg@...hat.com>,
	pranith kumar <bobby.prani@...il.com>,
	"Chatre, Reinette" <reinette.chatre@...el.com>
Subject: Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17

Hi Paul,

On Mon, Mar 28, 2016 at 06:29:32AM -0700, Paul E. McKenney wrote:
> On Mon, Mar 28, 2016 at 08:28:51AM +0200, Peter Zijlstra wrote:
> > On Sun, Mar 27, 2016 at 02:09:14PM -0700, Paul E. McKenney wrote:
> > 
> > > > Does that system have MONITOR/MWAIT errata?
> > > 
> > > On the off-chance that this question was also directed at me,
> > 
> > Hehe, it wasn't, however, since we're here..
> > 
> > > here is
> > > what I am running on.  I am running in a qemu/KVM virtual machine, in
> > > case that matters.
> > 
> > Have you actually tried on real proper hardware? Does it still reproduce
> > there?
> 
> Ross has, but I have not, given that I have a shared system on the one
> hand and a single-socket (four core, eight hardware thread) laptop on
> the other that has even longer reproduction times.  The repeat-by is
> as follows:
> 
> o	Build a kernel with the following Kconfigs:
> 
> 	CONFIG_SMP=y
> 	CONFIG_NR_CPUS=16
> 	CONFIG_PREEMPT_NONE=n
> 	CONFIG_PREEMPT_VOLUNTARY=n
> 	CONFIG_PREEMPT=y
> 	# This should result in CONFIG_PREEMPT_RCU=y
> 	CONFIG_HZ_PERIODIC=y
> 	CONFIG_NO_HZ_IDLE=n
> 	CONFIG_NO_HZ_FULL=n
> 	CONFIG_RCU_TRACE=y
> 	CONFIG_HOTPLUG_CPU=y
> 	CONFIG_RCU_FANOUT=2
> 	CONFIG_RCU_FANOUT_LEAF=2
> 	CONFIG_RCU_NOCB_CPU=n
> 	CONFIG_DEBUG_LOCK_ALLOC=n
> 	CONFIG_RCU_BOOST=y
> 	CONFIG_RCU_KTHREAD_PRIO=2
> 	CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
> 	CONFIG_RCU_EXPERT=y
> 	CONFIG_RCU_TORTURE_TEST=y
> 	CONFIG_PRINTK_TIME=y
> 	CONFIG_RCU_TORTURE_TEST_SLOW_CLEANUP=y
> 	CONFIG_RCU_TORTURE_TEST_SLOW_INIT=y
> 	CONFIG_RCU_TORTURE_TEST_SLOW_PREINIT=y
> 
> 	If desired, you can instead build with CONFIG_RCU_TORTURE_TEST=m
> 	and modprobe/insmod the module manually.
> 
> o	Find a two-socket x86 system or larger, with at least 16 CPUs.
> 
> o	Boot the kernel with the following kernel boot parameters:
> 
> 	rcutorture.onoff_interval=1 rcutorture.onoff_holdoff=30
> 
> 	The onoff_holdoff is only needed for CONFIG_RCU_TORTURE_TEST=y.
> 	When manually setting up the module, you get the holdoff for
> 	free, courtesy of human timescales.
> 

I ran several tests similar to this setup in a guest, the guest was
running in a 12-core Xeon X5690 host with kernel 4.4.5. The guest was
given 16 VCPUs and 2G memory.

Rather than RCU stall warnings, I was able to see a bug like this nearly
*everytime*:

[  308.291635] BUG: unable to handle kernel NULL pointer dereference at           (null)
[  308.292063] IP: [<ffffffff810b9452>] migrate_timer_list+0x32/0xc0
[  308.292063] PGD 0 
[  308.292063] Oops: 0002 [#1] PREEMPT SMP 
[  308.292063] Modules linked in:
[  308.292063] CPU: 11 PID: 798 Comm: torture_onoff Not tainted 4.5.0-rc1+ #14
[  308.292063] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.1-20160216_104851-anatol 04/01/2014
[  308.292063] task: ffff88007cafcb00 ti: ffff88007c6f4000 task.ti: ffff88007c6f4000
[  308.292063] RIP: 0010:[<ffffffff810b9452>]  [<ffffffff810b9452>] migrate_timer_list+0x32/0xc0
[  308.292063] RSP: 0018:ffff88007c6f7c98  EFLAGS: 00010086
[  308.292063] RAX: dead000000000200 RBX: ffff88007fb6cc00 RCX: ffff88007fa2cc00
[  308.292063] RDX: 0000000000000000 RSI: ffff88007fa2d6d0 RDI: ffff88007fb6cc00
[  308.292063] RBP: ffff88007c6f7cc0 R08: 0000000000000000 R09: ffffffff81e4d360
[  308.292063] R10: ffffea0001f2daa0 R11: 0000000000000000 R12: ffff88007fa2d6d0
[  308.292063] R13: ffff88007fb6cc00 R14: 000000000000000b R15: ffff88007fa2c420
[  308.292063] FS:  0000000000000000(0000) GS:ffff88007fb60000(0000) knlGS:0000000000000000
[  308.292063] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  308.292063] CR2: 0000000000000000 CR3: 0000000001e0f000 CR4: 00000000000006e0
[  308.292063] Stack:
[  308.292063]  ffff88007fb6cc00 ffff88007fa2d6d8 ffff88007fa2dad0 ffff88007fa2d8d0
[  308.292063]  ffff88007fa2d4d8 ffff88007c6f7d08 ffffffff810b95c4 ffff88007fa2cc00
[  308.292063]  ffff88007fa2d648 00000000fffffff0 ffffffff81e4e0c0 0000000000000007
[  308.292063] Call Trace:
[  308.292063]  [<ffffffff810b95c4>] timer_cpu_notify+0xe4/0x160
[  308.292063]  [<ffffffff81073ea5>] notifier_call_chain+0x45/0x70
[  308.292063]  [<ffffffff81073fc9>] __raw_notifier_call_chain+0x9/0x10
[  308.292063]  [<ffffffff8105759e>] cpu_notify_nofail+0x1e/0x30
[  308.292063]  [<ffffffff81057852>] _cpu_down+0x212/0x350
[  308.292063]  [<ffffffff810b3350>] ? __call_rcu.constprop.74+0x380/0x380
[  308.292063]  [<ffffffff810b33b0>] ? call_rcu_bh+0x20/0x20
[  308.292063]  [<ffffffff810a8950>] ? __rcu_read_unlock+0x50/0x50
[  308.292063]  [<ffffffff810a8950>] ? __rcu_read_unlock+0x50/0x50
[  308.292063]  [<ffffffff810579bf>] cpu_down+0x2f/0x50
[  308.292063]  [<ffffffff811266b5>] torture_onoff+0x115/0x380
[  308.292063]  [<ffffffff811265a0>] ? torture_kthread_stopping+0x70/0x70
[  308.292063]  [<ffffffff810732f6>] kthread+0xd6/0xf0
[  308.292063]  [<ffffffff81073220>] ? kthread_worker_fn+0x180/0x180
[  308.292063]  [<ffffffff818b499f>] ret_from_fork+0x3f/0x70
[  308.292063]  [<ffffffff81073220>] ? kthread_worker_fn+0x180/0x180
[  308.292063] Code: 56 41 55 41 54 53 48 8b 06 44 8b 77 40 48 85 c0 74 53 49 89 fd 49 89 f4 4d 8b 3c 24 66 66 66 66 90 49 8b 07 49 8b 57 08 48 85 c0 <48> 89 02 74 04 48 89 50 08 48 b8 00 02 00 00 00 00 ad de 4c 89 
[  308.292063] RIP  [<ffffffff810b9452>] migrate_timer_list+0x32/0xc0
[  308.292063]  RSP <ffff88007c6f7c98>
[  308.292063] CR2: 0000000000000000
[  308.292063] ---[ end trace 29cb5d70ad875872 ]---


I'm not sure whether this is relevant, maybe this was caused by my
setup? But I attach two kernel logs and my config in case this could
help us solve the problem.

Another interesting I observed was that the time stamps of printk went
back in a while in some run, please note the log named "dmesg.1" in the
attachments.

A quick analysis of this bug is the null dereference happened at:

	migrate_timer_list()
	  detach_timer();
	    __list_del();
	      WRITE_ONCE(*pprev, next); <- pprev is NULL

This seems highly unlikely, but as I said, I hit this almost every time
after a few minutes of starting rcutoture.

Any idea about whether this could be related to the RCU stall problem,
or this is another bug, or this actually happend by design for some
reasons?

Regards,
Boqun

> In the absence of instrumentation, I get failures usually within a
> couple of hours, though sometimes much longer.  With instrumentation,
> the sky appears to be the limit.  :-/
> 
> Ross is running on bare metal with no CPU hotplug, so perhaps his setup
> is of more immediate interest.  He is seeing the same symptoms that I am,
> namely a task being repeatedly awakened without actually coming out of
> TASK_INTERRUPTIBLE state, let alone running.  As you pointed out earlier,
> he cannot be seeing the same bug that my crude patch suppresses, but
> given that I still see a few failures with that crude patch, it is quite
> possible that there is still a common bug.
> 
> 							Thanx, Paul
> 

View attachment "dmesg.1" of type "text/plain" (244585 bytes)

View attachment "dmesg.2" of type "text/plain" (150216 bytes)

View attachment "config" of type "text/plain" (104942 bytes)

Download attachment "signature.asc" of type "application/pgp-signature" (474 bytes)