linux-kernel - Re: BUG: tick device NULL pointer during system initialization and shutdown


Message-ID: <20130701154109.GK3773@linux.vnet.ibm.com>
Date:	Mon, 1 Jul 2013 08:41:09 -0700
From:	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
To:	Thomas Gleixner <tglx@...utronix.de>
Cc:	Prarit Bhargava <prarit@...hat.com>,
	Linux Kernel <linux-kernel@...r.kernel.org>, athorlton@....com,
	CAI Qian <caiqian@...hat.com>,
	Peter Zijlstra <peterz@...radead.org>
Subject: Re: BUG: tick device NULL pointer during system initialization and
 shutdown

On Mon, Jul 01, 2013 at 03:30:47PM +0200, Thomas Gleixner wrote:
> On Mon, 1 Jul 2013, Prarit Bhargava wrote:
> > On 06/28/2013 06:52 AM, Thomas Gleixner wrote:
> > > Huch. Did the warning in the broadcast code trigger before that?
> > 
> > tglx,
> > 
> > AFAICT it does not.  Log below on the system I'm testing on.  The test is:
> > the system boots, sleeps for 30 seconds, and then reboots.
> 
> > [  270.563197] INFO: rcu_sched detected stalls on CPUs/tasks: { 51} (detected by 63, t=217205 jiffies, g=3583, c=3582, q=578)
> 
> So the stall is on CPU51, but we do not get a backtrace for CPU51. 
> 
> The backtrace trigger is only sent to online cpus. So CPU51 is offline
> already. Which makes sense as we are in the process of bringing CPUs
> down and the CPUs with backtrace are 0 and 53-63.
> 
> I'm pretty sure, that the patch which clears the stale flag is
> unrelated to this and it cures the NULL pointer dereference (the
> reason why this can happen is clear).
> 
> So now you no longer trip over the NULL pointer dereference, but
> you see a weird RCU stall on an already DEAD cpu. Note, it's dead
> because we already took CPU52 offline as well.
> 
> Paul???

Odd.  The force-quiescent-state machinery should notice that the
dead CPU gets a true return from cpu_is_offline(), at which point it
should note a quiescent state on behalf of that CPU and get on with the
grace period.
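
To make that expectation concrete, here is a rough user-space sketch of
one force-quiescent-state pass.  This is not the kernel's code: struct
cpu_sketch, force_qs_scan(), and NR_CPUS_SKETCH are hypothetical names
made up for illustration, and the "offline" flag merely stands in for
whatever the real hotplug check reports.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS_SKETCH 64	/* hypothetical CPU limit for this sketch */

/* Hypothetical per-CPU state; not the kernel's struct rcu_data. */
struct cpu_sketch {
	bool offline;	/* stand-in for the CPU-hotplug offline check */
	bool qs_seen;	/* quiescent state recorded for this grace period */
};

/*
 * One force-quiescent-state pass.  Any CPU that is still holding up
 * the grace period but is offline gets a quiescent state noted on its
 * behalf.  Returns the number of CPUs still blocking the grace period.
 */
int force_qs_scan(struct cpu_sketch *cpus, int ncpus)
{
	int blocking = 0;

	assert(ncpus >= 0 && ncpus <= NR_CPUS_SKETCH);
	for (int cpu = 0; cpu < ncpus; cpu++) {
		if (cpus[cpu].qs_seen)
			continue;		/* already reported */
		if (cpus[cpu].offline)
			cpus[cpu].qs_seen = true;	/* dead CPUs run no readers */
		else
			blocking++;		/* online and not yet quiescent */
	}
	return blocking;
}
```

In this model a stall blamed on an offline CPU could only persist if the
scan never ran or if the offline flag it consulted was wrong, which is
what the two guesses below correspond to.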

In the meantime, here are my guesses as to what might be causing this bug:

o	RCU's grace-period kthreads got stuck somehow.  One way that
	this could happen is if you don't have commit #971394f3 (Fix
	deadlock with CPU hotplug, RCU GP init, and timer migration)
	but do have CONFIG_PROVE_RCU_DELAY=y.

o	The handling of CPU-hotplug bitmaps has changed so that RCU
	needs to do something other than cpu_is_offline().  I have been
	expecting that RCU would be needing to keep its own mask of
	online CPUs at some point, but didn't think that time had
	arrived.
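
	For what it is worth, a private online mask might look
	roughly like the user-space sketch below.  Everything in it
	(struct rcu_online_sketch and the sketch_*() helpers) is
	hypothetical and assumes at most 64 CPUs; it shows only the
	bookkeeping, not locking or ordering against the hotplug
	notifiers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical private mask: one bit per CPU, up to 64 CPUs. */
struct rcu_online_sketch {
	uint64_t online_mask;
};

/* Would be called from the CPU-online path. */
void sketch_cpu_up(struct rcu_online_sketch *r, unsigned int cpu)
{
	assert(cpu < 64);
	r->online_mask |= UINT64_C(1) << cpu;
}

/* Would be called from the CPU-offline path. */
void sketch_cpu_down(struct rcu_online_sketch *r, unsigned int cpu)
{
	assert(cpu < 64);
	r->online_mask &= ~(UINT64_C(1) << cpu);
}

/* What the grace-period code would consult instead of global state. */
bool sketch_cpu_is_online(const struct rcu_online_sketch *r, unsigned int cpu)
{
	assert(cpu < 64);
	return (r->online_mask >> cpu) & 1;
}
```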

If neither of those helps, then it is time for me to add more information
to CONFIG_RCU_CPU_STALL_INFO.  ;-)

							Thanx, Paul

> Thanks,
> 
> 	tglx
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@...r.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 
