linux-kernel - Re: frequent lockups in 3.18rc4

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20141219035859.GA20022@redhat.com>
Date:	Thu, 18 Dec 2014 22:58:59 -0500
From:	Dave Jones <davej@...hat.com>
To:	Linus Torvalds <torvalds@...ux-foundation.org>
Cc:	Chris Mason <clm@...com>,
	Mike Galbraith <umgwanakikbuti@...il.com>,
	Ingo Molnar <mingo@...nel.org>,
	Peter Zijlstra <peterz@...radead.org>,
	Dâniel Fraga <fragabr@...il.com>,
	Sasha Levin <sasha.levin@...cle.com>,
	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Suresh Siddha <sbsiddha@...il.com>,
	Oleg Nesterov <oleg@...hat.com>,
	Peter Anvin <hpa@...ux.intel.com>
Subject: Re: frequent lockups in 3.18rc4

On Thu, Dec 18, 2014 at 07:49:41PM -0800, Linus Torvalds wrote:

 > And when spinlocks start getting  contention, *nested* spinlocks
 > really really hurt. And you've got all the spinlock debugging on etc,
 > don't you?

Yeah, though remember this seems to have for some reason gotten worse
in more recent builds. I've been running kitchen-sink debug kernels
for my trinity runs for the last three years, and it's only this
last few months that this has got to be enough of a problem that I'm
not seeing the more interesting bugs. (Or perhaps we're just getting
better at fixing them in -next now, so my runs are lasting longer..)

 > Also, you do have this:
 > 
 >   sched: RT throttling activated
 > 
 > so there's something going on with RT scheduling too.

I see that fairly often. I've never dug into exactly what causes it, but
it seems to be triggerable just by some long running CPU hogs.

 > So your printouts are finally starting to make sense. But I'm also
 > starting to suspect strongly that the problem is that with all your
 > lock debugging and other overheads (does this still have
 > DEBUG_PAGEALLOC?) you really are getting into a "real" softlockup
 > because things are scaling so horribly badly.
 > 
 > If you now disable spinlock debugging and lockdep, hopefully that page
 > table lock now doesn't always get hung up on the lockdep locking, so
 > it starts scaling much better, and maybe you'd not see this...

I can give it a shot.  Hopefully there's some further mitigation that
could be done to allow a workload like this to survive under a debug
build though, as we've caught *so many* bugs with this stuff in the past.

	Dave

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/