linux-kernel - Re: frequent lockups in 3.18rc4

Message-ID: <20141119214743.GA18883@redhat.com>
Date:	Wed, 19 Nov 2014 16:47:43 -0500
From:	Dave Jones <davej@...hat.com>
To:	Andy Lutomirski <luto@...capital.net>
Cc:	Linus Torvalds <torvalds@...ux-foundation.org>,
	Don Zickus <dzickus@...hat.com>,
	Thomas Gleixner <tglx@...utronix.de>,
	Linux Kernel <linux-kernel@...r.kernel.org>,
	the arch/x86 maintainers <x86@...nel.org>,
	Peter Zijlstra <peterz@...radead.org>
Subject: Re: frequent lockups in 3.18rc4

On Wed, Nov 19, 2014 at 01:01:36PM -0800, Andy Lutomirski wrote:
 
 > TIF_NOHZ is not the same thing as NOHZ.  Can you try a kernel with
 > CONFIG_CONTEXT_TRACKING=n?  Doing that may involve fiddling with RCU
 > settings a bit.  The normal no HZ idle stuff has nothing to do with
 > TIF_NOHZ, and you either have TIF_NOHZ set or you have some kind of
 > thread_info corruption going on here.

I'll try that next.

 > > RSP: 0018:ffff880192d2fee8  EFLAGS: 00000246
 > > RAX: 0000000000000000 RBX: 0000000100000046 RCX: 000000336ee35b47
 > 
 >                                     ^^^^^^^^^
 > 
 > That is a strange coincidence.  Where did 0x46 | (1<<32) come from?
 > That's a sensible interrupts-disabled flags value with the high part set
 > to 0x1.  Those high bits are undefined, but they ought to all be zero.
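
[Editorial sketch, not part of the original message.] The RBX value
questioned above can be decoded bit by bit. The snippet below assumes
the standard x86 flag bit positions (CF=0, PF=2, AF=4, ZF=6, SF=7, TF=8,
IF=9, DF=10, OF=11, with bit 1 always reading as 1); decode_rflags is a
hypothetical helper name, not a kernel function. It splits the 64-bit
value into its low flags word and its high half, and compares it with
the EFLAGS value 0x246 shown in the same trace.

```python
# Standard x86 status/control flag bit positions in the low word of RFLAGS.
FLAG_BITS = {
    0: "CF", 2: "PF", 4: "AF", 6: "ZF", 7: "SF",
    8: "TF", 9: "IF", 10: "DF", 11: "OF",
}


def decode_rflags(value):
    """Split a 64-bit register value into flag names and the high 32 bits.

    Hypothetical helper for illustration only.
    """
    low = value & 0xFFFFFFFF   # the part that holds the defined flags
    high = value >> 32         # expected to be zero for a real flags value
    names = [name for bit, name in sorted(FLAG_BITS.items())
             if low & (1 << bit)]
    return {
        "low": low,
        "high": high,
        "flags": names,
        "fixed_bit1": bool(low & (1 << 1)),        # bit 1 always reads as 1
        "interrupts_enabled": bool(low & (1 << 9)),  # IF
    }


rbx = 0x0000000100000046     # RBX from the trace
eflags = 0x00000246          # EFLAGS from the same trace

for label, value in (("RBX", rbx), ("EFLAGS", eflags)):
    info = decode_rflags(value)
    print(label, hex(info["low"]), info["flags"],
          "IF set" if info["interrupts_enabled"] else "IF clear",
          "high=" + hex(info["high"]))
```

Under those assumptions, the low word 0x46 is PF and ZF plus the fixed
bit 1, with IF clear, which matches the "interrupts-disabled flags
value" reading. It differs from the EFLAGS value 0x246 in the trace only
by the IF bit, and the stray 0x1 in the high half is what makes the
value look corrupted rather than like a saved flags word.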

This box is usually pretty solid, but it's been in service as a 24/7
fuzzing box for over a year now, so it's not outside the realm of
possibility that this could all be a hardware fault if some memory
has gone bad or the like.  Unless we find something obvious in the
next few days, I'll try running memtest over the weekend (though
I've seen situations where that doesn't stress hardware enough to
manifest a problem, so it might not be entirely conclusive unless
it actually finds a fault).

I wish I had a second identical box to see if it would be reproducible.

 > >  [<ffffffff941689c6>] perf_read+0x226/0x370
 > >  [<ffffffff942fbfb7>] ? security_file_permission+0x87/0xa0
 > >  [<ffffffff941eafff>] vfs_read+0x9f/0x180
 > >  [<ffffffff941ebbd8>] SyS_read+0x58/0xd0
 > >  [<ffffffff947d42c9>] tracesys_phase2+0xd4/0xd9
 > 
 > Riddle me this: what are we doing in tracesys_phase2?  This is a full
 > slow-path syscall.  TIF_NOHZ doesn't cause that, I think.  I'd love to
 > see the value of ti->flags here.  Is trinity using ptrace?
 
That's one of the few syscalls we actually blacklist (mostly because it
requires some more thinking: just passing it crap can get the fuzzer
into a confused state where, for example, it thinks child processes are
dead when they aren't).  So it shouldn't ever be calling ptrace.

	Dave
