linux-kernel - Re: frequent lockups in 3.18rc4

Message-ID: <5475596A.9010301@suse.com>
Date:	Wed, 26 Nov 2014 05:39:06 +0100
From:	Jürgen Groß <jgross@...e.com>
To:	Linus Torvalds <torvalds@...ux-foundation.org>,
	Dave Jones <davej@...hat.com>,
	Linux Kernel <linux-kernel@...r.kernel.org>,
	the arch/x86 maintainers <x86@...nel.org>
Subject: Re: frequent lockups in 3.18rc4

On 11/26/2014 02:48 AM, Linus Torvalds wrote:
> On Tue, Nov 25, 2014 at 4:25 PM, Dave Jones <davej@...hat.com> wrote:
>>
>> The reason I'm checking in at this point, is that I'm starting to see different
>> bugs at this point, so I don't know if I can call this good or bad, unless
>> someone has a fix for what I'm seeing now.
>
> Hmm. The three last "bad" bisects are all just 3.17-rc1 plus staging fixes.
>
>> Reminiscent of a bug a couple releases ago. Processes about to exit, but stuck
>> in the kernel continuously faulting..
>> http://codemonkey.org.uk/junk/weird-hang.txt
>> The one I'm thinking of got fixed way before 3.17 though.
>
> Well, the staging tree was based on that 3.17-rc1 tree, so it may well
> have the bug without the fix.
>
> You have also marked 3.18-rc1 bad *twice*, along with the network
> merge, and the tty merge. That's just odd. But it doesn't make the
> bisect wrong, it just means that you fat-fingered things and marked the
> same thing bad a couple of times.
>
> Nothing to worry about, unless it's a sign of early Parkinsons...
>
>> Does that trace ring a bell of something else I could try on top of
>> each bisection point ?
>
> Hmm.
>
> Smells somewhat like the "pipe/page fault oddness" bug you reported.
>
> That one caused endless page faults on fault_in_pages_writeable()
> because of a page table entry that the VM thought was present, but the
> CPU thought was missing.
>
> That caused the whole "pte_protnone()" thing, and trying to get rid of
> the PTE_NUMA bit, but those patches have *not* been merged. And you
> were never able to reproduce it, so we left it as pending.
>
> But if you actually really think that the bisect log you posted is
> real and true and actually is the bug you're chasing, I have bad news
> for you: do a "gitk --bisect", and you'll see that all the remaining
> commits are just to staging drivers.
>
> So that would either imply you have some staging driver (unlikely), or
> more likely that 3.17 really already has the problem, it's just that
> it needs some particular code alignment or phase of the moon or
> something to trigger.

I COULD trigger it with 3.17. Took much longer, but I've seen it once.
And from Xen hypervisor data it was clear it was the same bug (cpu
spinning in pmd_lock()).


Juergen
