linux-kernel - Re: frequent lockups in 3.18rc4

Message-ID: <CA+55aFz6KEKdz4Nxkx2fa-FH3PZ+Aa51iA_nXobsQ-dDW5PGEg@mail.gmail.com>
Date:	Mon, 15 Dec 2014 15:46:41 -0800
From:	Linus Torvalds <torvalds@...ux-foundation.org>
To:	Dave Jones <davej@...hat.com>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Chris Mason <clm@...com>,
	Mike Galbraith <umgwanakikbuti@...il.com>,
	Ingo Molnar <mingo@...nel.org>,
	Peter Zijlstra <peterz@...radead.org>,
	Dâniel Fraga <fragabr@...il.com>,
	Sasha Levin <sasha.levin@...cle.com>,
	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Suresh Siddha <sbsiddha@...il.com>,
	Oleg Nesterov <oleg@...hat.com>,
	Peter Anvin <hpa@...ux.intel.com>
Subject: Re: frequent lockups in 3.18rc4

On Mon, Dec 15, 2014 at 10:21 AM, Linus Torvalds
<torvalds@...ux-foundation.org> wrote:
>
> So let's just fix it. Here's a completely untested patch.

So after looking at this more, I'm actually really convinced that this
was a pretty nasty bug.

I'm *not* convinced that it's necessarily *your* bug, but I still
think it could be.

I cleaned up the patch a bit, split it up into two to clarify it, and
have committed it to my tree. I'm not marking the patches for stable,
because while I'm convinced it's a bug, I'm also not sure why, even if
it triggers, it doesn't eventually recover when the IO completes. So
I'd mark them for stable only if they are actually confirmed to fix
anything in the wild, and after they've gotten some testing in
general. The patches *look* straightforward, they remove more lines
than they add, and I think the code is more understandable too, but
maybe I just screwed up. Whatever. Some care is warranted, but this is
the first time I feel like I actually fixed something that matched at
least one of your lockup symptoms.

Anyway, it's there as

  26178ec11ef3 ("x86: mm: consolidate VM_FAULT_RETRY handling")
  7fb08eca4527 ("x86: mm: move mmap_sem unlock from mm_fault_error() to caller")

and I'll continue to look at the page fault patch. I still have a
slight worry that it's something along the lines of corrupted page
tables or some core VM issue, but apart from my general nervousness
about the auto-numa code (which will be cleaned up eventually through
the pte_protnone patches), I can't actually see how you'd get into
endless page faults any other way. So I'm really hoping that the buggy
VM_FAULT_RETRY handling explains it.
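
For readers who have not looked at the fault path, here is a small
user-space model of the retry rule that the two commits above are about.
It is only a sketch and not the kernel code: struct sim,
sim_handle_mm_fault(), sim_page_fault() and the flag values are
hypothetical stand-ins. It also assumes the problematic case is a
VM_FAULT_RETRY return with a fatal signal pending for a kernel-mode
fault, which this message does not spell out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel flags; values are illustrative. */
#define FAULT_FLAG_ALLOW_RETRY 0x1u
#define FAULT_FLAG_TRIED       0x2u
#define VM_FAULT_RETRY         0x1u

struct sim {
    bool mmap_sem_held;   /* models mm->mmap_sem held for read */
    bool io_done;         /* the page is unavailable until the "IO" completes */
    bool fatal_signal;    /* models a pending fatal signal */
    int  handler_calls;   /* how many times the fault handler ran */
};

/* Models the core fault handler: it may drop the lock and ask for a retry. */
static unsigned int sim_handle_mm_fault(struct sim *s, unsigned int flags)
{
    assert(s->mmap_sem_held);           /* caller must hold the lock */
    s->handler_calls++;
    if (!s->io_done && (flags & FAULT_FLAG_ALLOW_RETRY)) {
        s->mmap_sem_held = false;       /* lock dropped while waiting */
        s->io_done = true;              /* pretend the IO finished meanwhile */
        return VM_FAULT_RETRY;
    }
    s->io_done = true;                  /* no retry allowed: wait with lock held */
    return 0;
}

enum fault_result { FAULT_HANDLED, FAULT_KILLED_USER, FAULT_NO_CONTEXT };

/* Models the arch fault handler with one consolidated retry path. */
static enum fault_result sim_page_fault(struct sim *s, bool user_mode)
{
    unsigned int flags = FAULT_FLAG_ALLOW_RETRY;
    unsigned int fault;

retry:
    s->mmap_sem_held = true;            /* take the lock */
    fault = sim_handle_mm_fault(s, flags);

    if (fault & VM_FAULT_RETRY) {
        /* The handler already released the lock: do not unlock again. */
        if (flags & FAULT_FLAG_ALLOW_RETRY) {
            flags &= ~FAULT_FLAG_ALLOW_RETRY;   /* retry at most once */
            flags |= FAULT_FLAG_TRIED;
            if (!s->fatal_signal)
                goto retry;
        }
        if (user_mode)
            return FAULT_KILLED_USER;   /* signal is handled on return to user */
        return FAULT_NO_CONTEXT;        /* kernel mode: fix up, never just re-fault */
    }

    s->mmap_sem_held = false;           /* unlock in the caller, in one place */
    return FAULT_HANDLED;
}
```

In this model a plain fault runs the handler twice and ends with the
lock released, while a kernel-mode fault with a fatal signal pending
ends in the fix-up branch after a single handler call instead of
returning and faulting on the same address again.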

But me not seeing any other bug clearly doesn't mean it doesn't exist.

                    Linus