linux-kernel - Re: mm: BUG in unmap_page

Date:	Mon, 8 Sep 2014 18:18:53 +0100
From:	Mel Gorman <mgorman@...e.de>
To:	Sasha Levin <sasha.levin@...cle.com>
Cc:	Hugh Dickins <hughd@...gle.com>,
	"linux-mm@...ck.org" <linux-mm@...ck.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Dave Jones <davej@...hat.com>,
	LKML <linux-kernel@...r.kernel.org>,
	"Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
	Peter Zijlstra <peterz@...radead.org>,
	Rik van Riel <riel@...hat.com>,
	Johannes Weiner <hannes@...xchg.org>,
	Cyrill Gorcunov <gorcunov@...il.com>
Subject: Re: mm: BUG in unmap_page_range

On Thu, Sep 04, 2014 at 05:04:37AM -0400, Sasha Levin wrote:
> On 08/29/2014 09:23 PM, Sasha Levin wrote:
> > On 08/27/2014 11:26 AM, Mel Gorman wrote:
> >> > diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> >> > index 281870f..ffea570 100644
> >> > --- a/include/asm-generic/pgtable.h
> >> > +++ b/include/asm-generic/pgtable.h
> >> > @@ -723,6 +723,9 @@ static inline pte_t pte_mknuma(pte_t pte)
> >> >  
> >> >  	VM_BUG_ON(!(val & _PAGE_PRESENT));
> >> >  
> >> > +	/* debugging only, specific to x86 */
> >> > +	VM_BUG_ON(val & _PAGE_PROTNONE);
> >> > +
> >> >  	val &= ~_PAGE_PRESENT;
> >> >  	val |= _PAGE_NUMA;
> > Triggered again, the first VM_BUG_ON got hit, the second one never did.
> 
> Okay, this bug has reproduced quite a few times since then that I no longer
> suspect it's random memory corruption. I'd be happy to try out more debug
> patches if you have any leads.
> 

The fact that the second one doesn't trigger makes me think that this is not
related to how the helpers are called and is instead related to timing.
I tried reproducing this but got nothing after 3 hours. How long does it
typically take to reproduce in a given run? You mentioned that it takes a
few weeks to hit but maybe the frequency has changed since. I tried today's
linux-next kernel but it didn't even boot, so I used next-20140826 to match
your original report but got nothing. Can you also send me the config you
used in case that's a factor?
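
For reference, the two checks in the quoted hunk can be modelled in user
space roughly as below. This is only a sketch: the DEMO_* bit positions and
the demo_* helper names are made up for illustration and do not match the
real x86 page table layout or any kernel API. It only shows the order of the
checks: an entry that is not present trips the first check before the
PROTNONE check is ever evaluated.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit positions, for illustration only (not the x86 layout). */
#define DEMO_PAGE_PRESENT  (1ULL << 0)
#define DEMO_PAGE_PROTNONE (1ULL << 8)
#define DEMO_PAGE_NUMA     (1ULL << 9)

enum demo_check {
	DEMO_OK = 0,
	DEMO_NOT_PRESENT,	/* models VM_BUG_ON(!(val & _PAGE_PRESENT)) */
	DEMO_PROTNONE_SET	/* models VM_BUG_ON(val & _PAGE_PROTNONE) */
};

/* Report which check would fire first, in the same order as the hunk. */
static enum demo_check demo_mknuma_check(uint64_t val)
{
	if (!(val & DEMO_PAGE_PRESENT))
		return DEMO_NOT_PRESENT;
	if (val & DEMO_PAGE_PROTNONE)
		return DEMO_PROTNONE_SET;
	return DEMO_OK;
}

/* Same bit flip as the hunk: clear the present bit, set the NUMA bit. */
static uint64_t demo_mknuma(uint64_t val)
{
	/* Stands in for the two VM_BUG_ON() calls; aborts if either fails. */
	assert(demo_mknuma_check(val) == DEMO_OK);

	val &= ~DEMO_PAGE_PRESENT;
	val |= DEMO_PAGE_NUMA;
	return val;
}
```

With these made-up bits, demo_mknuma_check(0) reports DEMO_NOT_PRESENT, and
so does an entry with only DEMO_PAGE_PROTNONE set, which is the pattern
Sasha is seeing: the first check fires and the second is never reached.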

I had one hunch that this may somehow be related to a collision between
pagetable teardown during exit and the scanner but I could not find a
way that could actually happen. During teardown there should be only one
user of the mm and it can't race with itself.

A worse possibility is that somehow the lock is getting corrupted but
that's also a tough sell considering that the locks should be allocated
from a dedicated cache. I guess I could try breaking that to allocate
one page per lock so DEBUG_PAGEALLOC triggers but I'm not very
optimistic.

-- 
Mel Gorman
SUSE Labs