linux-kernel - Re: exit_mmap BUG

Message-ID: <alpine.LSU.2.00.1205191301430.10539@eggly.anvils>
Date:	Sat, 19 May 2012 13:45:57 -0700 (PDT)
From:	Hugh Dickins <hughd@...gle.com>
To:	Sam Portolla <samportolla@...oo.com>
cc:	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"aarcange@...hat.com" <aarcange@...hat.com>
Subject: Re: exit_mmap BUG_ON in 2.6.23

On Fri, 18 May 2012, Sam Portolla wrote:
> [please cc samPortolla@...oo.com on your replies, not subscribed to the linux-kernel mailer]
> 
> Hi, I have read the thread on same issue in 3.1:
> but this is happening on earlier GNU linux version 2.6.23 for x86_64,
> which does not have THP (I believe), nor does it have huge_memory.c.
> Is there a fix one of you experts could supply?  Issue is not reproducible
> so far, but happened on a customer site. Some info below.
> 
> kernel BUG at .../bfc/linux/kernel-2.6.x/mm/mmap.c:2049!
> 
> Line 2049 is in exit_mmap():
> 
> BUG_ON(mm->nr_ptes > (FIRST_USER_ADDRESS+PMD_SIZE-1)>>PMD_SHIFT);
> 
>  RIP: 0010:[<ffffffff80277840>]  [<ffffffff80277840>] exit_mmap+0xf0/0x100 
> [snip]
>  Call Trace:
>  [<ffffffff8022ee14>] mmput+0x44/0xd0
>  [<ffffffff802340a1>] exit_mm+0x91/0x100
>  [<ffffffff802347ea>] do_exit+0x17a/0x960
>  [<ffffffff8023c4bc>] __dequeue_signal+0xec/0x1b0
>  [<ffffffff80235048>] do_group_exit+0x38/0x90
>  [<ffffffff8023e3c6>] get_signal_to_deliver+0x2d6/0x4b0
>  [<ffffffff8020b69a>] do_notify_resume+0xaa/0x760
>  [<ffffffff8020c818>] retint_signal+0x3d/0x85

I've checked back through old ChangeLogs, and (apart from a UserModeLinux
case) I don't see any fix for a BUG_ON(nr_ptes) issue in between 2.6.19
and the much later THP issue, which you're right to think cannot be yours.

But the 2.6.19 case, and one which a video driver writer had more recently,
were both caused by unrelated code zeroing beyond what it had allocated:
happening to zero part of a higher-level page table, making it impossible
for task exit to locate all the page tables (and pages) it had to free.
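
To make that failure mode concrete, here is a minimal userspace sketch, not
kernel code. It assumes a toy two-level table with 8 slots; every toy_* name
is hypothetical. A stray write zeroes upper-level entries, teardown can then
no longer reach the tables behind them, and the nr_ptes count stays above
zero:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy two-level page table; the size is illustrative, not x86_64's. */
#define TOY_ENTRIES 8

struct toy_pte_table { unsigned long pte[TOY_ENTRIES]; };

struct toy_mm {
	struct toy_pte_table *pmd[TOY_ENTRIES];	/* upper level: pointers to PTE tables */
	unsigned long nr_ptes;			/* number of PTE tables allocated */
};

/* Allocate a PTE table in slot idx and account for it; 0 on success. */
int toy_alloc_pte_table(struct toy_mm *mm, int idx)
{
	struct toy_pte_table *t;

	if (idx < 0 || idx >= TOY_ENTRIES || mm->pmd[idx])
		return -1;
	t = calloc(1, sizeof(*t));
	if (!t)
		return -1;
	mm->pmd[idx] = t;
	mm->nr_ptes++;
	return 0;
}

/* Simulated stray write: zero `count` upper-level entries from `first`.
 * The tables behind those entries are deliberately leaked. */
void toy_stray_zero(struct toy_mm *mm, int first, int count)
{
	assert(first >= 0 && count >= 0 && first + count <= TOY_ENTRIES);
	memset(&mm->pmd[first], 0, (size_t)count * sizeof(mm->pmd[0]));
}

/* Teardown: free every PTE table still reachable from the upper level.
 * Returns the leftover nr_ptes; non-zero means some tables were not found. */
unsigned long toy_exit_mmap(struct toy_mm *mm)
{
	for (int i = 0; i < TOY_ENTRIES; i++) {
		if (mm->pmd[i]) {
			free(mm->pmd[i]);
			mm->pmd[i] = NULL;
			mm->nr_ptes--;
		}
	}
	return mm->nr_ptes;
}
```

On an untouched toy_mm, toy_exit_mmap() returns 0. After toy_stray_zero()
has wiped two populated slots it returns 2, which is the kind of mismatch
the BUG_ON in exit_mmap() reports.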

Though I can't be sure, these BUG_ON(nr_ptes) reports seem too infrequent
to be caused by bad logic in mm itself, so I suspect memory corruption in
your case too.

There's no clue here as to what the cause might be, I'm afraid.
Rebuilding your kernel with CONFIG_DEBUG_PAGEALLOC=y, and slab debugging
on, might shed more light: but that's probably not something you want to
get into on a customer site, for a problem only seen once or twice.
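
As a rough guide, the debugging setup might look like the .config fragment
below. Only CONFIG_DEBUG_PAGEALLOC is named above; the slab options are an
assumption about which allocator the tree was built with, and whether each
option is offered at all depends on the kernel version and architecture:

```
# Catch accesses to freed pages by unmapping them (named above)
CONFIG_DEBUG_PAGEALLOC=y

# If the kernel is built with the SLAB allocator
CONFIG_DEBUG_SLAB=y

# If it is built with SLUB instead; full checking may also need
# "slub_debug" on the kernel command line
CONFIG_SLUB_DEBUG=y
```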

The best I can suggest is for you to change that BUG_ON to a WARN_ON,
so at least the kernel doesn't crash there, and you might gather more
information from each time it happens; but you'll probably leak pages,
and may very well crash soon for other reasons (e.g. when evicting an
inode cannot locate all the maps of its pages).
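
In the kernel tree the change is just swapping the macro name on the line
quoted above. The userspace sketch below only illustrates what the check
does after the swap; it is not the real macro. TOY_WARN_ON and the toy_*
names are hypothetical, and the constants are my assumption for x86_64
with 4 KiB pages (FIRST_USER_ADDRESS of 0, PMD_SHIFT of 21), which puts
the limit at 0, so any leftover page table trips the check:

```c
#include <stdio.h>

/* Assumed x86_64 values with 4 KiB pages; check them against your tree. */
#define FIRST_USER_ADDRESS	0UL
#define PMD_SHIFT		21
#define PMD_SIZE		(1UL << PMD_SHIFT)

/* Counts how many times the stand-in warning has fired. */
int toy_warn_count;

/* Userspace stand-in for WARN_ON: report, count, and carry on.
 * Evaluates to 1 if the condition was true, 0 otherwise. */
#define TOY_WARN_ON(cond) \
	((cond) ? (toy_warn_count++, \
		   fprintf(stderr, "WARNING at %s:%d\n", __FILE__, __LINE__), 1) \
		: 0)

/* Only the one field the check looks at. */
struct toy_mm_stub { unsigned long nr_ptes; };

/* Right-hand side of the comparison in exit_mmap(). */
unsigned long toy_nr_ptes_limit(void)
{
	return (FIRST_USER_ADDRESS + PMD_SIZE - 1) >> PMD_SHIFT;
}

/* Same condition as the quoted BUG_ON, but it warns instead of
 * stopping; returns 1 when the warning fired. */
int toy_check_nr_ptes(const struct toy_mm_stub *mm)
{
	return TOY_WARN_ON(mm->nr_ptes > toy_nr_ptes_limit());
}
```

The point of the swap is that execution continues past the check, so the
warning and its backtrace get logged on every occurrence rather than the
machine going down on the first one.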

Hugh