Message-ID: <alpine.LSU.2.00.1205191301430.10539@eggly.anvils>
Date: Sat, 19 May 2012 13:45:57 -0700 (PDT)
From: Hugh Dickins <hughd@...gle.com>
To: Sam Portolla <samportolla@...oo.com>
cc: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"aarcange@...hat.com" <aarcange@...hat.com>
Subject: Re: exit_mmap BUG_ON in 2.6.23
On Fri, 18 May 2012, Sam Portolla wrote:
> [please cc samPortolla@...oo.com on your replies, not subscribed to the linux-kernel mailer]
>
> Hi, I have read the thread on same issue in 3.1:
> but this is happening on earlier GNU linux version 2.6.23 for x86_64,
> which does not have THP (I believe), nor it has huge_memory.c.
> Is there a fix one of you experts could supply? Issue is not reproducible
> so far, but happened on a customer site. Some info below.
>
> kernel BUG at .../bfc/linux/kernel-2.6.x/mm/mmap.c:2049!
>
> Line 2049 is in exit_mmap():
>
> BUG_ON(mm->nr_ptes > (FIRST_USER_ADDRESS+PMD_SIZE-1)>>PMD_SHIFT);
>
> RIP: 0010:[<ffffffff80277840>] [<ffffffff80277840>] exit_mmap+0xf0/0x100
> [snip]
> Call Trace:
> [<ffffffff8022ee14>] mmput+0x44/0xd0
> [<ffffffff802340a1>] exit_mm+0x91/0x100
> [<ffffffff802347ea>] do_exit+0x17a/0x960
> [<ffffffff8023c4bc>] __dequeue_signal+0xec/0x1b0
> [<ffffffff80235048>] do_group_exit+0x38/0x90
> [<ffffffff8023e3c6>] get_signal_to_deliver+0x2d6/0x4b0
> [<ffffffff8020b69a>] do_notify_resume+0xaa/0x760
> [<ffffffff8020c818>] retint_signal+0x3d/0x85

I've checked back through old ChangeLogs, and (apart from a UserModeLinux
case) I don't see any fix for a BUG_ON(nr_ptes) issue in between 2.6.19
and the much later THP issue, which you're right to think cannot be yours.

But the 2.6.19 case, and one which a video driver writer had more recently,
were both caused by unrelated code zeroing beyond what it had allocated:
happening to zero part of a higher-level page table, making it impossible
for task exit to locate all the page tables (and pages) it had to free.

Though I can't be sure, these BUG_ON(nr_ptes) reports do seem perhaps
too infrequent to be caused by bad logic in mm itself: I suspect memory
corruption in your case too.
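
To illustrate how a stray zeroing leaves nr_ptes nonzero at exit, here is a
toy userspace model, not kernel code. Everything in it (struct toy_mm, the
toy_* helpers, the 8-entry table) is a hypothetical stand-in: teardown can
only free the lower-level tables that the upper level still points to, so a
single zeroed entry leaves the count off by one.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define UPPER_ENTRIES 8 /* toy size, not a real page-table geometry */

struct toy_mm {
    void *upper[UPPER_ENTRIES]; /* stands in for a higher-level page table */
    unsigned long nr_ptes;      /* lower-level tables currently allocated */
};

/* Allocate `count` lower-level tables and account for each in nr_ptes. */
static void toy_populate(struct toy_mm *mm, int count)
{
    assert(count <= UPPER_ENTRIES);
    memset(mm, 0, sizeof(*mm));
    for (int i = 0; i < count; i++) {
        mm->upper[i] = calloc(1, 64);
        assert(mm->upper[i] != NULL);
        mm->nr_ptes++;
    }
}

/* Simulate unrelated code zeroing beyond its own allocation.
 * Returns the orphaned pointer so a demo can free it afterwards. */
static void *toy_stray_zero(struct toy_mm *mm, int slot)
{
    void *lost = mm->upper[slot];
    mm->upper[slot] = NULL;
    return lost;
}

/* Free every table still reachable from the upper level.
 * Returns the leftover count: nonzero means tables could not be found. */
static unsigned long toy_teardown(struct toy_mm *mm)
{
    for (int i = 0; i < UPPER_ENTRIES; i++) {
        if (mm->upper[i] != NULL) {
            free(mm->upper[i]);
            mm->upper[i] = NULL;
            mm->nr_ptes--;
        }
    }
    return mm->nr_ptes;
}
```

In this model a clean populate/teardown cycle returns 0, while calling
toy_stray_zero() on one populated slot before teardown leaves a count of 1,
which is the kind of leftover the BUG_ON in exit_mmap() is there to catch.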

There's no clue here as to what the cause might be, I'm afraid.

Rebuilding your kernel with CONFIG_DEBUG_PAGEALLOC=y, and slab debugging
on, might shed more light: but that's probably not something you want to
get into on a customer site, for a problem only seen once or twice.

The best I can suggest is for you to change that BUG_ON to a WARN_ON,
so at least the kernel doesn't crash there, and you might gather more
information from each time it happens; but you'll probably leak pages,
and may very well crash soon for other reasons (e.g. when evicting an
inode cannot locate all the maps of its pages).
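
A rough sketch of that one-line change, written with userspace stand-ins so
that it compiles outside a kernel tree. The warn_on() helper, warn_count and
exit_mmap_check() are hypothetical names for illustration; in the real tree
the only edit would be BUG_ON to WARN_ON on the quoted line of exit_mmap().
The constants assume x86_64 (FIRST_USER_ADDRESS of 0, PMD_SHIFT of 21), for
which the threshold works out to 0, so any leftover nr_ptes trips the check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Assumed x86_64 values; the real definitions live in kernel headers. */
#define FIRST_USER_ADDRESS 0UL
#define PMD_SHIFT 21
#define PMD_SIZE (1UL << PMD_SHIFT)

/* Minimal stand-in for the one field the check reads. */
struct mm_struct {
    unsigned long nr_ptes;
};

static int warn_count; /* how many times the stand-in warning has fired */

/* Stand-in for the kernel's WARN_ON: report, then carry on. */
static int warn_on(int cond, const char *file, int line)
{
    if (cond) {
        warn_count++;
        fprintf(stderr, "WARNING: at %s:%d\n", file, line);
    }
    return cond;
}
#define WARN_ON(cond) warn_on(!!(cond), __FILE__, __LINE__)

/* The check from exit_mmap(), downgraded from a crash to a warning. */
static int exit_mmap_check(const struct mm_struct *mm)
{
    assert(mm != NULL);
    /* was: BUG_ON(mm->nr_ptes > (FIRST_USER_ADDRESS+PMD_SIZE-1)>>PMD_SHIFT); */
    return WARN_ON(mm->nr_ptes >
                   ((FIRST_USER_ADDRESS + PMD_SIZE - 1) >> PMD_SHIFT));
}
```

With these stand-ins an mm whose nr_ptes is 0 passes silently, and one with
nr_ptes of 1 logs a warning and returns, instead of stopping the machine.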

Hugh