Message-ID: <45878E8A.6000506@yahoo.com.au>
Date: Tue, 19 Dec 2006 18:02:34 +1100
From: Nick Piggin <nickpiggin@...oo.com.au>
To: Dave Jones <davej@...hat.com>
CC: Andrew Morton <akpm@...l.org>, linux-kernel@...r.kernel.org,
Hugh Dickins <hugh@...itas.com>,
Chris Rankin <cj.rankin@...world.com>
Subject: Re: -mm merge plans for 2.6.20
Dave Jones wrote:
> On Tue, Dec 19, 2006 at 04:20:37PM +1100, Nick Piggin wrote:
> > Dave Jones wrote:
> >
> > > Eeek! page_mapcount(page) went negative! (-2)
> >
> > Hmm, probably happened once before, too.
>
> You're right. Going back further in the log, I noticed
> that it had happened again exactly at the time that cron restarted vpnc.
> The first time, the flags were different..
>
> Dec 4 00:01:03 firewall kernel: Eeek! page_mapcount(page) went negative! (-1)
> Dec 4 00:01:03 firewall kernel: page->flags = 400
> Dec 4 00:01:03 firewall kernel: page->count = 1
> Dec 4 00:01:03 firewall kernel: page->mapping = 00000000
Still reserved, with a NULL mapping. I'd say it could be the same page.
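
For anyone reading these dumps by hand, here is a minimal userspace sketch of how the page->flags values in this thread (400 and 404, printed in hex) break down into flag names. The bit numbers in the table are an assumption: they are what I believe include/linux/page-flags.h used around 2.6.19, so check them against your own tree. The helper name decode_page_flags is hypothetical and is not a kernel function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* ASSUMPTION: bit positions as believed to be defined in 2.6.19's
 * include/linux/page-flags.h. Verify against the tree you are debugging. */
struct flag_name {
    unsigned int bit;
    const char *name;
};

static const struct flag_name page_flag_names[] = {
    {  0, "PG_locked" },     {  1, "PG_error" },
    {  2, "PG_referenced" }, {  3, "PG_uptodate" },
    {  4, "PG_dirty" },      {  5, "PG_lru" },
    {  6, "PG_active" },     {  7, "PG_slab" },
    {  8, "PG_checked" },    {  9, "PG_arch_1" },
    { 10, "PG_reserved" },   { 11, "PG_private" },
    { 12, "PG_writeback" },  { 13, "PG_nosave" },
    { 14, "PG_compound" },   { 15, "PG_swapcache" },
};

/* Hypothetical helper: write a '|'-separated list of the known flags set
 * in 'flags' into buf (lowest bit first) and return buf. Bits that are
 * not in the table above are silently ignored. */
static const char *decode_page_flags(unsigned long flags, char *buf, size_t len)
{
    size_t i;

    assert(buf != NULL && len > 0);
    buf[0] = '\0';

    for (i = 0; i < sizeof(page_flag_names) / sizeof(page_flag_names[0]); i++) {
        size_t used = strlen(buf);
        int n;

        if (!(flags & (1UL << page_flag_names[i].bit)))
            continue;

        n = snprintf(buf + used, len - used, "%s%s",
                     used ? "|" : "", page_flag_names[i].name);
        if (n < 0 || (size_t)n >= len - used)
            break; /* output truncated; stop appending */
    }
    return buf;
}
```

Under those assumed bit numbers, passing 0x400 yields PG_reserved alone and 0x404 yields PG_referenced|PG_reserved, which matches the reading of the two reports above.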
>
> > > page->flags = 404
> >
> > What's that? PG_referenced|PG_reserved? So I'd say it is likely
> > that some driver has got its refcounting wrong.
>
> At the time that it bit me, here's what was loaded..
>
> tun ipt_MASQUERADE iptable_nat ip_nat ipt_LOG xt_limit ipv6
> ip_conntrack_netbios_ns ipt_REJECT xt_state ip_conntrack nfnetlink xt_tcpudp
> iptable_filter ip_tables x_tables video sbs i2c_ec button battery asus_acpi ac
> parport_pc lp parport pcspkr ide_cd i2c_viapro i2c_core cdrom 3c59x via_rhine
> via_ircc mii irda crc_ccitt serio_raw dm_snapshot dm_zero dm_mirror dm_mod ext3
> jbd ehci_hcd ohci_hcd uhci_hcd
>
> The scary ones (i2c, irda) weren't in use at all, and had never been opened afaik,
> so the potential for those to be corrupting memory is slim, but not out of the
> question. (Why the hell asus_acpi is loaded is a mystery, this isn't an Asus,
> or a laptop. Probably dumb initscripts).
OK that could be useful if I do some grepping and see which ones are
setting PG_reserved.
> > And I see we've got another report for 2.6.19.1 from Chris, which
> > is equally vague.
>
> I'll be moving that box to 2.6.19.x at some point real soon, so I'll holler
> if I see it again on a later kernel.
>
> > IMO the pattern is much too consistent to be able to attribute
> > them all to hardware problems. And considering it takes so long
> > for these things to appear, can we get something like the attached
> > patch upstream at least until we manage to stamp them out?
>
> Sounds like a good idea to me.
>
> ACKed-by: Dave Jones <davej@...hat.com>
Thanks.
>
> > Any other debugging info we can add?
>
> Would it be useful to print the pfn of the page ?
> In cases like mine, where it bit twice before it killed the box, it
> might be interesting to see if it's always the same page. Not sure
> what that would prove/disprove though.
Might help. I guess the site where it is allocated from might be
another one, although I'm hoping that if we know what ->nopage is
being used then we'll be able to track it. OTOH it may be using
remap_pfn_range from fops->mmap, rather than nopage... I wonder
how we could get at that info? vma->vm_file->f_op->mmap?
--
SUSE Labs, Novell Inc.
View attachment "mm-rmap-debug-more.patch" of type "text/plain" (4402 bytes)