Message-ID: <20080422190901.GA1104@elte.hu>
Date: Tue, 22 Apr 2008 21:09:01 +0200
From: Ingo Molnar <mingo@...e.hu>
To: Jiri Slaby <jirislaby@...il.com>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>,
"Rafael J. Wysocki" <rjw@...k.pl>, paulmck@...ux.vnet.ibm.com,
David Miller <davem@...emloft.net>,
linux-kernel@...r.kernel.org, akpm@...ux-foundation.org,
linux-ext4@...r.kernel.org, herbert@...dor.apana.org.au,
Zdenek Kabelac <zdenek.kabelac@...il.com>,
"H. Peter Anvin" <hpa@...or.com>
Subject: Re: 2.6.25-git2: BUG: unable to handle kernel paging request at
ffffffffffffffff

* Ingo Molnar <mingo@...e.hu> wrote:

> > Yesterday I did 2 suspend/resumes after 1 hour of uptime and ran
> > git-status for a fraction of a second until it was killed. So I can
> > perfectly reproduce it when I suspend, resume and produce some io
> > load. I guess it's time to bisect 2.6.25-rc8-mm2 as I'm able to
> > reproduce it the best and haven't seen that bug in -rc8-mm1 for over
> > week of suspending and working.
>
> the most dangerous x86 change we added was the PAT stuff. Does it
> influence the crashes in any way if you boot with 'nopat' or if you
> disable CONFIG_X86_PAT=y into the .config?

note that you should only get full PAT (where in essence Linux takes
over control of the cache attributes via PTEs, instead of relying on
the BIOS-initialized MTRRs alone) with -mm or with x86.git applied.
I.e. x86 PAT might explain any -mm issue but not the upstream -git
issue.
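
A minimal sketch of what "cache attributes via PTEs" means on x86,
assuming the architectural bit layout of a 4 KiB page-table entry
(PWT = bit 3, PCD = bit 4, PAT = bit 7) and the power-on default of the
IA32_PAT MSR. The three bits form an index into the eight PAT slots and
the selected slot gives the memory type. pat_index() and memory_type()
are hypothetical helpers for illustration, not kernel functions, and the
WC-in-slot-1 value is only an example layout:

```python
# Memory-type encodings used in the slots of the IA32_PAT MSR.
PAT_TYPES = {0x00: "UC", 0x01: "WC", 0x04: "WT", 0x05: "WP", 0x06: "WB", 0x07: "UC-"}

# Power-on default of IA32_PAT: slots 0..3 = WB, WT, UC-, UC, repeated in 4..7.
PAT_MSR_DEFAULT = 0x0007040600070406

# Example layout (an assumption for illustration) with WC placed in slots 1 and 5.
PAT_MSR_WITH_WC = 0x0007010600070106

# Attribute bits of a 4 KiB page-table entry (large pages keep PAT in bit 12).
PTE_PWT = 1 << 3
PTE_PCD = 1 << 4
PTE_PAT = 1 << 7


def pat_index(pte: int) -> int:
    """Return the PAT slot (0..7) selected by the PAT, PCD and PWT bits."""
    pat = (pte >> 7) & 1
    pcd = (pte >> 4) & 1
    pwt = (pte >> 3) & 1
    return (pat << 2) | (pcd << 1) | pwt


def memory_type(pte: int, pat_msr: int = PAT_MSR_DEFAULT) -> str:
    """Return the PAT-selected memory type for a PTE.

    This is the PAT side only; the effective type also depends on the MTRRs.
    """
    slot = (pat_msr >> (8 * pat_index(pte))) & 0x07
    return PAT_TYPES.get(slot, "reserved")


if __name__ == "__main__":
    for bits in (0, PTE_PWT, PTE_PCD, PTE_PCD | PTE_PWT):
        print(pat_index(bits), memory_type(bits), memory_type(bits, PAT_MSR_WITH_WC))
```

With the default MSR value no slot holds write-combining, so a kernel
that wants WC mappings through the page tables has to reprogram the MSR
first.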

In upstream -git we don't have the second wave of the PAT changes applied
yet (the /dev/mem bits) so CONFIG_X86_PAT is not yet activated. (it's
only safe to enable if we have all the changes together and perfectly
control all cache attributes in the system)
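
A rough way to tell which situation a given test box is in, sketched in
Python under a few assumptions: the kernel config is available as text
(for example from a /boot/config-* file or /proc/config.gz, depending on
how the kernel was built and packaged) and the boot parameters come from
/proc/cmdline. pat_status() and its helpers are hypothetical names, and
the check says nothing about whether the CPU itself supports PAT:

```python
def pat_compiled_in(config_text: str) -> bool:
    """True if the config text has an active CONFIG_X86_PAT=y line.

    A disabled option shows up as '# CONFIG_X86_PAT is not set' instead.
    """
    return any(line.strip() == "CONFIG_X86_PAT=y"
               for line in config_text.splitlines())


def nopat_requested(cmdline: str) -> bool:
    """True if 'nopat' appears as a standalone boot parameter."""
    return "nopat" in cmdline.split()


def pat_status(config_text: str, cmdline: str) -> str:
    """Summarize the PAT setup implied by a config and a command line."""
    if not pat_compiled_in(config_text):
        return "not built in"
    if nopat_requested(cmdline):
        return "built in, disabled by nopat"
    return "built in, not disabled on the command line"


if __name__ == "__main__":
    # Sample inputs; on a real system these would be read from files.
    sample_config = "CONFIG_X86=y\nCONFIG_X86_PAT=y\n"
    print(pat_status(sample_config, "root=/dev/sda1 ro nopat"))
```

If the crashes reproduce with "not built in" or with nopat in effect,
the full PAT code is unlikely to be the direct cause.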

i.e. PAT complications here would not happen in the form of real cache
attribute conflicts [i.e. the lockups and corruptions cannot be due to
that] - but as side-effects on other code that it changes.

and most of the PAT failures we ever saw had different patterns anyway:
the leading failures were API rejections, and hence non-working Xorg or
non-working ioremap() in certain drivers. The worst-case scenario, early
in the PAT code's cycle, was a spontaneous triple fault - months ago.

the basis for the PAT changes was the hardening of the CPA code and its
general use for everything (such as DEBUG_PAGEALLOC). And much of that
happened and was finished in v2.6.25. Nothing conceptually new really
happened there - and even where we touched the code in .26 it happened
long ago and would have surfaced by now.

... but ... nothing can be excluded.

Ingo