Message-ID: <BANLkTimB5gEZ2S=b9EiiWR-_u+o+wEPyjw@mail.gmail.com>
Date: Wed, 15 Jun 2011 23:54:13 -0700
From: Linus Torvalds <torvalds@...ux-foundation.org>
To: Alexander Graf <agraf@...e.de>
Cc: Benjamin Herrenschmidt <benh@...nel.crashing.org>,
linux-mm@...ck.org,
"linux-kernel@...r.kernel.org List" <linux-kernel@...r.kernel.org>
Subject: Re: Oops in VMA code
On Wed, Jun 15, 2011 at 11:20 PM, Alexander Graf <agraf@...e.de> wrote:
>
> On 16.06.2011, at 07:59, Linus Torvalds wrote:
>>
>> r26 has the value 0xc00090026236bbb0, and that "90" byte in the middle
>> there looks bogus. It's not a valid pointer any more, but if that "9"
>> had been a zero, it would have been.
>
> Please see my reply to Ben here.
Your reply to Ben seems to say that 0xc00000026236bbb0 wouldn't have
been a valid address, because you don't have that much memory.
But that's clearly not true. All the other registers have valid
pointers in them, and the stack pointer (r1) is c000000262987cd0, for
example. And that stack is clearly valid - if the kernel stack pointer
was corrupted, you'd never have gotten as far as reporting the oops.
So you may have only 8GB of RAM in that machine, but if so, there's
some empty unmapped physical space. Because clearly your RAM is _not_
limited to being mapped to below 0xc000000200000000.
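
The address argument above can be checked with a little arithmetic. This is only a sketch: it assumes the kernel linear mapping starts at 0xc000000000000000 (the thread does not state this), and `LINEAR_BASE` and `linear_offset` are names made up for the example. The two register values are the ones quoted in this thread.

```python
# Assumption: the kernel linear mapping starts at 0xc000000000000000, so
# subtracting that base from a kernel virtual address gives a physical offset.
LINEAR_BASE = 0xC000000000000000  # assumed base, not stated in the thread
GIB = 1 << 30

def linear_offset(vaddr):
    """Return the offset of vaddr above the assumed linear-map base."""
    return vaddr - LINEAR_BASE

r1 = 0xC000000262987CD0         # stack pointer from the oops
r26_fixed = 0xC00000026236BBB0  # r26 with the "90" byte zeroed

for name, value in (("r1", r1), ("r26 (fixed)", r26_fixed)):
    off = linear_offset(value)
    print(f"{name}: offset {off:#x} = {off / GIB:.2f} GiB, "
          f"above 8 GiB: {off > 8 * GIB}")
```

Under that assumption both values sit at roughly 9.5 GiB above the base, which is consistent with RAM being mapped above 0xc000000200000000.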
To recap: I'm pretty sure the memory corruption is just the "90" byte.
The rest of the pointer looks too much like a pointer to be otherwise.
Whether that's due to a two-bit error (unlikely) or a wild byte write
(or 16-bit write with zeroes) is hard to say. USUALLY when we have
wild pointer errors, the corruption is more than just a few bits, but
it could have been something that sets a few bits in software, and
just sets them using a stale pointer.
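
To make the "two-bit error or wild byte write" point concrete, here is a small sketch that XORs the reported r26 against the value it would have had with the "90" byte zeroed. The helper name `corrupted_bytes` is invented for this example, and the "expected" value is the guess discussed above, not something recovered from the machine.

```python
observed = 0xC00090026236BBB0  # r26 as reported in the oops
expected = 0xC00000026236BBB0  # the same value with the "90" byte zeroed (a guess)

def corrupted_bytes(a, b, width=8):
    """Return (byte_index, xor_value) for each differing byte; index 0 is the least significant."""
    diff = a ^ b
    return [(i, (diff >> (8 * i)) & 0xFF)
            for i in range(width)
            if (diff >> (8 * i)) & 0xFF]

diff = observed ^ expected
flipped_bits = bin(diff).count("1")  # number of bit positions that differ
print(f"xor = {diff:#018x}, bits flipped = {flipped_bits}, "
      f"bytes = {corrupted_bytes(observed, expected)}")
```

The difference is confined to a single byte (0x90, i.e. two set bits), which is why both a two-bit flip and a stray byte-sized store remain possible explanations.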
> Yup, so let's keep this documented for now. Actually, the more I think
> about it the more it looks like simple random memory corruption by
> someone else in the kernel - and that's basically impossible to track
> and will give completely different bugs next time around :(.
We've had several bugs found by the pattern of the corruption, so I
wouldn't say "impossible to track". Even if the next time ends up
being a completely different oops (because the corruption happened in
a totally different kind of data structure), it might be possible that
there's that same "90" byte pattern, for example.
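
As a rough illustration of looking for the same byte pattern across reports, the sketch below flags register values that would become plausible pointers if a single byte were zeroed, and tallies the stray byte values. Everything here is hypothetical: `LINEAR_BASE`, the 16 GiB `MAX_OFFSET` bound, and the helper names are assumptions for the example, and the only input data is the two values quoted in this thread.

```python
from collections import Counter

LINEAR_BASE = 0xC000000000000000  # assumed linear-map base
MAX_OFFSET = 16 << 30             # hypothetical upper bound on mapped RAM

def plausible(value):
    """True if value falls inside the assumed linear-map window."""
    return LINEAR_BASE <= value < LINEAR_BASE + MAX_OFFSET

def single_byte_suspects(value):
    """Yield (byte_index, byte_value) where zeroing that one byte makes value plausible."""
    if plausible(value):
        return  # already looks like a valid pointer, nothing to flag
    for i in range(8):
        byte = (value >> (8 * i)) & 0xFF
        if byte and plausible(value & ~(0xFF << (8 * i))):
            yield i, byte

def tally_patterns(register_values):
    """Count how often each stray byte value shows up across register dumps."""
    return Counter(byte
                   for v in register_values
                   for _, byte in single_byte_suspects(v))

# Only the two values quoted in this thread; registers from further
# oops reports would be appended here.
values = [0xC00090026236BBB0, 0xC000000262987CD0]
print(tally_patterns(values))
```

With a single report the tally is trivially one entry; the idea only becomes useful once registers from several independent oopses are fed in.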
But it needs more than one bug report to see what the pattern is.
Usually it takes a _lot_ more.
Linus