Message-ID: <20130916182450.639084c6@skate>
Date: Mon, 16 Sep 2013 18:24:50 +0200
From: Thomas Petazzoni <thomas.petazzoni@...e-electrons.com>
To: Russell King - ARM Linux <linux@....linux.org.uk>
Cc: Willy Tarreau <w@....eu>, Andrew Lunn <andrew@...n.ch>,
Jason Cooper <jason@...edaemon.net>, netdev@...r.kernel.org,
Ethan Tuttle <ethan@...antuttle.com>,
Ezequiel Garcia <ezequiel.garcia@...e-electrons.com>,
Gregory Clément
<gregory.clement@...e-electrons.com>,
linux-arm-kernel@...ts.infradead.org
Subject: Re: mvneta: oops in __rcu_read_lock on mirabox
Russell,
On Mon, 16 Sep 2013 17:22:09 +0100, Russell King - ARM Linux wrote:
> One seemed to be a single bit error in an instruction inside the kernel
> image. The other was what seems to be an impossible abort.
>
> I still don't see how we could end up with a prefetch abort inside memset()
> due to the kernel domain being inaccessible, but still be able to get
> an oops out, especially when we dump out the memory for the faulting
> instruction by accessing that memory via that apparently inaccessible
> domain while running the code which dumps that memory also under this
> apparently inaccessible domain. If the domain containing the kernel
> really was inaccessible, the system would be completely dead.
>
> The only possibility I can come up with is that the abort was
> caused by something spurious happening at the hardware level causing
> corruption of the instruction TLB (corrupting the domain index stored
> in the I-TLB) or other CPU control hardware causing it to spuriously
> generate that fault.
>
> As the domain field in the page table L1 entries covers bit 8, and the
> single bit error with the instruction was also bit 8, maybe there's a
> design weakness on data line bit 8 causing marginal operation.
>
> To add to this, the abort given in this report gives an IFSR value of
> 0x409, which equates to "Synchronous parity error on memory access"
> in ARMv7. The other value (0x400) equates to "TLB conflict abort"
> which can only happen with LPAE support enabled... So this is just
> getting more weird!
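
For anyone following along, the two IFSR values quoted above can be
decoded by hand. What follows is only a minimal sketch, assuming the
ARMv7 short-descriptor IFSR layout (FS[4] in bit 10, FS[3:0] in bits
3:0) and the L1 descriptor domain field in bits [8:5]. The names
FAULT_STATUS, decode_ifsr and l1_domain are made up for illustration,
and the table is deliberately partial: it lists only the encodings
relevant to this thread.

```python
# Partial table of ARMv7 short-descriptor fault status (FS) encodings.
# Only the entries discussed in this thread are listed (assumption:
# short-descriptor format, i.e. not the LPAE long-descriptor layout).
FAULT_STATUS = {
    0b01001: "Domain fault, first level (section)",
    0b01011: "Domain fault, second level (page)",
    0b10000: "TLB conflict abort",
    0b11001: "Synchronous parity error on memory access",
}


def decode_ifsr(ifsr):
    """Return (fs, description) for a short-descriptor IFSR value."""
    # FS[4] lives in IFSR bit 10, FS[3:0] in IFSR bits 3:0.
    fs = (((ifsr >> 10) & 0x1) << 4) | (ifsr & 0xF)
    return fs, FAULT_STATUS.get(fs, "not in this partial table")


def l1_domain(descriptor):
    """Extract the domain field, bits [8:5], from an L1 descriptor."""
    return (descriptor >> 5) & 0xF


if __name__ == "__main__":
    for value in (0x409, 0x400):
        fs, text = decode_ifsr(value)
        print(f"IFSR {value:#05x}: FS={fs:#07b} {text}")
```

Since the domain field spans bits [8:5], bit 8 is its top bit, so
l1_domain() on a descriptor with only bit 8 set returns 8.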
Could this be caused by bit flips in the RAM due to bad timings,
overheating, or that kind of thing?
Thomas
--
Thomas Petazzoni, Free Electrons
Kernel, drivers, real-time and embedded Linux
development, consulting, training and support.
http://free-electrons.com