linux-kernel - Re: PROBLEM: memory corrupting bug, bisected to 6dda9d55

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20101018213348.10281.qmail@kosh.dhis.org>
Date:	Mon, 18 Oct 2010 16:33:48 -0500 (GMT+5)
From:	pacman@...h.dhis.org
To:	benh@...nel.crashing.org (Benjamin Herrenschmidt)
Cc:	mel@....ul.ie (Mel Gorman), linux-mm@...ck.org,
	akpm@...ux-foundation.org (Andrew Morton),
	linuxppc-dev@...ts.ozlabs.org, linux-kernel@...r.kernel.org
Subject: Re: PROBLEM: memory corrupting bug, bisected to 6dda9d55

Benjamin Herrenschmidt writes:
> 
> You can do something fun... like a timer interrupt that peeks at those
> physical addresses from the linear mapping for example, and try to find
> out "when" they get set to the wrong value (you should observe the load
> from disk, then the corruption, unless they end up being loaded
> incorrectly (ie. dma coherency problem ?) ...

I'm headed toward something like that. Maybe not a timer, maybe a "check it
every time the kernel is entered". But first I have to work out exactly when
the disk load completes so I know when to start checking.

> 
> >From there, you might be able to close onto the culprit a bit more, for
> example, try using the DABR register to set data access breakpoints
> shortly before the corruption spot. AFAIK, On those old 32-bit CPUs, you
> can set whether you want it to break on a real or a virtual address.

I thought of that, but as far as I can tell, this CPU doesn't have DABR.
/proc/cpuinfo
processor	: 0
cpu		: 7447/7457
clock		: 999.999990MHz
revision	: 1.1 (pvr 8002 0101)
bogomips	: 66.66
timebase	: 33333333
platform	: CHRP
model		: Pegasos2
machine		: CHRP Pegasos2
Memory		: 512 MB

My next thought was: right after the correct value appears in memory, unmap
the page from the kernel and let it Oops when it tries to write there. Then I
found out that the kernel is using BATs instead of page tables for its own
view of memory. Booting with "nobats" completely changes the memory usage
pattern (probably because it's allocating a lot of pages to hold PTEs that it
didn't need before)

> 
> You can also sprinkle tests for the page content through the code if
> that doesn't work to try to "close in" on the culprit (for example if
> it's a case of stray DMA, like a network driver bug or such).

No network drivers are loaded when this happens.

-- 
Alan Curry
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/