Message-id: <47045725.1070900@shaw.ca>
Date: Wed, 03 Oct 2007 20:59:49 -0600
From: Robert Hancock <hancockr@...w.ca>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Pekka Enberg <penberg@...helsinki.fi>,
Neil Romig <neil@...ig.demon.co.uk>,
linux-kernel@...r.kernel.org, hyoshiok@...aclelinux.com,
Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: File corruption when using kernels 2.6.18+

Linus Torvalds wrote:
>
> On Wed, 3 Oct 2007, Pekka Enberg wrote:
>> On 10/3/07, Linus Torvalds <torvalds@...ux-foundation.org> wrote:
>>> I would bet that the reason the intel-optimized memcpy triggers this is
>>> that the non-temporal stores just means that you go out directly on the
>>> bus, and it probably just shows a weakness in the chipset or bus that
>>> doesn't show with the normal cacheline accesses.
>> But that should show up with memtest too, no?
>
> Not unless memtest uses non-temporal stores with the same (or similar)
> access patterns.
>
> The thing is, the CPU cache hides a *lot* of activity from the chipset,
> and changes the access patterns radically.
>
> With normal cached accesses, you'd normally see just the "fill cacheline"
> and "write out cacheline" pattern. With movnt, you'd see non-cacheline
> accesses to memory. If the chipset was tested under mostly normal loads,
> the movnt cases have been getting a lot less coverage.
>
> Now, I do agree that it certainly *can* be a CPU bug too. I doubt it,
> though.
>
> I'd check the power supply (brownouts cause random corruption, and it
> might have a "peak power pattern" thing to it), and it's worth re-seating
> any DIMM's etc. And it's definitely worth going into the BIOS setup screen
> and making sure that nothing is even close to debatable (ie take RAM
> timings down to non-aggressive levels, make sure bus frequencies and
> multipliers are not even close to borderline, etc etc).

I didn't see what CPU this was, but there is a nasty erratum on some
Athlon 64/Opteron processors that could be relevant. A while ago I was
trying to debug a problem someone else had reported (and which I could
reproduce on my system) where repeatedly doing huge memsets in userspace
(for which glibc uses non-temporal stores) would cause a system lockup
or crash. Amazingly enough, after I upgraded the CPU from my old
Athlon 64 3500+ to a new X2 4200+, the problem went away.
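
For anyone who wants to generate the same kind of traffic, here is a
minimal sketch of a fill loop built on 128-bit non-temporal stores. It is
not glibc's memset; stream_fill() and stream_fill_mismatches() are
hypothetical names made up for illustration, and the code assumes an
x86 compiler with SSE2 (it falls back to a plain memset elsewhere, which
of course does not exercise the streaming-store path).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* _mm_set1_epi8, _mm_stream_si128, _mm_sfence */
#endif

/*
 * Hypothetical helper, not glibc's memset: fill buf with byte c using
 * 128-bit non-temporal stores. buf must be 16-byte aligned and len must
 * be a multiple of 16.
 */
static void stream_fill(void *buf, int c, size_t len)
{
    assert(((uintptr_t)buf & 15) == 0 && len % 16 == 0);
#if defined(__SSE2__)
    __m128i v = _mm_set1_epi8((char)c);
    __m128i *p = (__m128i *)buf;
    size_t i;

    for (i = 0; i < len / 16; i++)
        _mm_stream_si128(p + i, v);   /* 128-bit streaming store (MOVNTDQ) */
    _mm_sfence();                     /* order the streaming stores before later stores */
#else
    memset(buf, c, len);              /* fallback: ordinary cached stores */
#endif
}

/*
 * Fill a fresh buffer 'rounds' times with a changing pattern and count
 * the bytes that read back wrong. Returns the mismatch count, or
 * (size_t)-1 if the allocation fails.
 */
static size_t stream_fill_mismatches(size_t len, int rounds)
{
    unsigned char *buf = aligned_alloc(16, len);
    size_t bad = 0, i;
    int r;

    if (!buf)
        return (size_t)-1;
    for (r = 0; r < rounds; r++) {
        unsigned char pattern = (unsigned char)(0xA5 ^ r);

        stream_fill(buf, pattern, len);
        for (i = 0; i < len; i++)
            if (buf[i] != pattern)
                bad++;
    }
    free(buf);
    return bad;
}
```

On a healthy system stream_fill_mismatches() should return 0 for any
buffer size. A non-zero count, or a lockup while it runs, would only
suggest that the streaming-store path is worth looking at; it would not
by itself tell a CPU erratum apart from a chipset or memory problem.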

At the time I looked into whether the workaround could be applied in
the kernel if the BIOS failed to apply it, but accesses to the MSR
mentioned in the erratum seemed to fail, so I don't know what the story
is there.

From
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25759.pdf:

Erratum 97: 128-Bit Streaming Stores May Cause Coherency Failure

Description

Under a specific set of internal pipeline conditions, stale
data may be left in the L1 cache when a 128-bit streaming store (MOVNT*)
to a writeback (WB) memory type misses in the L1 data cache and both L1
and L2 TLBs.

Potential Effect on System

Memory coherence failures leading to unpredictable operation.

Suggested Workaround

BIOS should set DC_CFG.DIS_CNV_WC_SSO (bit 3 of MSR 0xC001_1022). The
performance effects of setting this bit are limited to streaming stores
to the write-combining (WC) memory type, a case expected to rarely occur
in actual usage. No loss of performance occurs in the general case (WB
memory type).

This workaround must not be applied to processors prior to revision C0.
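
As a rough illustration of the workaround described above, the sketch
below checks bit 3 of MSR 0xC001_1022 from userspace. The MSR number and
bit position come from the erratum text; the function and macro names are
hypothetical, and reading through /dev/cpu/N/msr assumes the Linux msr
driver is loaded and the caller is root. The read may simply fail on some
parts. The sketch only reads the register and computes what the patched
value would be; it deliberately does not write it, given the revision
restriction in the erratum.

```c
#define _FILE_OFFSET_BITS 64   /* the MSR number is used as a file offset above 2^31 */
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define MSR_DC_CFG             0xC0011022u   /* DC_CFG, per the erratum text */
#define DC_CFG_DIS_CNV_WC_SSO  (1ull << 3)   /* bit 3, per the erratum text */

/* Returns 1 if the workaround bit is already set in a DC_CFG value. */
static int dc_cfg_workaround_set(uint64_t dc_cfg)
{
    return (dc_cfg & DC_CFG_DIS_CNV_WC_SSO) != 0;
}

/* Returns the DC_CFG value with the workaround bit set, other bits untouched. */
static uint64_t dc_cfg_apply_workaround(uint64_t dc_cfg)
{
    return dc_cfg | DC_CFG_DIS_CNV_WC_SSO;
}

/*
 * Read one MSR on the given CPU through the msr driver, where the file
 * offset selects the MSR number. Returns 0 on success, -1 on any failure
 * (driver not loaded, no permission, or the CPU rejecting the access).
 */
static int read_msr(int cpu, uint32_t msr, uint64_t *val)
{
    char path[64];
    ssize_t n;
    int fd;

    assert(val != NULL);
    snprintf(path, sizeof(path), "/dev/cpu/%d/msr", cpu);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    n = pread(fd, val, sizeof(*val), (off_t)msr);
    close(fd);
    return n == (ssize_t)sizeof(*val) ? 0 : -1;
}
```

A caller would do something like read_msr(0, MSR_DC_CFG, &v) and, if that
succeeds, pass v to dc_cfg_workaround_set() to see whether the BIOS
already applied the workaround on CPU 0.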
--
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from hancockr@...pamshaw.ca
Home Page: http://www.roberthancock.com/