Message-ID: <20090430115741.GA23634@aftab>
Date: Thu, 30 Apr 2009 13:57:41 +0200
From: Borislav Petkov <borislav.petkov@....com>
To: Andi Kleen <andi@...stfloor.org>
CC: akpm@...ux-foundation.org, greg@...ah.com, mingo@...e.hu,
tglx@...utronix.de, hpa@...or.com, dougthompson@...ssion.com,
linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 00/21 v2] amd64_edac: EDAC module for AMD64

Hi,

On Wed, Apr 29, 2009 at 09:30:31PM +0200, Andi Kleen wrote:
> Borislav Petkov <borislav.petkov@....com> writes:
>
> > Hi,
> >
> > thanks to all reviewers of the previous submission, here is the second
> > version of this series.
>
> The classic problem of the previous versions of these patches was that
> they consume the same error registers (even if using pci config versus
> msrs as access methods) as the kernel machine check poll/threshold
> interrupt code. And with two logging agents racing on the same
> registers you will always get junk results. Typically with threshold
> enabled the mce code wins the race. I suspect this patchkit has
> exactly the same fundamental design problem. EDAC really is not
> particularly fitting for integrated memory controllers that report
> their errors using standard machine check events.

OK, how about we remove the MSR/PCI config space reading bits and leave
that task solely to the mce core? Then, iff you have edac turned on in
Kconfig, the mce code delivers the needed error info to edac which, in
turn, decodes the error, does the mapping to DIMM blocks, supplies a
DRAM error injection facility for testing purposes, and similar things.
That way you have both and they don't overlap in functionality.
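
To make the proposed split concrete, here is a rough userspace sketch of the hand-off, not actual kernel code. All names in it (struct mce_record, mce_register_decoder(), mce_log_error(), edac_decode()) are hypothetical and only illustrate the idea: the mce side is the sole reader of the error registers, and it passes an already-read record to a decoder only if one has been registered.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical record: register values read exactly once by the mce side. */
struct mce_record {
	int      bank;    /* machine check bank that logged the error */
	uint64_t status;  /* raw status value (made-up layout) */
	uint64_t addr;    /* raw error address value */
};

/* Hypothetical decoder hook; stays NULL when no decoder (edac) is present. */
typedef void (*mce_decoder_fn)(const struct mce_record *rec);
mce_decoder_fn registered_decoder;

void mce_register_decoder(mce_decoder_fn fn)
{
	assert(fn != NULL);
	registered_decoder = fn;
}

void mce_unregister_decoder(void)
{
	registered_decoder = NULL;
}

/*
 * Called by the mce side after it has read the registers. The decoder
 * never touches the hardware itself, so there is only one logging agent.
 * Returns 1 if a decoder consumed the record, 0 otherwise.
 */
int mce_log_error(const struct mce_record *rec)
{
	if (!registered_decoder)
		return 0;
	registered_decoder(rec);
	return 1;
}

/* Decoder side: bookkeeping only, standing in for the DIMM mapping. */
int edac_decoded_count;
uint64_t edac_last_addr;

void edac_decode(const struct mce_record *rec)
{
	edac_decoded_count++;
	edac_last_addr = rec->addr;
	printf("edac: bank %d status 0x%016" PRIx64 " addr 0x%016" PRIx64 "\n",
	       rec->bank, rec->status, rec->addr);
}
```

With no decoder registered, mce_log_error() simply returns 0 and the mce side carries on with its own logging. After mce_register_decoder(edac_decode), each record is passed on for decoding without edac ever touching the registers.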
By the way, I think there's a similar proposal from Red Hat for letting
mce and edac talk to each other, so I think this could be a viable thing
to try.

> -Andi (who thinks all of this decoding should be in user space anyways)

Think of a big data center with thousands of 2-, 4- and 8-socket blades
and the admin collecting mce output and running around decoding the
errors on his workstation. Even worse, the blades have different DIMM
configurations due to hw upgrades/newer machines. I'd much rather have
the complete decoding done in the kernel, where all the information
needed for proper decoding is present, and have the error land in syslog
or some other monitored buffer instead of reconstructing it in userspace.

Thanks.

--
Regards/Gruss,
Boris.
Operating | Advanced Micro Devices GmbH
System | Karl-Hammerschmidt-Str. 34, 85609 Dornach b. München, Germany
Research | Geschäftsführer: Jochen Polster, Thomas M. McCoy, Giuliano Meroni
Center | Sitz: Dornach, Gemeinde Aschheim, Landkreis München
(OSRC) | Registergericht München, HRB Nr. 43632