Message-ID: <20120601091026.GC20959@aftab.osrc.amd.com>
Date: Fri, 1 Jun 2012 11:10:26 +0200
From: Borislav Petkov <bp@...64.org>
To: "Luck, Tony" <tony.luck@...el.com>
Cc: Borislav Petkov <bp@...64.org>,
Steven Rostedt <rostedt@...dmis.org>,
Mauro Carvalho Chehab <mchehab@...hat.com>,
Linux Edac Mailing List <linux-edac@...r.kernel.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Aristeu Rozanski <arozansk@...hat.com>,
Doug Thompson <norsk5@...oo.com>,
Frederic Weisbecker <fweisbec@...il.com>,
Ingo Molnar <mingo@...hat.com>
Subject: Re: [PATCH] RAS: Add a tracepoint for reporting memory controller events

On Thu, May 31, 2012 at 08:52:21PM +0000, Luck, Tony wrote:
> > It could be very quiet (i.e., machine runs with no errors) and it could
> > have bursts where it reports a large number of errors back-to-back
> > depending on access patterns, DIMM health, temperature, sea level and at
> > least a bunch more factors.
>
> Yes - the normal case is a few errors from stray neutrons ... perhaps
> a few per month, maybe on a very big system a few per hour. When something
> breaks, especially if it affects a wide range of memory addresses, then
> you will see a storm of errors.
IOW, when the sh*t hits the fan :-)
> > So I can imagine buffers filling up suddenly and fast, and userspace
> > having hard time consuming them in a timely manner.
>
> But I'm wondering what agent is going to be reporting all these
> errors. Intel has CMCI - so you can get a storm of interrupts
> which would each generate a trace record ... but we are working
> on a patch to turn off CMCI if a storm is detected.
Yeah, about that: what are you guys doing about losing CECCs (correctable
ECC errors) when throttling is on? I'm assuming there's no way around it.
> AMD doesn't have CMCI, so errors just report from polling
It does; look at <arch/x86/kernel/cpu/mcheck/mce_amd.c>. That's the error
thresholding. We were talking about having an APIC interrupt fire at
_every_ CECC, but I haven't tested how the software would behave when
the hw spits out an overwhelming number of errors.
> - and we have a
> maximum poll rate which is quite low by trace standards (even
> when multiplied by NR_CPUS).
>
> Will EDAC drivers loop over some chipset registers blasting
> out huge numbers of trace records ... that seems just as bad
> for system throughput as a CMCI storm. And just as useless.
Why useless?
I don't know, but we need to be as slim as possible on the reporting side
for future use cases like that.
Also, we probably want to proactively do something about such storms,
such as offlining pages or disabling some hardware components, so that
they subside.
Switching to polling mode IMHO only treats the symptom, not the
underlying cause.
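To make the "do something about it" part concrete, here is a rough
userspace sketch of per-page error counting that flags a page as an
offlining candidate once it crosses a threshold. Everything in it
(struct page_err_table, page_err_record(), TRACKED_PAGES, the threshold
field) is made up for illustration and is not existing kernel or EDAC
code; how the page actually gets offlined is deliberately left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TRACKED_PAGES 64	/* hypothetical table size */

/* One tracked page frame and its corrected-error count. */
struct page_err {
	uint64_t pfn;
	uint32_t count;
	bool used;
};

/* Hypothetical tracking table; threshold is the count that flags a page. */
struct page_err_table {
	struct page_err slot[TRACKED_PAGES];
	uint32_t threshold;
};

/*
 * Record one corrected error on @pfn.
 * Returns true exactly once per page: when its count reaches the
 * threshold, i.e. when the caller should consider offlining it.
 */
static bool page_err_record(struct page_err_table *t, uint64_t pfn)
{
	struct page_err *free_slot = NULL;

	for (size_t i = 0; i < TRACKED_PAGES; i++) {
		struct page_err *p = &t->slot[i];

		if (p->used && p->pfn == pfn) {
			p->count++;
			return p->count == t->threshold;
		}
		if (!p->used && !free_slot)
			free_slot = p;
	}

	/* Table full: this sketch simply stops tracking new pages. */
	if (!free_slot)
		return false;

	free_slot->used = true;
	free_slot->pfn = pfn;
	free_slot->count = 1;
	return t->threshold == 1;
}
```

With a threshold of 3, the third error on the same pfn returns true and
later errors on that page return false again, so the caller acts once
per page instead of on every event of the storm.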
> General principle: If there are very few errors happening then it is
> important to log every single one of them.
Absolutely.
> If there are so many that we can't keep up, then we must sample at
> some level, and we might as well do that at generation point.
Yes, and then take action to recover and stop the storm.
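Sampling at the generation point could look roughly like the sketch
below: log every event up to a burst limit per time window, then only
one in N, and keep a count of what was dropped so the loss is visible.
The names (struct err_limiter, err_limiter_should_log(), the burst and
sample_every knobs) are assumptions for the sake of the example, not an
existing interface, and timestamps are passed in by the caller so the
logic stays independent of any particular clock.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical limiter state; none of these names come from the kernel. */
struct err_limiter {
	uint64_t window_start_ns;	/* start of the current window */
	uint64_t window_len_ns;		/* length of one window */
	uint32_t burst;			/* events logged in full per window */
	uint32_t sample_every;		/* past the burst, log 1 of every N (0 = none) */
	uint32_t seen;			/* events seen in the current window */
	uint64_t dropped;		/* total events suppressed so far */
};

static void err_limiter_init(struct err_limiter *l, uint64_t window_len_ns,
			     uint32_t burst, uint32_t sample_every)
{
	l->window_start_ns = 0;
	l->window_len_ns = window_len_ns;
	l->burst = burst;
	l->sample_every = sample_every;
	l->seen = 0;
	l->dropped = 0;
}

/* Returns true if the event arriving at @now_ns should be reported. */
static bool err_limiter_should_log(struct err_limiter *l, uint64_t now_ns)
{
	/* Start a fresh window once the current one has expired. */
	if (now_ns - l->window_start_ns >= l->window_len_ns) {
		l->window_start_ns = now_ns;
		l->seen = 0;
	}

	l->seen++;

	/* Quiet case: few errors, so every single one is logged. */
	if (l->seen <= l->burst)
		return true;

	/* Storm case: keep only every sample_every-th event past the burst. */
	if (l->sample_every && (l->seen - l->burst) % l->sample_every == 0)
		return true;

	l->dropped++;
	return false;
}
```

The dropped counter is the piece that matters for reporting: a sampled
record can carry it along so userspace knows how many events it did not
see, and a persistently non-zero value is the cue to take recovery
action instead of just logging less.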
--
Regards/Gruss,
Boris.
Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551