linux-kernel - RE: [PATCH] RAS: Add a tracepoint for reporting memory controller events

Message-ID: <3908561D78D1C84285E8C5FCA982C28F192F6DE2@ORSMSX104.amr.corp.intel.com>
Date:	Thu, 31 May 2012 20:52:21 +0000
From:	"Luck, Tony" <tony.luck@...el.com>
To:	Borislav Petkov <bp@...64.org>,
	Steven Rostedt <rostedt@...dmis.org>
CC:	Mauro Carvalho Chehab <mchehab@...hat.com>,
	Linux Edac Mailing List <linux-edac@...r.kernel.org>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Aristeu Rozanski <arozansk@...hat.com>,
	Doug Thompson <norsk5@...oo.com>,
	Frederic Weisbecker <fweisbec@...il.com>,
	Ingo Molnar <mingo@...hat.com>
Subject: RE: [PATCH] RAS: Add a tracepoint for reporting memory controller events

> It could be very quiet (i.e., machine runs with no errors) and it could
> have bursts where it reports a large number of errors back-to-back
> depending on access patterns, DIMM health, temperature, sea level and at
> least a bunch more factors.

Yes - the normal case is a few errors from stray neutrons ...  perhaps
a few per month, maybe on a very big system a few per hour.  When something
breaks, especially if it affects a wide range of memory addresses, then
you will see a storm of errors.

> So I can imagine buffers filling up suddenly and fast, and userspace
> having hard time consuming them in a timely manner.

But I'm wondering what agent is going to be reporting all these
errors.  Intel has CMCI - so you can get a storm of interrupts
which would each generate a trace record ... but we are working
on a patch to turn off CMCI if a storm is detected. AMD doesn't
have CMCI, so errors are only reported by polling - and we have a
maximum poll rate which is quite low by trace standards (even
when multiplied by NR_CPUS).
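
To make the storm idea concrete, here is a minimal sketch of one way
a detector could work: count interrupts in a time window, fall back
to polling when the count passes a threshold, and go back to
interrupts after a quiet poll. This is illustrative only and is not
the patch mentioned above; every name in it (storm_detector,
storm_on_interrupt, storm_on_poll, the window and threshold
parameters) is hypothetical, and the timestamps are passed in by the
caller rather than read from any real clock.

```c
#include <stdint.h>

/* Hypothetical sketch: none of these names are existing kernel interfaces. */
enum report_mode { MODE_INTERRUPT, MODE_POLL };

struct storm_detector {
	uint64_t window_start_ns;	/* start of the current counting window */
	uint64_t window_ns;		/* window length (assumed tunable) */
	uint32_t threshold;		/* interrupts tolerated per window */
	uint32_t count;			/* interrupts seen in this window */
	enum report_mode mode;
};

static void storm_init(struct storm_detector *d, uint64_t window_ns,
		       uint32_t threshold)
{
	d->window_start_ns = 0;
	d->window_ns = window_ns;
	d->threshold = threshold;
	d->count = 0;
	d->mode = MODE_INTERRUPT;
}

/* Call for each error interrupt; returns the mode to use from now on. */
static enum report_mode storm_on_interrupt(struct storm_detector *d,
					   uint64_t now_ns)
{
	/* Start a fresh window once the old one has expired. */
	if (now_ns - d->window_start_ns >= d->window_ns) {
		d->window_start_ns = now_ns;
		d->count = 0;
	}
	/* Too many interrupts in one window: treat it as a storm. */
	if (++d->count > d->threshold)
		d->mode = MODE_POLL;
	return d->mode;
}

/* Call after each poll; a poll that finds nothing ends the storm. */
static enum report_mode storm_on_poll(struct storm_detector *d,
				      uint32_t errors_found, uint64_t now_ns)
{
	if (d->mode == MODE_POLL && errors_found == 0) {
		d->mode = MODE_INTERRUPT;
		d->window_start_ns = now_ns;
		d->count = 0;
	}
	return d->mode;
}
```

In this sketch the caller would mask the interrupt source when the
returned mode is MODE_POLL and unmask it when it returns to
MODE_INTERRUPT.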

Will EDAC drivers loop over some chipset registers blasting
out huge numbers of trace records? That seems just as bad
for system throughput as a CMCI storm, and just as useless.

General principle: If there are very few errors happening then
it is important to log every single one of them.  If there are
so many that we can't keep up, then we must sample at some level,
and we might as well do that at generation point.
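
The principle above can be sketched roughly as follows: log every
event while the rate is under a per-window budget, and once over
budget log only one event in N, carrying along a count of the ones
that were skipped. This is an assumption-laden illustration, not an
existing EDAC or tracing interface; err_sampler, its fields and the
budget/sample_every parameters are all made-up names, and the caller
supplies the timestamps.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sketch: none of these names are existing kernel interfaces. */
struct err_sampler {
	uint64_t window_start_ns;	/* start of the current window */
	uint64_t window_ns;		/* window length (assumed tunable) */
	uint32_t budget;		/* events logged in full per window */
	uint32_t sample_every;		/* over budget, log 1 event in N */
	uint32_t seen;			/* events seen in this window */
	uint32_t dropped;		/* events skipped since last logged one */
};

static void err_sampler_init(struct err_sampler *s, uint64_t window_ns,
			     uint32_t budget, uint32_t sample_every)
{
	s->window_start_ns = 0;
	s->window_ns = window_ns;
	s->budget = budget;
	s->sample_every = sample_every ? sample_every : 1; /* avoid % 0 */
	s->seen = 0;
	s->dropped = 0;
}

/*
 * Decide at the generation point whether this event gets a record.
 * On true, *skipped holds how many events were dropped since the
 * previous logged one, so the consumer can still estimate the total.
 */
static bool err_sampler_should_log(struct err_sampler *s, uint64_t now_ns,
				   uint32_t *skipped)
{
	/* A new window restores the full budget. */
	if (now_ns - s->window_start_ns >= s->window_ns) {
		s->window_start_ns = now_ns;
		s->seen = 0;
	}
	s->seen++;

	/* Few errors: log every one. Too many: log 1 in sample_every. */
	if (s->seen <= s->budget ||
	    (s->seen - s->budget) % s->sample_every == 0) {
		*skipped = s->dropped;
		s->dropped = 0;
		return true;
	}
	s->dropped++;
	return false;
}
```

With a budget of 3 and sample_every of 4, the first three events in
a window are all logged, the next three are skipped, and the seventh
is logged with a skipped count of 3.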

-Tony