Message-ID: <3908561D78D1C84285E8C5FCA982C28F192F6DE2@ORSMSX104.amr.corp.intel.com>
Date:	Thu, 31 May 2012 20:52:21 +0000
From:	"Luck, Tony" <tony.luck@...el.com>
To:	Borislav Petkov <bp@...64.org>,
	Steven Rostedt <rostedt@...dmis.org>
CC:	Mauro Carvalho Chehab <mchehab@...hat.com>,
	Linux Edac Mailing List <linux-edac@...r.kernel.org>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Aristeu Rozanski <arozansk@...hat.com>,
	Doug Thompson <norsk5@...oo.com>,
	Frederic Weisbecker <fweisbec@...il.com>,
	Ingo Molnar <mingo@...hat.com>
Subject: RE: [PATCH] RAS: Add a tracepoint for reporting memory controller
 events

> It could be very quiet (i.e., machine runs with no errors) and it could
> have bursts where it reports a large number of errors back-to-back
> depending on access patterns, DIMM health, temperature, sea level and at
> least a bunch more factors.

Yes - the normal case is a few errors from stray neutrons ... perhaps
a few per month, maybe on a very big system a few per hour.  When
something breaks, especially if it affects a wide range of memory
addresses, you will see a storm of errors.

> So I can imagine buffers filling up suddenly and fast, and userspace
> having a hard time consuming them in a timely manner.

But I'm wondering what agent is going to be reporting all these
errors.  Intel has CMCI - so you can get a storm of interrupts,
each of which would generate a trace record ... but we are working
on a patch to turn off CMCI when a storm is detected. AMD doesn't
have CMCI, so errors are only reported by polling - and we have a
maximum poll rate which is quite low by trace standards (even
when multiplied by NR_CPUS).
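
(Illustrative only: here is a rough userspace sketch of the sort of
storm logic I mean - count events in a one-second window and fall
back to polling once they pass a threshold.  All names and thresholds
below are made up; this is not the actual patch.)

/*
 * Hypothetical sketch, not the real CMCI code: count corrected-error
 * interrupts in a one-second window; past a threshold, stop tracing
 * from the interrupt path and let a slow poller take over.
 */
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

#define CMCI_STORM_THRESHOLD	15		/* events per window */
#define STORM_WINDOW_NS		1000000000ULL	/* one second */

static unsigned long long window_start_ns;
static unsigned int events_in_window;
static bool polling_mode;	/* true once the storm logic has kicked in */

static unsigned long long now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (unsigned long long)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

/* Called from the (simulated) CMCI handler for every corrected error. */
static void cmci_event(void)
{
	unsigned long long now = now_ns();

	if (now - window_start_ns > STORM_WINDOW_NS) {
		window_start_ns = now;		/* new window, reset count */
		events_in_window = 0;
	}

	if (++events_in_window > CMCI_STORM_THRESHOLD && !polling_mode) {
		polling_mode = true;		/* stop taking interrupts ... */
		printf("storm detected: switching to polling\n");
		return;				/* ... poller takes over */
	}

	if (!polling_mode)
		printf("trace: corrected memory error\n");
}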

Will EDAC drivers loop over some chipset registers, blasting
out huge numbers of trace records?  That seems just as bad
for system throughput as a CMCI storm - and just as useless.

General principle: if there are very few errors happening, then
it is important to log every single one of them.  If there are
so many that we can't keep up, then we must sample at some level,
and we might as well do that at the generation point.
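
(Again purely illustrative - a sketch of sampling at the generation
point: log everything while a small per-second budget lasts, then
emit only one error in every N.  Names and numbers are invented.)

/*
 * Hypothetical generation-point sampling: when the system is quiet,
 * every error is logged; once the per-second budget is exhausted,
 * keep only one record in SAMPLE_ONE_IN so a storm cannot swamp
 * the trace buffer.
 */
#include <stdio.h>
#include <time.h>

#define FULL_LOG_BUDGET	10	/* errors logged in full per second */
#define SAMPLE_ONE_IN	100	/* sampling ratio once over budget */

static time_t current_second;
static unsigned int logged_this_second;
static unsigned long overflow;	/* errors seen beyond the budget */

static void report_memory_error(unsigned long addr)
{
	time_t now = time(NULL);

	if (now != current_second) {
		current_second = now;
		logged_this_second = 0;
	}

	if (logged_this_second < FULL_LOG_BUDGET) {
		logged_this_second++;	/* quiet case: log every error */
		printf("trace: corrected error at 0x%lx\n", addr);
		return;
	}

	if (++overflow % SAMPLE_ONE_IN == 0)	/* storm: 1-in-N sample */
		printf("trace: corrected error at 0x%lx (%lu seen since budget ran out)\n",
		       addr, overflow);
}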

-Tony
