Date:	Fri, 11 May 2012 17:02:43 +0000
From:	"Luck, Tony" <tony.luck@...el.com>
To:	Mauro Carvalho Chehab <mchehab@...hat.com>
CC:	Linux Edac Mailing List <linux-edac@...r.kernel.org>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Doug Thompson <norsk5@...oo.com>,
	Steven Rostedt <rostedt@...dmis.org>,
	Frederic Weisbecker <fweisbec@...il.com>,
	Ingo Molnar <mingo@...hat.com>
Subject: RE: [PATCH v.23-2] RAS: use tracepoint to handle hw issues

> For example:
>
> mc_event: Corrected error:memory read on memory stick "DIMM_1A" (mc:0 channel:0 slot:0 page:0x586b6e offset:0xa66 grain:32 syndrome:0x0 area:DMA)

This is looking so much better.

I looked through your examples from the drivers of the text we might see
in the "memory read" position ... and agree that it would be a lot of
work to make them all come up with grammatically clean messages, especially
for all the poorly documented (or undocumented) "default/unknown/..." cases.

Back to my "does the casual user really need to know" soapbox. What different
actions do we expect a user to take when we tell them "read error" or "write
error" or "unknown error"?  I'm beginning to think that this belongs inside
the brackets! Perhaps as:  type:"memory read"?

Then we'd have:

mc_event: Corrected error on memory stick "DIMM_1A" (bunch of stuff for deep diagnosis by vendor)
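
To make the proposed layout concrete, here is a rough user-space sketch in Python. It is not the tracepoint code; the real output would come from the kernel's trace format string. The function name format_mc_event and the dict of details are hypothetical, and the field values are copied from the example quoted at the top of this message, with type:"memory read" moved inside the parentheses.

```python
def format_mc_event(severity, label, details):
    """Build one line in the proposed layout.

    The headline carries only what a casual user needs (severity and
    memory stick label); everything else goes inside the parentheses
    as key:value pairs for deep diagnosis.
    """
    parts = []
    for key, value in details.items():
        # Quote free-form text that contains spaces (such as the error
        # type); leave numbers and single tokens bare.
        if isinstance(value, str) and " " in value:
            parts.append(f'{key}:"{value}"')
        else:
            parts.append(f"{key}:{value}")
    return f'mc_event: {severity} error on memory stick "{label}" ({" ".join(parts)})'


# Hypothetical helper call; values taken from the quoted example.
line = format_mc_event(
    "Corrected",
    "DIMM_1A",
    {
        "type": "memory read",
        "mc": 0,
        "channel": 0,
        "slot": 0,
        "page": "0x586b6e",
        "offset": "0xa66",
        "grain": 32,
        "syndrome": "0x0",
        "area": "DMA",
    },
)
print(line)
```

The point of the split is that the part before the parenthesis never changes shape, no matter what odd text a driver supplies for the error type.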

Knowing that the error was Corrected/Uncorrected is vital to the user. It lets them know
the urgency with which they need to take action ... we need to educate them that a few
"Corrected" errors are perfectly normal and nothing to raise blood pressure about.

Knowing which memory stick was involved - also very important. If they do take action,
they should be able to swap out the memory stick that was the source of the problem.

Everything else is just for memory geeks like me, you and Boris (and OEMs who want to
diagnose root cause of problems they see by pattern analysis across errors from multiple
machines with DIMMs from different batches/vendors).
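
As a sketch of why the fixed headline helps on the consuming side too: assuming lines in the layout proposed above, a small script can pull out the two fields the user cares about and count errors per memory stick, leaving the parenthesized part untouched for the experts. The regular expression, the function names and the sample lines below are all illustrative assumptions, not the output of any existing tool.

```python
import re
from collections import Counter

# Matches the proposed headline; the parenthesized details are captured
# as an opaque string and not interpreted here.
MC_EVENT_RE = re.compile(
    r'^mc_event: (?P<severity>\w+) error on memory stick '
    r'"(?P<label>[^"]*)" \((?P<details>.*)\)$'
)


def parse_mc_event(line):
    """Return (severity, label, details) or None if the line does not match."""
    m = MC_EVENT_RE.match(line.strip())
    if m is None:
        return None
    return m.group("severity"), m.group("label"), m.group("details")


def count_by_dimm(lines):
    """Count events per (label, severity), skipping unrelated lines."""
    counts = Counter()
    for line in lines:
        parsed = parse_mc_event(line)
        if parsed is not None:
            severity, label, _details = parsed
            counts[(label, severity)] += 1
    return counts


# Made-up sample input for illustration only.
sample = [
    'mc_event: Corrected error on memory stick "DIMM_1A" (type:"memory read" mc:0 channel:0 slot:0)',
    'mc_event: Corrected error on memory stick "DIMM_1A" (type:"memory read" mc:0 channel:0 slot:0)',
    'mc_event: Uncorrected error on memory stick "DIMM_2B" (type:"memory write" mc:0 channel:1 slot:0)',
    'some unrelated log line',
]
counts = count_by_dimm(sample)
for (label, severity), n in sorted(counts.items()):
    print(f"{label}: {n} {severity}")
```

A few "Corrected" counts against one label are nothing to worry about; the same script run across many machines is where the pattern analysis would start.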

-Tony