linux-kernel - RE: [PATCH 8/8] ACPI / trace: Add trace interface for eMCA driver


Message-ID: <3908561D78D1C84285E8C5FCA982C28F31D31C65@ORSMSX106.amr.corp.intel.com>
Date:	Wed, 16 Oct 2013 20:47:05 +0000
From:	"Luck, Tony" <tony.luck@...el.com>
To:	Mauro Carvalho Chehab <m.chehab@...sung.com>,
	Borislav Petkov <bp@...en8.de>
CC:	"Naveen N. Rao" <naveen.n.rao@...ux.vnet.ibm.com>,
	"Chen, Gong" <gong.chen@...ux.intel.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"linux-acpi@...r.kernel.org" <linux-acpi@...r.kernel.org>,
	Aristeu Rozanski Filho <arozansk@...hat.com>,
	Steven Rostedt <srostedt@...hat.com>
Subject: RE: [PATCH 8/8] ACPI / trace: Add trace interface for eMCA driver

> Also, I suspect that, if an error happens to affect more than one DIMM
> (e. g. part of the location is not available for a given error),
> that the DIMM label will also not be properly shown.

There are a couple of cases here:

1) There are a number of DIMMs behind some flaky h/w that introduces errors
that appear to come from, and so get blamed on, each of those DIMMs.

  All we can do here is statistical correlations ... each error is reported independently,
  it is up to some entity to notice the higher level topology connection. There is enough
  information in the UEFI error record to do that (assuming that BIOS filled out the
  necessary fields).

2) There is a single reported error that spans more than one DIMM.

  This can happen with a UC error in a pair of lock-step DIMMs.  Since the error is UC
  we know that two (or more) bits are bad.  But we have no way to tell whether the
  bad bits came from the same DIMM, or one bit from each (because we don't know
  which bits are bad - if we knew that, we could fix them :-).  The eMCA case should
  log two subsections in this case - one for each of the lockstep DIMMs involved. A user
  seeing this should probably just replace both DIMMs to be safe.  If they wanted to
  diagnose further they should swap DIMMs around so this pair is no longer lockstepped
  and see if they start seeing correctable errors from each of the split pair - or if the UC
  errors move with one or the other of the DIMMs.
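
To make the "statistical correlation" in case 1 concrete, here is a rough
user-space sketch. It is not part of the patch. It assumes the error records
have already been decoded into dicts, and the keys "node", "card" and "module"
are hypothetical names chosen for this example, standing in for the location
fields BIOS may (or may not) have filled out. It groups independently reported
errors by their shared parent and flags any parent that has errors on more
than one DIMM.

```python
from collections import defaultdict


def suspect_shared_parents(records, min_dimms=2):
    """Map each (node, card) parent to the DIMMs behind it that reported errors.

    Only parents with errors on at least `min_dimms` distinct modules are
    returned; those hint at a fault above the DIMMs rather than in one DIMM.
    The dict keys used here are assumptions made for this sketch.
    """
    modules_by_parent = defaultdict(set)
    for rec in records:
        # Skip records where the location fields were left unfilled.
        if None in (rec.get("node"), rec.get("card"), rec.get("module")):
            continue
        modules_by_parent[(rec["node"], rec["card"])].add(rec["module"])
    return {
        parent: sorted(modules)
        for parent, modules in modules_by_parent.items()
        if len(modules) >= min_dimms
    }


# Made-up sample data: two DIMMs behind node 0 / card 1 report errors.
records = [
    {"node": 0, "card": 1, "module": 0},
    {"node": 0, "card": 1, "module": 1},
    {"node": 0, "card": 1, "module": 1},
    {"node": 1, "card": 0, "module": 2},
    {"node": 0, "card": None, "module": 3},  # incomplete location, ignored
]
print(suspect_shared_parents(records))
```

A real tool would also want time windows and error counts per DIMM before
pointing at the shared h/w; this only shows the grouping step.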

-Tony