linux-kernel - Re: x86_mce: mce_start uses number of phsical cores instead of logical cores

Date:	Sat, 11 May 2013 10:32:29 -0700
From:	Tony Luck <tony.luck@...il.com>
To:	Ming Lei <Ming.Lei@...erbed.com>
Cc:	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"mchehab@...hat.com" <mchehab@...hat.com>,
	"bp@...en8.de" <bp@...en8.de>
Subject: Re: x86_mce: mce_start uses number of phsical cores instead of
 logical cores

> What I understand from the above passage in the Intel 64 Architecture Software Developer's Manual is:
> 1) the manual is written for software developers;
> 2) it says that the MCE handler only needs to synchronize among the logical cores in the same package/core (which I assume here means the same CPU socket).
>
> I have two CPU sockets on the motherboard and a total of 24 logical cores (12 cores per CPU). Each CPU has its own integrated memory controller. Each memory controller controls three channels of DIMMs. I can understand that if one DIMM has an error, the memory controller can trigger the MCE exception on its own CPU, but why should this memory controller send the MCE exception to the other CPU or the other CPUs on the motherboard? Is there any hardware standard or specification for it?

The Software Developer's Manual is the specification of the
architecture - there are data sheets for each processor which describe
implementation details (e.g. perhaps which types of errors are
reported in which banks, and the MCi_STATUS.MSCOD field values that
provide more information about an error).

Your "1&2" understanding is correct. Your question on "why should this
memory controller send the MCE exception ..." is a good one. The
answer is because the architecture requires it, even though you and I
can imagine that, in some cases, the OS could do its work if the error
were sent only to the processors on the socket where the error was
found. There may be other cases where this is less easy (e.g. a
logical processor on one socket issues a NUMA read to a location that
is behind the memory controller on the other socket).
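
To illustrate why the count in the subject line matters when the
exception is broadcast: if every logical processor enters the handler,
a rendezvous has to wait for the number of logical processors, not
physical cores. What follows is only a user-space sketch with threads
standing in for logical CPUs. It is not the kernel's mce_start(); the
names simulate_broadcast, cpu_enters_handler and spin_budget are
hypothetical, and the spin budget is a crude stand-in for a timeout.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

#define MAX_CPUS 64

/* Shared state for one simulated broadcast machine check. */
struct rendezvous {
    atomic_int callin;     /* simulated CPUs that have checked in */
    int expected;          /* how many CPUs the handler waits for */
    long spin_budget;      /* crude stand-in for a timeout */
    atomic_int timed_out;  /* CPUs that gave up waiting */
};

/* Each thread plays one logical CPU entering the handler. */
static void *cpu_enters_handler(void *arg)
{
    struct rendezvous *r = arg;

    atomic_fetch_add(&r->callin, 1);            /* "I have arrived" */
    for (long i = 0; i < r->spin_budget; i++) {
        if (atomic_load(&r->callin) >= r->expected)
            return NULL;                         /* everyone expected is here */
        sched_yield();
    }
    atomic_fetch_add(&r->timed_out, 1);         /* budget exhausted */
    return NULL;
}

/*
 * Run ncpus simulated CPUs that each wait for `expected` arrivals.
 * Returns how many of them timed out, or -1 on a setup error.
 */
int simulate_broadcast(int ncpus, int expected, long spin_budget)
{
    pthread_t tid[MAX_CPUS];
    struct rendezvous r;
    int started;

    if (ncpus < 1 || ncpus > MAX_CPUS)
        return -1;

    atomic_init(&r.callin, 0);
    atomic_init(&r.timed_out, 0);
    r.expected = expected;
    r.spin_budget = spin_budget;

    for (started = 0; started < ncpus; started++)
        if (pthread_create(&tid[started], NULL, cpu_enters_handler, &r) != 0)
            break;
    for (int i = 0; i < started; i++)
        pthread_join(tid[i], NULL);

    return started == ncpus ? atomic_load(&r.timed_out) : -1;
}
```

With the expected count equal to the number of simulated CPUs, nobody
times out. If the expected count is larger than the number of CPUs
that actually arrive, every waiter exhausts its budget; if it is
smaller, waiters are released before all CPUs have checked in. Build
with something like "cc -pthread".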

-Tony