lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Date:	Sat, 11 May 2013 10:32:29 -0700
From:	Tony Luck <tony.luck@...il.com>
To:	Ming Lei <Ming.Lei@...erbed.com>
Cc:	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"mchehab@...hat.com" <mchehab@...hat.com>,
	"bp@...en8.de" <bp@...en8.de>
Subject: Re: x86_mce: mce_start uses number of phsical cores instead of
 logical cores

> What I understand from above in intel 64 Arch software Developer's manual are:
> 1) this manual is written for software developer;
> 2) It says that MCE handler only requires to synchronize among the logical cores in the same package/core(what I assume here is same CPU socket).
>
> I have two CPU sockets on motherboard and total 24 logical cores(12 cores each CPU). Each CPU has its own integrated memory controller. Each memory controller controls three channels of DIMMs. I can understand that if one dimm has error, the memory controller can trigger the MCE exception to it's own CPU, but why should this memory controller sends the MCE exception to the other CPU or the rest CPUs on the motherboard? Is there any hardware standard or specification for it?

The Software Developer Manual is the specification of the architecture
- there are data sheets for each processor which describe
implementation details (e.g. perhaps which types of errors are
reported in whcih banks, an MCi_STATUS.MSCOD field values providing
more information about an error).

Your "1&2" understanding is correct. Your question on "why should this
memory controller send the MCE exception ..." is a good one. The
answer is because the architecture requires it; even though you and I
can imagine that it is possible for OS to do its work if the error is
just sent to the processors on the socket where the error was found in
some cases. There may be some cases where this is less easy (e.g. a
logical processor on one socket issues a NUMA read to a location that
is on the memory controller on the other socket).

-Tony
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ