Message-ID: <54ADCCF2.9060402@amd.com>
Date: Wed, 7 Jan 2015 18:18:58 -0600
From: Aravind Gopalakrishnan <aravind.gopalakrishnan@....com>
To: Borislav Petkov <bp@...en8.de>
CC: <tglx@...utronix.de>, <mingo@...hat.com>, <hpa@...or.com>,
<tony.luck@...el.com>, <dougthompson@...ssion.com>,
<mchehab@....samsung.com>, <x86@...nel.org>,
<linux-kernel@...r.kernel.org>, <linux-edac@...r.kernel.org>,
<dave.hansen@...ux.intel.com>, <mgorman@...e.de>, <bp@...e.de>,
<riel@...hat.com>, <jacob.w.shin@...il.com>
Subject: Re: [PATCH 0/3] Fix MCE handling for AMD multi-node processors
On 1/7/2015 11:06 AM, Borislav Petkov wrote:
> On Tue, Jan 06, 2015 at 05:54:15PM -0600, Aravind Gopalakrishnan wrote:
>> But we still need to change the error injection interfaces in mce_amd_inj:
>> mce_amd_inj triggers a #MC on the cpu number that the user specifies on
>> debugfs.
>> For any error other than MC4 errors, this is fine.
>> But we should really be triggering #MC only on NBC for MC4 errors.
> Why?
>
> As you said yourself, the errors get reported on the NBC. Where they get
> *triggered* is a different story.
Apologies if I was not clear earlier. Let me try to explain the issue
again; I will be verbose for the sake of clarity.
The bank 4 MSRs are per-node, and per-node MSRs are shared between the
cores in a node.
So, technically, all cores of the same node have access to these MSRs.
But since D18F3x44[NbMcaToMstCpuEn] is set, access is restricted to
the NBC (node base core) only.
The BKDG states that reads of these MSRs from other cores return 0s and
writes are ignored.
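
To make the access rule concrete, here is a toy model of it. It is plain
Python, not kernel code, and every name in it (Node, wrmsr, rdmsr,
restrict_to_nbc) is hypothetical; it only mimics the behaviour described
above, assuming a single shared bank 4 status register per node.

```python
# Toy model only: mimics the access rule described above, not real hardware.
# All names here are hypothetical and chosen for illustration.
class Node:
    """One node whose bank 4 status MSR is shared by all of its cores."""

    def __init__(self, cores, nbc, restrict_to_nbc=True):
        self.cores = set(cores)
        self.nbc = nbc                          # node base core
        self.restrict_to_nbc = restrict_to_nbc  # stands in for the D18F3x44 bit
        self.mc4_status = 0                     # single shared copy per node

    def _allowed(self, cpu):
        if cpu not in self.cores:
            raise ValueError("cpu %d is not on this node" % cpu)
        return (not self.restrict_to_nbc) or cpu == self.nbc

    def wrmsr(self, cpu, value):
        # Writes from a non-NBC core are silently ignored.
        if self._allowed(cpu):
            self.mc4_status = value

    def rdmsr(self, cpu):
        # Reads from a non-NBC core return zero.
        return self.mc4_status if self._allowed(cpu) else 0
```

With restrict_to_nbc left at True, a wrmsr() from core 6 changes nothing
and an rdmsr() from core 6 returns 0, while the same calls from core 0
(the NBC in this model) behave normally.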
Now, with the mce_amd_inj interface as it is right now, we basically
wrmsr_on_cpu() to the MCx_[status|addr|misc] registers using the cpu
value the user specifies at /sys/kernel/debug/mce-inject/cpu.
For a bank 4 error (assume a UC case here) injected on a non-NBC core
(say core 6 of the first node in a multi-node platform), mce_amd_inj
will simply wrmsr_on_cpu(6,...).
Since writes are ignored, nothing gets populated in the MSRs. When you
then trigger_mce on cpu 6, do_machine_check reads the status MSR on
cpu 6, the read returns zero (RAZ), and so you see no output in dmesg.
(This is why I had originally thought we had dropped MCEs.)
If the same error were introduced on the NBC (core 0 in the above
example), i.e., if the user provided cpu number 0 in
/sys/kernel/debug/mce-inject/cpu, then we would see output in dmesg,
because writes from cpu 0 to the MSRs go through.
This is the correction I have made in patch 3: for bank = 4, I find the
NBC for the given cpu and write the MSRs using the NBC value.
(I still need to modify the patch to also trigger the #MC on the NBC.)
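
As a rough illustration of the redirect, here is a toy simulation. It is
plain Python, not the actual patch, and all names (Machine, nbc_of,
inject_bank4, CORES_PER_NODE) are hypothetical. It assumes 8 cores per
node and that the lowest-numbered core of each node is the NBC, which
matches the example above but is only an assumption here.

```python
# Toy simulation only; not the patch itself. All names are hypothetical.
CORES_PER_NODE = 8  # assumption for illustration


def nbc_of(cpu):
    """Lowest-numbered core on cpu's node (assumed here to be the NBC)."""
    return cpu - cpu % CORES_PER_NODE


class Machine:
    def __init__(self, nr_nodes):
        # One shared bank 4 status value per node.
        self.mc4_status = [0] * nr_nodes

    def write_mc4_status(self, cpu, value):
        # Only the NBC's write lands; other cores' writes are ignored.
        if cpu == nbc_of(cpu):
            self.mc4_status[cpu // CORES_PER_NODE] = value

    def read_mc4_status(self, cpu):
        # Only the NBC reads the real value; other cores read zero.
        if cpu == nbc_of(cpu):
            return self.mc4_status[cpu // CORES_PER_NODE]
        return 0


def inject_bank4(machine, cpu, status, redirect_to_nbc):
    """Write a bank 4 status, then return what the #MC handler would read."""
    target = nbc_of(cpu) if redirect_to_nbc else cpu
    machine.write_mc4_status(target, status)
    # The handler runs on the core where the #MC is triggered.
    return machine.read_mc4_status(target)
```

Calling inject_bank4(Machine(2), 6, status, False) returns 0 in this
model, which corresponds to the silent dmesg; with redirect_to_nbc=True
the same call returns the status that was written.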
Also, just to clarify any terminology issues:
'Reporting' of errors means active notification of errors to software
via machine check exceptions (as defined by the BKDG in the section
"Error Detection, Action, Logging, and Reporting". That is section
2.13.1.3 in the F15h M0h-0fh BKDG rev 3.14 for me; the section number
may vary depending on the document version you are referring to).
> We do injection as it is described in "2.15.2 Error Injection and
> Simulation" in F15h BKDG, for example. Reporting of the thusly injected
> bank4 error goes to the NBC.
>
>
Just want to clarify some (potential) terminology issues here too:
"Error injection" means causing a DRAM error by writing to D18F3xB8 and
D18F3xBC. If a DRAM error is introduced that way, HW should correctly
'report' the error to the NBC.
"Error simulation" is basically what we are doing in mce_amd_inj.
But before we drive a #MC, we should honor the rules the BKDG specifies
for writing the MSRs, IMHO (specifically for the bank = 4 case).
Thanks,
-Aravind.