Message-ID: <CA+8MBbLmeTb3808Xs4boH6KwDqfeSi69yzvHVpOMUzQg47bBZQ@mail.gmail.com>
Date: Wed, 14 Dec 2011 13:30:06 -0800
From: Tony Luck <tony.luck@...el.com>
To: Chen Gong <gong.chen@...ux.intel.com>
Cc: linux-kernel@...r.kernel.org, Ingo Molnar <mingo@...e.hu>,
Borislav Petkov <bp@...64.org>,
"Huang, Ying" <ying.huang@...el.com>,
Hidetoshi Seto <seto.hidetoshi@...fujitsu.com>
Subject: Re: [PATCH 5/6] x86, mce: handle "action required" errors
On Wed, Dec 14, 2011 at 1:28 AM, Chen Gong <gong.chen@...ux.intel.com> wrote:
>> - if (kill_it && tolerant < 3)
>>
>> + if (worst != MCE_AR_SEVERITY && kill_it && tolerant < 3)
>> force_sig(SIGBUS, current);
>
>
> I think more comments should be added here to clarify why the *AR* case
> is not killed.
> Such as: "for SRAR errors, such as DCU/IFU error, on affected logical
> processors, it is reasonable that RIPV is 0."
I'll look at this - the reason to not kill for AR is that we want to try to
recover first (e.g. page could be re-read from disk into a different physical
page). In some cases we can recover transparently to the application.
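A minimal sketch of that ordering, to make the condition in the hunk above
concrete. Everything prefixed sketch_/SKETCH_ is hypothetical and only
illustrates the decision; it is not the code from the patch, and the enum
values are not the kernel's.

```c
#include <stdbool.h>

/* Hypothetical severity levels, for illustration only. */
enum sketch_severity {
	SKETCH_NO_SEVERITY,
	SKETCH_AO_SEVERITY,	/* action optional */
	SKETCH_AR_SEVERITY,	/* action required */
};

/*
 * Return true when the current task should get SIGBUS right away.
 * For "action required" errors the kill is deferred so that recovery
 * (e.g. re-reading the page from disk) can be attempted first.
 */
static bool sketch_kill_now(enum sketch_severity worst, bool kill_it,
			    int tolerant)
{
	if (worst == SKETCH_AR_SEVERITY)
		return false;	/* try to recover before killing */

	return kill_it && tolerant < 3;
}
```

With these assumptions, sketch_kill_now(SKETCH_AR_SEVERITY, true, 0) is
false, while any other severity falls back to the old kill_it/tolerant test.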
>> - /* notify userspace ASAP */
>> - set_thread_flag(TIF_MCE_NOTIFY);
>> + if (worst == MCE_AR_SEVERITY) {
>
>
> how about adding one more condition check: mce_usable_address(&m) here?
I don't think it is needed - the table lookup in mce_severity() will only set
MCE_AR_SEVERITY if the ADDRV and MISCV bits are set in MCi_STATUS.
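For illustration, the ADDRV/MISCV part of that check can be sketched as
below. The helper name and the SKETCH_ macros are hypothetical; the bit
positions (MISCV = bit 59, ADDRV = bit 58) follow the architectural layout
of MCi_STATUS. The real severity table matches on more bits than these two,
so this is only the piece relevant to the question.

```c
#include <stdbool.h>
#include <stdint.h>

/* MCi_STATUS valid bits (hypothetical names, architectural positions). */
#define SKETCH_STATUS_MISCV	(1ULL << 59)	/* MCi_MISC holds valid data */
#define SKETCH_STATUS_ADDRV	(1ULL << 58)	/* MCi_ADDR holds valid data */

/* True only when both the address and misc registers are marked valid. */
static bool sketch_has_addr_and_misc(uint64_t status)
{
	const uint64_t need = SKETCH_STATUS_ADDRV | SKETCH_STATUS_MISCV;

	return (status & need) == need;
}
```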
>> + mce_save_info(m.addr);
>> + set_thread_flag(TIF_MCE_NOTIFY);
>
>
> Here only SRAR errors are flagged with TIF_MCE_NOTIFY, which means only SRAR
> errors are handled in the function do_notify_resume. If so, SRAO errors will
> only be handled in the work queue mce_work. If so, I think some related
> function names should be updated too. Otherwise, it will confuse people who
> have not touched this code before.
Agreed - the names of the functions and the actions they perform haven't been
kept up to date.
>> void mce_notify_process(void)
>> {
>> + __u64 paddr = paddr;
>
>
> you mean "__u64 paddr = 0;"?
No. The "paddr = paddr" is a gcc'ism to silence a spurious "may be used
before set" warning. But the point will be moot in the next version because
changes inspired by Boris' comments mean that this line goes away.
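A small standalone sketch of the idiom, using hypothetical sketch_ helpers
rather than the functions from the patch. The variable is only read after
the lookup has stored to it, so the self-initialization exists purely to
quiet the compiler. Whether gcc warns without it depends on version and
optimization level, and clang (or gcc with -Winit-self) flags the
self-initialization itself.

```c
#include <stdint.h>

/* Hypothetical lookup: stores to *out and returns 1 only for key 42. */
static int sketch_lookup(int key, uint64_t *out)
{
	if (key == 42) {
		*out = 0x1000;
		return 1;
	}
	return 0;
}

static uint64_t sketch_use(int key)
{
	/* Self-initialization: silences "may be used before set". */
	uint64_t paddr = paddr;

	if (sketch_lookup(key, &paddr))
		return paddr;	/* only read after the lookup stored to it */

	return 0;
}
```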
> Is it possible for more than one error to be triggered in the same
> process? If so, maybe mce_find_info/mce_clear_info should be changed to
> loop-style, because here TIF_MCE_NOTIFY is cleared in the handler.
>
> Or is it impossible because overwriting will be covered by the following
> condition:
I think that in current cpus it isn't possible to have more than one
error reported at the same time per process.
-Tony