linux-kernel - Re: [PATCH 5/6] x86, mce: handle "action required" errors

Message-ID: <CA+8MBbLmeTb3808Xs4boH6KwDqfeSi69yzvHVpOMUzQg47bBZQ@mail.gmail.com>
Date:	Wed, 14 Dec 2011 13:30:06 -0800
From:	Tony Luck <tony.luck@...el.com>
To:	Chen Gong <gong.chen@...ux.intel.com>
Cc:	linux-kernel@...r.kernel.org, Ingo Molnar <mingo@...e.hu>,
	Borislav Petkov <bp@...64.org>,
	"Huang, Ying" <ying.huang@...el.com>,
	Hidetoshi Seto <seto.hidetoshi@...fujitsu.com>
Subject: Re: [PATCH 5/6] x86, mce: handle "action required" errors

On Wed, Dec 14, 2011 at 1:28 AM, Chen Gong <gong.chen@...ux.intel.com> wrote:
>> -       if (kill_it && tolerant < 3)
>>
>> +       if (worst != MCE_AR_SEVERITY && kill_it && tolerant < 3)
>>                force_sig(SIGBUS, current);
>
>
> I think here it should add more comments to clarify why not killing *AR* case.
> Such as: "for SRAR errors, such as DCU/IFU error, on affected logical
> processors, it is reasonable that RIPV is 0."

I'll look at this - the reason to not kill for AR is that we want to try to
recover first (e.g. page could be re-read from disk into a different physical
page). In some cases we can recover transparently to the application.

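For readers following the hunk above, the patched condition can be modelled in user space roughly as below. This is only a sketch: the enum values and the name should_sigbus_now() are hypothetical stand-ins, not the kernel's definitions, and the real handler does considerably more work around this test.

```c
#include <stdbool.h>

/* Hypothetical stand-in values; the kernel defines its own severity levels. */
enum severity { SEV_NONE, SEV_AO, SEV_AR, SEV_PANIC };

/*
 * Models the patched test:
 *   worst != MCE_AR_SEVERITY && kill_it && tolerant < 3
 * "Action required" errors skip the immediate SIGBUS so that recovery
 * can be attempted first.
 */
static bool should_sigbus_now(enum severity worst, bool kill_it, int tolerant)
{
	if (worst == SEV_AR)
		return false;	/* defer: try to recover before killing */
	return kill_it && tolerant < 3;
}
```

With these stand-ins, should_sigbus_now(SEV_AR, true, 1) is false while should_sigbus_now(SEV_AO, true, 1) is true, which matches the intent described in the reply: the AR case is not killed on this path.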
>> -       /* notify userspace ASAP */
>> -       set_thread_flag(TIF_MCE_NOTIFY);
>> +       if (worst == MCE_AR_SEVERITY) {
>
>
> how about adding one more condition check: mce_usable_address(&m) here?

I don't think it is needed - the table lookup in mce_severity() will only set
MCE_AR_SEVERITY if the ADDRV and MISCV bits are set in MCi_STATUS.
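As an illustration of the bit test being relied on here, a minimal sketch follows. The bit positions are the architectural ones for IA32_MCi_STATUS (MISCV is bit 59, ADDRV is bit 58); the macro names and the helper status_has_addr_and_misc() are made up for this example, and the real severity table matches on more than these two bits.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural bit positions in IA32_MCi_STATUS. */
#define STATUS_MISCV (1ULL << 59)	/* MCi_MISC holds valid data */
#define STATUS_ADDRV (1ULL << 58)	/* MCi_ADDR holds a valid address */

/* Hypothetical helper, not a kernel function. */
static bool status_has_addr_and_misc(uint64_t status)
{
	const uint64_t need = STATUS_MISCV | STATUS_ADDRV;

	/* Both bits must be set, not just one of them. */
	return (status & need) == need;
}
```

If the severity lookup already requires both bits before it returns MCE_AR_SEVERITY, a second address check at the call site would only repeat this test.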

>> +               mce_save_info(m.addr);
>> +               set_thread_flag(TIF_MCE_NOTIFY);
>
>
> Here only SRAR error are flagged with TIF_MCE_NOTIFY, which means only SRAR
> error is handled in the function do_notify_resume. If so, SRAO error will
> only be handled in work_queue mce_work. If so, I think some related function
> names should be updated too. Otherwise, it will confuse people not touching
> these codes before.

Agreed - the names of the functions and the actions they perform haven't been
kept up to date.

>>  void mce_notify_process(void)
>>  {
>> +       __u64   paddr = paddr;
>
>
> you mean "__u64 paddr = 0;"?

No. The "paddr = paddr" is a gcc'ism to silence a spurious "may be used
before set" warning.  But the point will be moot in the next version because
changes inspired by Boris' comments mean that this line goes away.
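A small stand-alone sketch of the idiom, for anyone who has not seen it. The helper names and the 0x1000 value are invented for illustration; only the "paddr = paddr" line reflects the patch. The self-initialization leaves the value indeterminate, so it is only safe when every path that reads the variable has stored to it first, as below.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical lookup: stores to *paddr only when it returns true. */
static bool find_saved_addr(bool have_entry, uint64_t *paddr)
{
	if (!have_entry)
		return false;
	*paddr = 0x1000;	/* made-up address for the example */
	return true;
}

static uint64_t lookup_or_zero(bool have_entry)
{
	/*
	 * Self-initialization: gcc treats this as "initialized" for the
	 * purposes of its may-be-used-uninitialized warning. The value is
	 * still indeterminate until find_saved_addr() writes it.
	 */
	uint64_t paddr = paddr;

	if (find_saved_addr(have_entry, &paddr))
		return paddr;	/* only read after a successful store */
	return 0;
}
```

Writing "paddr = 0" instead also silences the warning, at the cost of hiding a genuinely missing assignment later on. gcc's -Winit-self option turns the warning back on for the self-initialized form.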

> Does there exist some possibility that in the same process there are more
> than one error triggered? If so, maybe mce_find_info/mce_clear_info should
> be changed to loop-style, because here TIF_MCE_NOTIFY is cleared in the
> handler.
>
> Or it is impossible because overwritten will be covered by following
> condition:

I think that on current CPUs it isn't possible to have more than one
error reported at the same time per process.

-Tony