Message-ID: <1274930481.3444.258.camel@yhuang-dev.sh.intel.com>
Date:	Thu, 27 May 2010 11:21:21 +0800
From:	Huang Ying <ying.huang@...el.com>
To:	Jin Dongming <jin.dongming@...css.fujitsu.com>
Cc:	LKLM <linux-kernel@...r.kernel.org>,
	Andi Kleen <ak@...ux.intel.com>,
	Hidetoshi Seto <seto.hidetoshi@...fujitsu.com>
Subject: Re: [Patch-next] Remove notify_die in do_machine_check function

On Thu, 2010-05-27 at 10:40 +0800, Jin Dongming wrote:
> This patch fixes do_machine_check() failure caused by DIE_NMI.
> 
> I ran MCE tests on my machine. Whenever I inject an Uncorrected Error (UE)
> into the kernel, the test reports a failure. The problem is caused by the
> DIE_NMI notification at the start of do_machine_check(). Some notifier
> users register for DIE_NMI and return NOTIFY_STOP once they have finished
> their own work, and that result makes do_machine_check() return at that
> point.
> 
> So we decided to remove the DIE_NMI notification. When a UE happens and
> one of the CPUs goes down because of an error in a DIE_NMI hook function,
> the reported error type may differ from the real one. For example,
> 
>         CPU0                                  CPU1
> UE      do_machine_check()                    do_machine_check()
>         |                                     |
>         cpu down(hook error of DIE_NMI)       cpu OK(no hook error of DIE_NMI)
>                                               |
>                                               wait CPU0 timeout
>                                               |
>                                               Fatal Error
>                                               (Timeout synchronizing machine
>                                                check over CPUs)

A fatal error will only occur if tolerant = 0, which is not the common
case.

But I think notify_die can be an issue here. For example, if the UE is
on CPU0 and the MCE is consumed by notify_die, the MCE handler on CPU1
will detect nothing.
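
To make the "consumed by notify_die" case concrete, here is a minimal
userspace sketch of a notifier chain, not the kernel code itself. The
return values, register_notifier(), call_chain(), greedy_handler() and
model_machine_check() are all hypothetical stand-ins; the only thing
taken from this thread is that do_machine_check() returns early when a
DIE_NMI notifier returns NOTIFY_STOP.

```c
#include <assert.h>
#include <stddef.h>

/* Return codes modelled on the kernel notifier API; values are illustrative. */
enum { NOTIFY_DONE = 0, NOTIFY_STOP = 1 };

typedef int (*notifier_fn)(int event);

#define MAX_NOTIFIERS 8
static notifier_fn chain[MAX_NOTIFIERS];
static size_t chain_len;

/* Hypothetical helper: append a handler to the chain. */
static void register_notifier(notifier_fn fn)
{
    assert(chain_len < MAX_NOTIFIERS);
    chain[chain_len++] = fn;
}

/* Walk the chain; the first handler returning NOTIFY_STOP ends the walk. */
static int call_chain(int event)
{
    for (size_t i = 0; i < chain_len; i++)
        if (chain[i](event) == NOTIFY_STOP)
            return NOTIFY_STOP;
    return NOTIFY_DONE;
}

/* Hypothetical handler that claims every event, even ones it does not own. */
static int greedy_handler(int event)
{
    (void)event;
    return NOTIFY_STOP;
}

static int mce_handled; /* counts how often the MCE body actually ran */

/* Simplified stand-in for do_machine_check(). */
static void model_machine_check(void)
{
    if (call_chain(1 /* stand-in for DIE_NMI */) == NOTIFY_STOP)
        return;          /* the error is never examined on this CPU */
    mce_handled++;       /* normal machine check processing */
}
```

With an empty chain, model_machine_check() increments mce_handled. Once
greedy_handler is registered, every later call returns before the
increment, which is the early-return behaviour described above.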

I have heard that on some machines a hardware error output pin of the
chipset may be wired to a CPU input pin that can raise an MCE. That is,
MCE is used to report some chipset errors too. I think that is why
notify_die is called in do_machine_check. Simply removing notify_die
is not good for these machines.

Maybe we should fix the notifier user instead. Which notifier user
consumes the DIE_NMI notification?
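
As a rough illustration of what "fixing the notifier user" could mean,
the sketch below shows a handler that stops the chain only for events
it can positively identify as its own. struct nmi_event, its
from_my_device field and polite_handler() are hypothetical names, and
the return values are illustrative; a real notifier user would need its
own way to recognise its events.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Return codes modelled on the kernel notifier API; values are illustrative. */
enum { NOTIFY_DONE = 0, NOTIFY_STOP = 1 };

/* Hypothetical per-event context a notifier user might inspect. */
struct nmi_event {
    bool from_my_device; /* assumption: the user can tell its own events apart */
};

/* Stop the chain only for events this user actually owns. */
static int polite_handler(const struct nmi_event *ev)
{
    assert(ev != NULL);
    if (!ev->from_my_device)
        return NOTIFY_DONE; /* not ours: let machine check handling continue */
    /* ... device-specific handling would go here ... */
    return NOTIFY_STOP;     /* ours: nothing left for other handlers to do */
}
```

Returning NOTIFY_DONE for foreign events keeps the chipset-error use of
notify_die intact while no longer hiding a real UE from the machine
check handler.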

Best Regards,
Huang Ying

