Message-ID: <1274930481.3444.258.camel@yhuang-dev.sh.intel.com>
Date:	Thu, 27 May 2010 11:21:21 +0800
From:	Huang Ying <ying.huang@...el.com>
To:	Jin Dongming <jin.dongming@...css.fujitsu.com>
Cc:	LKLM <linux-kernel@...r.kernel.org>,
	Andi Kleen <ak@...ux.intel.com>,
	Hidetoshi Seto <seto.hidetoshi@...fujitsu.com>
Subject: Re: [Patch-next] Remove notify_die in do_machine_check function

On Thu, 2010-05-27 at 10:40 +0800, Jin Dongming wrote:
> This patch fixes do_machine_check() failure caused by DIE_NMI.
> 
> I run MCE tests on my machine. When I inject an Uncorrected Error (UE) into
> the kernel, the tests always report failure. The problem is caused by the
> DIE_NMI notification at the start of do_machine_check(). Some notifier users
> handle DIE_NMI, and when they finish their own work they return NOTIFY_STOP.
> That result makes do_machine_check() return immediately.
> 
> So we decided to delete the DIE_NMI notification. When a UE happens and one
> of the CPUs goes down because of an error in a DIE_NMI hook function, the
> reported error type may differ from the real one. For example,
> 
>         CPU0                                  CPU1
> UE      do_machine_check()                    do_machine_check()
>         |                                     |
>         cpu down(hook error of DIE_NMI)       cpu OK(no hook error of DIE_NMI)
>                                               |
>                                               wait CPU0 timeout
>                                               |
>                                               Fatal Error
>                                               (Timeout synchronizing machine
>                                                check over CPUs)

The fatal error will only occur if tolerant = 0, which is not the common
case.
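
A minimal sketch of that dependency, under stated assumptions: this is
not the kernel source. The function mce_sync_outcome(), the enum names
and the remaining_ns parameter are hypothetical, and the "tolerant < 1"
test is assumed from the behaviour described above rather than quoted
from the timeout path.

```c
#include <assert.h>

/* Possible outcomes of waiting for the other CPUs (names are hypothetical). */
enum sync_result { SYNC_WAITING, SYNC_CPU_MISSING, SYNC_PANIC };

/*
 * remaining_ns: time left before the rendezvous deadline (hypothetical).
 * tolerant:     the MCE tolerance level, documented as ranging from 0 to 3.
 */
enum sync_result mce_sync_outcome(long remaining_ns, int tolerant)
{
	assert(tolerant >= 0 && tolerant <= 3);

	if (remaining_ns > 0)
		return SYNC_WAITING;	/* deadline not reached yet */

	if (tolerant < 1)
		return SYNC_PANIC;	/* "Timeout synchronizing machine check over CPUs" */

	return SYNC_CPU_MISSING;	/* carry on without the silent CPU */
}
```

With tolerant = 1 or higher the sketch records the missing CPU and
continues, which is why the timeout is fatal only in the tolerant = 0
case.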

But I think notify_die can still be an issue here. For example, if the
UE is on CPU0 and that MCE is consumed by notify_die, the MCE handler on
CPU1 will detect nothing.

I have heard that on some machines, a hardware error output pin of the
chipset may be wired to a CPU input pin that can raise an MCE. That is,
MCE is also used to report some chipset errors. I think that is why
notify_die is called in do_machine_check. Simply removing notify_die
is not good for these machines.

Maybe we should fix the notifier user instead. Which notifier user
consumes the DIE_NMI notification?
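
To illustrate what fixing the notifier user could mean, here is a
minimal userspace sketch, not kernel code. The return codes are modeled
on the notifier API, but the event sources, the chain helpers and every
function name ending in _sketch or _handler are hypothetical. It only
shows how a handler that returns NOTIFY_STOP for every event makes the
machine check path return early, while a handler that claims only its
own events does not.

```c
#include <assert.h>
#include <stddef.h>

/* Return codes modeled on the kernel notifier API (values are illustrative). */
enum { NOTIFY_DONE = 0, NOTIFY_STOP = 1 };

/* Hypothetical event sources, used only by this sketch. */
enum nmi_source { SRC_MCE, SRC_WATCHDOG };

typedef int (*notifier_fn)(enum nmi_source src);

#define MAX_NOTIFIERS 8
static notifier_fn chain[MAX_NOTIFIERS];
static size_t chain_len;

void chain_reset(void)
{
	chain_len = 0;
}

void chain_register(notifier_fn fn)
{
	assert(chain_len < MAX_NOTIFIERS);
	chain[chain_len++] = fn;
}

/* Walk the chain and stop at the first handler that claims the event. */
int notify_die_sketch(enum nmi_source src)
{
	for (size_t i = 0; i < chain_len; i++)
		if (chain[i](src) == NOTIFY_STOP)
			return NOTIFY_STOP;
	return NOTIFY_DONE;
}

/* Misbehaving user: claims every event, including machine checks. */
int greedy_handler(enum nmi_source src)
{
	(void)src;
	return NOTIFY_STOP;
}

/* Fixed user: claims only the events it actually owns. */
int careful_handler(enum nmi_source src)
{
	return src == SRC_WATCHDOG ? NOTIFY_STOP : NOTIFY_DONE;
}

/* Returns 1 if MCE handling ran, 0 if the event was swallowed first. */
int do_machine_check_sketch(void)
{
	if (notify_die_sketch(SRC_MCE) == NOTIFY_STOP)
		return 0;	/* early return: the error is never examined */
	return 1;		/* normal MCE handling would run here */
}
```

With greedy_handler registered, do_machine_check_sketch() returns 0;
with careful_handler registered instead, it returns 1 and chipset errors
reported through the chain can still be claimed by their owner.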

Best Regards,
Huang Ying

