Date:	Wed, 21 May 2014 21:09:45 +0000
From:	"Luck, Tony" <tony.luck@...el.com>
To:	Hidetoshi Seto <seto.hidetoshi@...fujitsu.com>,
	Chen Yucong <slaoub@...il.com>
CC:	"bp@...en8.de" <bp@...en8.de>,
	"ak@...ux.intel.com" <ak@...ux.intel.com>,
	"Huang, Ying" <ying.huang@...el.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"linux-edac@...r.kernel.org" <linux-edac@...r.kernel.org>
Subject: RE: [PATCH v2] x86/mce: Distirbute the clear operation of mces_seen
 to Per-CPU rather than only monarch CPU

>> mce_reign(), which is only called by the monarch CPU, can be used to panic
>> the system as quickly as possible if there is a truly data-corrupting error.
>> But the monarch CPU doesn't have to help all the other CPUs clear mces_seen.
>> One advantage of per-CPU data is that it isolates error propagation; that
>> being so, why don't we clear mces_seen on each CPU?
>
> What kind of error propagations are you expecting/concerning here?
> Could you explain the problem more in detail?

Please do give us more detail on the scenario that you see that would
make your new version behave better.

I'm sure the current code has no races w.r.t. clearing mces_seen. The
monarch clears them all in mce_reign() before clearing mce_executing
at the foot of mce_end() and allowing the others to run again.

Your code has the monarch release all the other cpus from the spinloop
in mce_end() so they will all rush together through the final lines of
do_machine_check().  Some of them will have work to do if they saw
errors - they may have to send signals, or log the error. Others can
fly directly to the end of do_machine_check() and clear MCG_STATUS
and return to executing whatever code was interrupted.

So it is possible that some processors will already be out doing things
that can generate another machine check before others have finished their
tasks and reached the point where they clear mces_seen. (*)
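
To make the ordering concern concrete, here is a toy, single-threaded model
(not kernel code). Everything in it is an assumption made for illustration:
NCPUS, the boolean "seen" slots standing in for mces_seen, the busy[] array,
and the function name stale_at_second_mce() are all hypothetical. It only
counts how many slots still hold a record from the first machine check at
the moment a fast CPU could take a second one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NCPUS 4 /* hypothetical CPU count for the toy model */

/*
 * Toy stand-in for the per-CPU mces_seen slots: true means the slot
 * still holds a record from the first machine check.
 *
 * busy[i] is true when CPU i still has work to do (signals, logging)
 * after the monarch releases everyone from the spin loop.
 *
 * Returns the number of stale slots at the moment a CPU with no work
 * has already returned and could trigger a second machine check.
 */
int stale_at_second_mce(bool per_cpu_clear, const bool busy[NCPUS])
{
    bool seen[NCPUS];
    int stale = 0;

    assert(busy != NULL);

    /* First machine check: every CPU recorded something. */
    for (size_t i = 0; i < NCPUS; i++)
        seen[i] = true;

    for (size_t i = 0; i < NCPUS; i++) {
        if (!per_cpu_clear) {
            /* Current scheme: monarch clears every slot before release. */
            seen[i] = false;
        } else if (!busy[i]) {
            /* Per-CPU scheme: only CPUs that already finished have cleared. */
            seen[i] = false;
        }
    }

    for (size_t i = 0; i < NCPUS; i++)
        if (seen[i])
            stale++;

    return stale;
}
```

With two of four CPUs still busy, the monarch-clears-all ordering leaves no
stale slots, while the per-CPU ordering leaves two. Whether those stale
slots can actually be observed is the open question raised in the footnote.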

-Tony

(*) Maybe that doesn't matter, because they haven't zeroed MCG_STATUS
yet, so this second machine check will force those CPUs to shut down. See
the MCIP description in section 15.3.1.2, "IA32_MCG_STATUS MSR", of the
Intel Software Developer's Manual.
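
As a small sketch of the MCIP point in the footnote: MCIP is bit 2 of
IA32_MCG_STATUS (RIPV is bit 0, EIPV is bit 1). The helper name
second_mce_forces_shutdown() is hypothetical and only tests the bit in a
value that has already been read from the MSR; it does not read the MSR
itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* IA32_MCG_STATUS bit layout. */
#define MCG_STATUS_RIPV (1ULL << 0) /* restart IP valid */
#define MCG_STATUS_EIPV (1ULL << 1) /* error IP valid */
#define MCG_STATUS_MCIP (1ULL << 2) /* machine check in progress */

/*
 * Hypothetical helper: given a previously read MCG_STATUS value, report
 * whether MCIP is still set, i.e. whether another machine check arriving
 * now would shut the processor down instead of being delivered.
 */
bool second_mce_forces_shutdown(uint64_t mcg_status)
{
    return (mcg_status & MCG_STATUS_MCIP) != 0;
}
```

A CPU that has not yet zeroed MCG_STATUS at the end of do_machine_check()
still has MCIP set, so the helper returns true for it; once the register is
written back to zero it returns false.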
