From: Joshua Wise <jwise@google.com>

Background:
 In some situations, mce_log would race against mce_read and deadlock. This
 race condition is described in more detail in the body of the associated
 e-mail.

Description:
 This patch lets mce_log give up after MCE_LOG_RETRIES (100,000,000) spins
 instead of waiting forever for a deadlock that will not resolve. When it
 gives up, it sets MCE_OVERFLOW and drops the event. This may lose a machine
 check exception from time to time, but that is certainly better than
 bringing down the system.
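
 For illustration only, and not part of the patch: the bounded-spin idea can
 be reduced to a small user-space C11 sketch. Every name below (demo_log,
 demo_log_add, DEMO_LOG_LEN, DEMO_LOG_RETRIES) is hypothetical, and the busy
 flag is a simplification I am assuming for the example; the real mce_log
 operates on mcelog and sets MCE_OVERFLOW with set_bit, as in the diff.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_LOG_LEN     4      /* hypothetical log size, tiny on purpose */
#define DEMO_LOG_RETRIES 1000u  /* hypothetical cap; the patch uses 100000000 */

struct demo_log {
	atomic_bool busy;      /* true while some other party owns the log */
	atomic_bool overflow;  /* set whenever an event had to be dropped */
	unsigned next;         /* index of the next free slot */
	int entries[DEMO_LOG_LEN];
};

/*
 * Try to append one event. Returns true if it was stored, false if it was
 * dropped, either because the log is full or because the retry cap was
 * exceeded while waiting for the log to become free.
 */
static bool demo_log_add(struct demo_log *log, int event)
{
	unsigned attempts = 0;
	bool stored = false;

	assert(log != NULL);

	/* Spin until we own the log, but never spin forever. */
	while (atomic_exchange(&log->busy, true)) {
		if (attempts++ > DEMO_LOG_RETRIES) {
			/* Give up: record the loss instead of hanging. */
			atomic_store(&log->overflow, true);
			return false;
		}
	}

	if (log->next < DEMO_LOG_LEN) {
		log->entries[log->next++] = event;
		stored = true;
	} else {
		/* Log is full: discard the new entry, keep the old ones. */
		atomic_store(&log->overflow, true);
	}

	atomic_store(&log->busy, false);
	return stored;
}
```

 If the busy flag is never released, demo_log_add returns false with the
 overflow flag set rather than spinning forever, which is the trade-off
 described above: an occasional lost event in exchange for not hanging.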

Remaining issues:
 * Should the whole log structure just be rewritten as a ring buffer?

Testing:
 I injected single-bit errors on CPU0 in a while loop overnight while
 running mced on CPU1. Previously, this would crash the system within
 minutes; with the patch, the system survived the entire night.

Credits:
 Thanks to Tim Hockin <thockin@google.com> and Mike Waychison
 <mikew@google.com> for sitting down with me and combing through this to
 help me find the race.

Patch:
 This patch is against git 0471448f4d017470995d8a2272dc8c06dbed3b77.

Signed-off-by: Joshua Wise <joshua@joshuawise.com>

--

diff --git a/arch/x86_64/kernel/mce.c b/arch/x86_64/kernel/mce.c
index aa1d159..87ff9dd 100644
--- a/arch/x86_64/kernel/mce.c
+++ b/arch/x86_64/kernel/mce.c
@@ -30,6 +30,8 @@ #include <asm/smp.h>
 #define MISC_MCELOG_MINOR 227
 #define NR_BANKS 6
 
+#define MCE_LOG_RETRIES 100000000 /* if we retry 100,000,000 times, we are probably in the synchronize_sched race */
+
 atomic_t mce_entry;
 
 static int mce_dont_init;
@@ -61,7 +63,7 @@ struct mce_log mcelog = { 
 
 void mce_log(struct mce *mce)
 {
-	unsigned next, entry;
+	unsigned next, entry, attempts = 0;
 	atomic_inc(&mce_events);
 	mce->finished = 0;
 	wmb();
@@ -70,6 +72,13 @@ void mce_log(struct mce *mce)
 		/* The rmb forces the compiler to reload next in each
 		    iteration */
 		rmb();
+		
+		/* Deal with the synchronize_sched race */
+		if (attempts++ > MCE_LOG_RETRIES) {
+			set_bit(MCE_OVERFLOW, &mcelog.flags);
+			return;
+		}
+		
 		for (;;) {
 			/* When the buffer fills up discard new entries. Assume
 			   that the earlier errors are the more interesting. */