Message-Id: <200707122320.19489.ak@suse.de>
Date: Thu, 12 Jul 2007 23:20:19 +0200
From: Andi Kleen <ak@...e.de>
To: Joshua Wise <jwise@...gle.com>
Cc: akpm@...gle.com, linux-kernel@...r.kernel.org,
Tim Hockin <thockin@...gle.com>,
Mike Waychison <mikew@...gle.com>
Subject: Re: [x86_64 MCE] [RFC] mce.c race condition (or: when evil hacks are the only options)
On Thursday 12 July 2007 22:24:36 Joshua Wise wrote:
> Some time after my first[1] patch to mce.c, I had continued testing using
> the test method described there (inject hundreds of thousands of single-bit
> errors while reading from /dev/mcelog), and I came across a strange issue.
> Once every couple test cycles (i.e., every hundred-thousand injects), the
> system would begin behaving very strangely. The serial port would stop
> responding entirely, and my SSH sessions would respond, but to one keystroke
> behind what I had typed. After two minutes, the watchdog timer would reboot
> the system.
Thanks for the analysis.
I guess at some point it would make sense to model the whole thing in Spin or
a similar model checker and find even the last race. Anyone interested? @)
>
> At this point, mce_read() and mce_log() are blocking on each other, and will
> be for all eternity. mce_log() is waiting for the on_each_cpu() to complete
> so that the loop can complete, and on_each_cpu() (from mce_read()) is
> waiting for mce_log() to finish.
I suspect the right fix is to get rid of collect_tscs().
One possible way would be to get the TSCs from the mce_log table
before the synchronization instead of asking the CPUs.
Or switch it over to the NMI-supporting RCU that was recently posted.
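A rough sketch of the first option, reading per-CPU TSCs out of the log itself rather than interrupting every CPU. This is only an illustration under assumptions: `struct sketch_entry`, `SKETCH_NR_CPUS` and `tscs_from_log()` are hypothetical names, not the kernel's `struct mce` or anything in mce.c, and the real code would still need the appropriate memory barriers around the read.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NR_CPUS 4        /* hypothetical CPU count for the sketch */

/* Hypothetical log record; NOT the kernel's struct mce. */
struct sketch_entry {
	uint64_t tsc;           /* timestamp taken when the event was logged */
	unsigned int cpu;       /* CPU that logged the event */
	int finished;           /* nonzero once the writer completed the entry */
};

/*
 * For each CPU, record the newest TSC found among the finished entries.
 * No cross-CPU call is made, so the reader never waits on a CPU that is
 * itself spinning inside the logging path.
 */
static void tscs_from_log(const struct sketch_entry *log, size_t len,
			  uint64_t cpu_tsc[SKETCH_NR_CPUS])
{
	size_t i;

	assert(log != NULL && cpu_tsc != NULL);

	for (i = 0; i < SKETCH_NR_CPUS; i++)
		cpu_tsc[i] = 0;

	for (i = 0; i < len; i++) {
		const struct sketch_entry *e = &log[i];

		/* Skip half-written entries and out-of-range CPU numbers. */
		if (!e->finished || e->cpu >= SKETCH_NR_CPUS)
			continue;
		if (e->tsc > cpu_tsc[e->cpu])
			cpu_tsc[e->cpu] = e->tsc;
	}
}
```

A CPU with no finished entry keeps a TSC of 0 here; whether that is an acceptable substitute for the value collect_tscs() returns today would need checking against how mce_read() uses it.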
> -- there may be other edge cases other than
> this one. I'm actually surprised that this wasn't a ring buffer to start
> with -- it certainly seems like it wanted to be one.
The problem with a ring buffer is that it would lose old entries; but
for machine checks you really want the first entries because
the later ones might be just junk.
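To make the difference concrete, here is a minimal sketch of the two policies side by side. The names (`keep_first_log`, `ring_log`, `LOG_LEN`) and the tiny capacity are made up for illustration and are not taken from mce.c.

```c
#include <assert.h>
#include <stddef.h>

#define LOG_LEN 4               /* hypothetical capacity, small on purpose */

struct keep_first_log {
	unsigned long entry[LOG_LEN];
	size_t next;            /* number of slots already used */
	unsigned long dropped;  /* events that arrived after the log filled */
};

struct ring_log {
	unsigned long entry[LOG_LEN];
	size_t head;            /* total number of events ever written */
};

/* Keep-first: once full, later events are counted but not stored. */
static int keep_first_add(struct keep_first_log *log, unsigned long ev)
{
	assert(log != NULL);

	if (log->next >= LOG_LEN) {
		log->dropped++;
		return 0;
	}
	log->entry[log->next++] = ev;
	return 1;
}

/* Ring: always stores, silently overwriting the oldest entry when full. */
static void ring_add(struct ring_log *log, unsigned long ev)
{
	assert(log != NULL);

	log->entry[log->head % LOG_LEN] = ev;
	log->head++;
}
```

Feeding events 1 through 6 into both leaves the keep-first log holding 1, 2, 3, 4 with two drops counted, while the ring has already overwritten events 1 and 2, which for machine checks are the ones most likely to point at the root cause.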
-Andi