[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20090125040246.GE9216@mit.edu>
Date: Sat, 24 Jan 2009 23:02:46 -0500
From: Theodore Tso <tytso@....edu>
To: Sven-Thorsten Dietrich <sdietrich@...ell.com>
Cc: Jon Masters <jcm@...hat.com>, Thomas Gleixner <tglx@...utronix.de>,
Lee Revell <rlrevell@...-job.com>,
linux-rt-users@...r.kernel.org,
LKML <linux-kernel@...r.kernel.org>,
williams <williams@...hat.com>,
"Luis Claudio R. Goncalves" <lgoncalv@...hat.com>
Subject: Re: [RT] [RFC] simple SMI detector
On Sun, Jan 25, 2009 at 01:12:45PM +1100, Sven-Thorsten Dietrich wrote:
> Latency-friendly SMI implementations may be found in some systems, and
> not so much in others.
>
> It should be up to the promotional divisions of the particular hardware
> suppliers to assess the market pressure, and to report the performance
> advantages of their specific systems.
They definitely exist; I spent quite a bit of time working with IBM
hardware engineers on what we called "SMI remediation". In some cases
it was necessary to have some custom kernel and/or user space code to
run (at background priority) maintenance work such as predictive
failure calculations that really didn't need to be done at SMI
stop-the-OS-on-all-cpus priority, but which hardware manufacturers
would do at the SMI level to avoid needing to create
motherboard-specific device drivers and/or daemons for each OS that
they might want to support.
> I envision documentation of these specs, in a similar fashion to the
> manner in which CPU clock-cycles are documented for specific instruction
> executions - for those systems eligible (certified?) for low-latency
> operation.
I'm not sure exactly how one would do this certification, but I agree
that some kind of "real-time ready" logo/certification program would
make a huge amount of sense, with some standardized metrics of maximum
time spent in an SMI routine, and under what circumstances (in some
cases it occurs every 30-60 minutes; on other cases, only when the CPU
is about to melt itself into slag, or when there are ECC errors, etc.)
There is a huge difference between a system which stops the OS on all
CPU's dead in its tracks for milliseconds once every 45 minutes,
versus one which only triggers an SMI in extreme situations when the
hardware is about to destroy itself.
I will also note that for some applications (i.e., military hardware
running under battle conditions), where it might be that running the
hardware beyond its thermal limits might actually be *desirable*.
After all, an extra 15 minutes of running beyond thermal limits that
eventually causes the CPU to get flakey might be worth it if the
alternative is the ship getting sunk because the BIOS decided that
shutting down the CPU to save it from thermal damage was more
important than say, running the anti-aircraft guns....
- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists