Message-ID: <4CE69B9B.7020302@cs.columbia.edu>
Date: Sat, 20 Nov 2010 13:05:15 -0500 (EST)
From: Oren Laadan <orenl@...columbia.edu>
To: Tejun Heo <tj@...nel.org>
cc: Serge Hallyn <serge.hallyn@...onical.com>,
Kapil Arya <kapil@....neu.edu>,
Gene Cooperman <gene@....neu.edu>,
linux-kernel@...r.kernel.org, xemul@...ru,
"Eric W. Biederman" <ebiederm@...ssion.com>,
Linux Containers <containers@...ts.osdl.org>
Subject: Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch
Hi,
Based on a discussion with Gene, I'd like to clarify the key points and
differences between the kernel and userspace approaches (specifically
linux-cr and dmtcp). To break up the long post, it comes in three parts:
part I: perspective on the types and scopes of c/r under discussion
part II: linux-cr design and objectives
part III: comparison of kernel/userspace approaches
[now relax, grab (another) cup of coffee and read on...]
PART I: ==PERSPECTIVE==
A rough classification of c/r categories:
* container-c/r: an important use-case, e.g. c/r and migration of an
application container such as a VPS (virtual private server), VDI
(desktop) or other self-contained application (e.g. an Oracle server).
Here _all_ the relevant processes are included in the checkpoint.
* standalone-c/r: another use-case is standalone-c/r where a set of
processes is checkpointed, but not the entire environment, and then
those processes are restarted in a different "eco-system".
* distributed-c/r: meaning several sets of processes, each running
on a different host. (Each set may be a separate container there).
In container-c/r, the main challenge is to be _reliable_ in the sense
that a restart from a successful checkpoint should always succeed.
In standalone-c/r, the main challenge is that an application resumes
execution after a restart in a possibly _different_ eco-system. Some
applications don't care (e.g. 'bc'). Other applications do care, to
different degrees; for these we need "glue" to pacify the application.
There are generally three types of "glue":
(1) Modify the application or selected libraries to be c/r-aware, and
notify it when restart completes (e.g. the CoCheck MPI library).
(2) Add a userspace helper that will run post-restart to do necessary
trickery (e.g. send a SIGWINCH to 'screen'; mount proper filesystem
at the new host after migration; reconnect a socket to a peer).
(3) Use interposition on selected library calls and add wrapper code
that will glue in what's missing (e.g. dbus or nscd calls to
reconnect an application to those services).
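To make glue type (2) concrete, here is a minimal sketch of a post-restart
helper that sends SIGWINCH to the restarted processes (the 'screen' case
above). It is an illustration only and not part of linux-cr or dmtcp: the
function name notify_restarted is hypothetical, and how the helper learns
the pids of the restarted processes is assumed to come from whatever restart
tool is in use.

```python
import os
import signal


def notify_restarted(pids, sig=signal.SIGWINCH):
    """Send `sig` to each restarted process; return the pids that were reached.

    Hypothetical helper: `pids` is assumed to be supplied by the restart tool.
    """
    reached = []
    for pid in pids:
        try:
            os.kill(pid, sig)
        except (ProcessLookupError, PermissionError):
            # The process is already gone, or we may not signal it; skip it.
            continue
        reached.append(pid)
    return reached
```

A helper like this would run once after restart completes, e.g.
notify_restarted([pid_of_screen]), and is independent of whether the
checkpoint itself was taken in the kernel or in userspace.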
IMPORTANT: the gluing method is _orthogonal_ to how the c/r is done!
We are strictly discussing the core c/r functionality.
(next part: linux-cr philosophy...)
Thanks,
Oren.