linux-kernel - Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch


Message-ID: <4CE69B9B.7020302@cs.columbia.edu>
Date:	Sat, 20 Nov 2010 13:05:15 -0500 (EST)
From:	Oren Laadan <orenl@...columbia.edu>
To:	Tejun Heo <tj@...nel.org>
cc:	Serge Hallyn <serge.hallyn@...onical.com>,
	Kapil Arya <kapil@....neu.edu>,
	Gene Cooperman <gene@....neu.edu>,
	linux-kernel@...r.kernel.org, xemul@...ru,
	"Eric W. Biederman" <ebiederm@...ssion.com>,
	Linux Containers <containers@...ts.osdl.org>
Subject: Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch

Hi,

Based on discussion with Gene, I'd like to clarify the key points and
differences between the kernel and userspace approaches (specifically
linux-cr and dmtcp). There are three parts, to break up the long post:

part I: perspective on the types and scopes of c/r under discussion
part II: linux-cr design and objectives
part III: comparison of kernel/userspace approaches

[now relax, grab (another) cup of coffee and read on...]

PART I:  ==PERSPECTIVE==

A rough classification of c/r categories:

* container-c/r: important use-case, e.g. c/r and migration of
  application containers such as a VPS (virtual private server), VDI
  (desktop) or other self-contained application (e.g. Oracle server).
  Here _all_ the relevant processes are included in the checkpoint.

* standalone-c/r: another use-case is standalone-c/r where a set of
  processes is checkpointed, but not the entire environment, and then
  those processes are restarted in a different "eco-system".

* distributed-c/r: meaning several sets of processes, each running
  on a different host. (Each set may be a separate container there).

In container-c/r, the main challenge is to be _reliable_ in the sense
that a restart from a successful checkpoint should always succeed.

In standalone-c/r, the main challenge is that an application resumes
execution after a restart in a possibly _different_ eco-system. Some
applications don't care (e.g. 'bc'). Other applications do care, and to
different degrees; for these we need "glue" to pacify the application.

There are generally three types of "glue":

(1) Modify the application or selected libraries to be c/r-aware, and
  notify it when restart completes. (e.g. CoCheck MPI library).
(2) Add a userspace helper that will run post-restart to do necessary
  trickery (e.g. send a SIGWINCH to 'screen'; mount the proper filesystem
  at the new host after migration; reconnect a socket to a peer).
(3) Use interposition on selected library calls and add wrapper code
  that will glue in what's missing (e.g. dbus or nscd calls to
  reconnect an application to those services).
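
To make glue type (2) concrete, here is a minimal sketch of a
post-restart helper that nudges restarted processes with SIGWINCH,
as in the 'screen' example. It is an illustration only, not part of
linux-cr or dmtcp: the function name notify_restarted and the idea of
handing it a list of pids are assumptions made for this example.

```python
import os
import signal


def notify_restarted(pids, sig=signal.SIGWINCH):
    """Send `sig` to every pid in `pids` (hypothetical helper).

    Returns the list of pids that could not be signalled, so the
    caller can decide whether the restart needs further attention.
    """
    failed = []
    for pid in pids:
        try:
            # os.kill() only delivers the signal; what the target does
            # with it (e.g. redraw its terminal) is up to the target.
            os.kill(pid, sig)
        except (ProcessLookupError, PermissionError):
            # Process is gone, or we are not allowed to signal it.
            failed.append(pid)
    return failed
```

A helper like this would run once the restart has completed, with the
pids of the restarted processes. Passing sig=0 only probes whether each
pid can be signalled, without delivering anything.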

IMPORTANT: the gluing method is _orthogonal_ to how the c/r is done!
We are strictly discussing the core c/r functionality.

(next part: linux-cr philosophy...)

Thanks,

Oren.