[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4BE3B932.9060608@cs.columbia.edu>
Date: Fri, 07 May 2010 02:54:42 -0400
From: Oren Laadan <orenl@...columbia.edu>
To: Randy Dunlap <randy.dunlap@...cle.com>
CC: Andrew Morton <akpm@...ux-foundation.org>,
containers@...ts.linux-foundation.org,
linux-kernel@...r.kernel.org, Serge Hallyn <serue@...ibm.com>,
Matt Helsley <matthltc@...ibm.com>,
Pavel Emelyanov <xemul@...nvz.org>, linux-api@...r.kernel.org,
linux-mm@...ck.org, linux-fsdevel@...r.kernel.org,
netdev@...r.kernel.org, Dave Hansen <dave@...ux.vnet.ibm.com>
Subject: Re: [PATCH v21 020/100] c/r: documentation
Thanks for reading carefully through and pointing out
glitches and inconsistencies. I'll fix it for next post.
Oren.
On 05/06/2010 04:27 PM, Randy Dunlap wrote:
> On Sat, 1 May 2010 10:15:02 -0400 Oren Laadan wrote:
>
>> Covers application checkpoint/restart, overall design, interfaces,
>> usage, shared objects, and and checkpoint image format.
>>
>> Signed-off-by: Oren Laadan <orenl@...columbia.edu>
>> Signed-off-by: Dave Hansen <dave@...ux.vnet.ibm.com>
>> Acked-by: Serge E. Hallyn <serue@...ibm.com>
>> Tested-by: Serge E. Hallyn <serue@...ibm.com>
>> ---
>> Documentation/checkpoint/checkpoint.c | 38 +++
>> Documentation/checkpoint/readme.txt | 370 ++++++++++++++++++++++++++++
>> Documentation/checkpoint/self_checkpoint.c | 69 +++++
>> Documentation/checkpoint/self_restart.c | 40 +++
>> Documentation/checkpoint/usage.txt | 247 +++++++++++++++++++
>> 5 files changed, 764 insertions(+), 0 deletions(-)
>> create mode 100644 Documentation/checkpoint/checkpoint.c
>> create mode 100644 Documentation/checkpoint/readme.txt
>> create mode 100644 Documentation/checkpoint/self_checkpoint.c
>> create mode 100644 Documentation/checkpoint/self_restart.c
>> create mode 100644 Documentation/checkpoint/usage.txt
>
>> diff --git a/Documentation/checkpoint/readme.txt b/Documentation/checkpoint/readme.txt
>> new file mode 100644
>> index 0000000..4fa5560
>> --- /dev/null
>> +++ b/Documentation/checkpoint/readme.txt
>> @@ -0,0 +1,370 @@
>> +
> ...
>> +In contrast, when checkpointing a subtree of a container it is up to
>> +the user to ensure that dependencies either don't exist or can be
>> +safely ignored. This is useful, for instance, for HPC scenarios or
>> +even a user that would like to periodically checkpoint a long-running
>
> who
>
>> +batch job.
>> +
> ...
>
>> +
>> +Checkpoint image format
>> +=======================
>> +
> ...
>
>> +
>> +The container configuration section containers information that is
>
> contains
>
>> +global to the container. Security (LSM) configuration is one example.
>> +Network configuration and container-wide mounts may also go here, so
>> +that the userspace restart coordinator can re-create a suitable
>> +environment.
>> +
> ...
>
>> +
>> +Then the state of all tasks is saved, in the order that they appear in
>> +the tasks array above. For each state, we save data like task_struct,
>> +namespaces, open files, memory layout, memory contents, cpu state,
>
> CPU (throughout, please)
>
>> +signals and signal handlers, etc. For resources that are shared among
>> +multiple processes, we first checkpoint said resource (and only once),
>> +and in the task data we give a reference to it. More about shared
>> +resources below.
>> +
> ...
>
>> +
>> +Shared objects
>> +==============
>> +
>> +Many resources may be shared by multiple tasks (e.g. file descriptors,
>> +memory address space, etc), or even have multiple references from
>
> etc.),
>
>> +other resources (e.g. a single inode that represents two ends of a
>> +pipe).
>> +
> ...
>
>> +Memory contents format
>> +======================
>> +
>> +The memory contents of a given memory address space (->mm) is dumped
>
> are (I think)
>
>> +as a sequence of vma objects, represented by 'struct ckpt_hdr_vma'.
>> +This header details the vma properties, and a reference to a file
>> +(if file backed) or an inode (or shared memory) object.
>> +
>> +The vma header is followed by the actual contents - but only those
>> +pages that need to be saved, i.e. dirty pages. They are written in
>> +chunks of data, where each chunks contains a header that indicates
>
> chunk
>
>> +that number of pages in the chunk, followed by an array of virtual
>
> the
>
>> +addresses and then an array of actual page contents. The last chunk
>> +holds zero pages.
>> +
> ...
>
>> +Kernel interfaces
>> +=================
>> +
>> +* To checkpoint a vma, the 'struct vm_operations_struct' needs to
>> + provide a method ->checkpoint:
>> + int checkpoint(struct ckpt_ctx *, struct vma_struct *)
>> + Restart requires a matching (exported) restore:
>> + int restore(struct ckpt_ctx *, struct mm_struct *, struct ckpt_hdr_vma *)
>> +
>> +* To checkpoint a file, the 'struct file_operations' needs to provide
>> + the methods ->checkpoint and ->collect:
>> + int checkpoint(struct ckpt_ctx *, struct file *)
>> + int collect(struct ckpt_ctx *, struct file *)
>> + Restart requires a matching (exported) restore:
>> + int restore(struct ckpt_ctx *, struct ckpt_hdr_file *)
>> + For most file systems, generic_file_{checkpoint,restore}() can be
>> + used.
>> +
>> +* To checkpoint a socket, the 'struct proto_ops' needs to provide
>
> To checkpoint/restart a socket,
>
>> + the methods ->checkpoint, ->collect and ->restore:
>> + int checkpoint(struct ckpt_ctx *ctx, struct socket *sock);
>> + int collect(struct ckpt_ctx *ctx, struct socket *sock);
>> + int restore(struct ckpt_ctx *, struct socket *sock, struct ckpt_hdr_socket *h)
>
>
>> diff --git a/Documentation/checkpoint/usage.txt b/Documentation/checkpoint/usage.txt
>> new file mode 100644
>> index 0000000..c6fc045
>> --- /dev/null
>> +++ b/Documentation/checkpoint/usage.txt
>> @@ -0,0 +1,247 @@
>> +
>> + How to use Checkpoint-Restart
>> + =========================================
>> +
>> +
>> +API
>> +===
>> +
>> +The API consists of three new system calls:
>> +
>> +* long checkpoint(pid_t pid, int fd, unsigned long flag, int logfd);
>
> flags,
>
>> +
>> + Checkpoint a (sub-)container whose root task is identified by @pid,
>> + to the open file indicated by @fd. If @logfd isn't -1, it indicates
>> + an open file to which error and debug messages are written. @flags
>> + may be one or more of:
>> + - CHECKPOINT_SUBTREE : allow checkpoint of sub-container
>> + (other value are not allowed).
>> +
>> + Returns: a positive checkpoint identifier (ckptid) upon success, 0 if
>> + it returns from a restart, and -1 if an error occurs. The ckptid will
>> + uniquely identify a checkpoint image, for as long as the checkpoint
>> + is kept in the kernel (e.g. if one wishes to keep a checkpoint, or a
>> + partial checkpoint, residing in kernel memory).
>> +
>> +* long sys_restart(pid_t pid, int fd, unsigned long flags, int logfd);
>> +
>> + Restart a process hierarchy from a checkpoint image that is read from
>> + the blob stored in the file indicated by @fd. If @logfd isn't -1, it
>> + indicates an open file to which error and debug messages are written.
>> + @flags will have future meaning (must be 0 for now). @pid indicates
>> + the root of the hierarchy as seen in the coordinator's pid-namespace,
>> + and is expected to be a child of the coordinator. @flags may be one
>> + or more of:
>> + - RESTART_TASKSELF : (self) restart of a single process
>> + - RESTART_FROEZN : processes remain frozen once restart completes
>
> FROZEN ?
>
>> + - RESTART_GHOST : process is a ghost (placeholder for a pid)
>
> about @flags: Above says both of these:
> a) @flags will have future meaning (must be 0 for now)
> b) @flags may be one or more of:
>
> so please decide which one it is ;)
>
>> + (Note that this argument may mean 'ckptid' to identify an in-kernel
>> + checkpoint image, with some @flags in the future).
>> +
>> + Returns: -1 if an error occurs, 0 on success when restarting from a
>> + "self" checkpoint, and return value of system call at the time of the
>> + checkpoint when restarting from an "external" checkpoint.
>> +
> ...
>> +
>> +Sysctl/proc
>> +===========
>> +
>> +/proc/sys/kernel/ckpt_unpriv_allowed [default = 1]
>> + controls whether c/r operation is allowed for unprivileged users
>
> C/R
>
>> +
>> +
>> +Operation
>> +=========
>> +
>> +The granularity of a checkpoint usually is a process hierarchy. The
>> +'pid' argument is interpreted in the caller's pid namespace. So to
>> +checkpoint a container whose init task (pid 1 in that pidns) appears
>> +as pid 3497 the caller's pidns, the caller must use pid 3497. Passing
>> +pid 1 will attempt to checkpoint the caller's container, and if the
>> +caller isn't privileged and init is owned by root, it will fail.
>> +
>> +Unless the CHECKPOINT_SUBTREE flag is set, if the caller passes a pid
>> +which does not refer to a container's init task, then sys_checkpoint()
>> +would return -EINVAL.
>
> returns -EINVAL.
>
> ...
>
>> +
>> +
>> +User tools
>> +==========
>> +
>> +* checkpoint(1): a tool to perform a checkpoint of a container/subtree
>> +* restart(1): a tool to restart a container/subtree
>> +* ckptinfo: a tool to examine a checkpoint image
>> +
>> +It is best to use the dedicated user tools for checkpoint and restart.
>> +
>> +If you insist, then here is a code snippet that illustrates how a
>> +checkpoint is initiated by a process inside a container - the logic is
>> +similar to fork():
>> + ...
>> + ckptid = checkpoint(0, ...);
>> + switch (crid) {
>
> (ckptid) ?
>
>> + case -1:
>> + perror("checkpoint failed");
>> + break;
>> + default:
>> + fprintf(stderr, "checkpoint succeeded, CRID=%d\n", ret);
>
> s/ret/ckptid/ ?
>
>> + /* proceed with execution after checkpoint */
>> + ...
>> + break;
>> + case 0:
>> + fprintf(stderr, "returned after restart\n");
>> + /* proceed with action required following a restart */
>> + ...
>> + break;
>> + }
>> + ...
>> +
>> +And to initiate a restart, the process in an empty container can use
>> +logic similar to execve():
>> + ...
>> + if (restart(pid, ...) < 0)
>> + perror("restart failed");
>> + /* only get here if restart failed */
>> + ...
>> +
>> +Note, that the code also supports "self" checkpoint, where a process
>
> Note that
>
>> +can checkpoint itself. This mode does not capture the relationships of
>> +the task with other tasks, or any shared resources. It is useful for
>> +application that wish to be able to save and restore their state.
>
> applications
>
>> +They will either not use (or care about) shared resources, or they
>> +will be aware of the operations and adapt suitably after a restart.
>> +The code above can also be used for "self" checkpoint.
>> +
>> +
>> +You may find the following sample programs useful:
>> +
>> +* checkpoint.c: accepts a 'pid' and checkpoint that task to stdout
>
> checkpoints
>
>> +* self_checkpoint.c: a simple test program doing self-checkpoint
>> +* self_restart.c: restarts a (self-) checkpoint image from stdin
>> +
>> +See also the utilities 'checkpoint' and 'restart' (from user-cr).
>> +
>> +
>> +"External" checkpoint
>> +=====================
>> +
>> +To do "external" checkpoint, you need to first freeze that other task
>> +either using the freezer cgroup.
>
> eh? cannot parse that.
>
>> +
>> +Restart does not preserve the original PID yet, (because we haven't
>> +solved yet the fork-with-specific-pid issue). In a real scenario, you
>> +probably want to first create a new names space, and have the init
>
> namespace,
>
>> +task there call 'sys_restart()'.
>> +
>> +I tested it this way:
>
> ...
>
> ---
> ~Randy
> *** Remember to use Documentation/SubmitChecklist when testing your code ***
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists