Message-ID: <AANLkTikgPKeZUih6_12JWSNy=Vsz2UP9HR7wHOz4q6ST@mail.gmail.com>
Date:	Thu, 4 Nov 2010 23:55:02 -0400
From:	Kapil Arya <kapil@....neu.edu>
To:	Oren Laadan <orenl@...columbia.edu>
Cc:	Tejun Heo <tj@...nel.org>,
	ksummit-2010-discuss@...ts.linux-foundation.org,
	linux-kernel@...r.kernel.org, Gene Cooperman <gene@....neu.edu>,
	Kapil Arya <kapil@....neu.edu>
Subject: Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch

(Sorry for the length of this email, we are excited about being able
to discuss technical details.)

It is wonderful to have this exchange of techniques and visions.  Oren, we
are guessing that you are at Columbia. If so, we would love to have you come up
here and give a talk in Boston. Alternatively, if you prefer, we would be happy
to go to Columbia and give a talk there.

In comparing functionality, one recent bug we had to overcome was with screen
with a hardstatus line and a scroll region for the terminal. We eventually
solved it in a subtle way by sending SIGWINCH, and then lying to screen about
changing the kernel window size, and then sending screen another SIGWINCH while
telling it the true window size. We were pleased to see that Linux C/R also
supports screen and we are curious how it handles this issue of restoring the
scroll region in the X11 terminal window. Thanks.

Oren noted that sometimes it's important to stop the process only for a few
milliseconds while one checkpoints. In DMTCP, we do that by configuring with
--enable-forked-checkpointing. This causes us to fork a child process taking
advantage of copy-on-write and then checkpoint the memory pages of the child
while the parent continues to execute.
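
For readers who want to see the general fork-and-write idea in code, here is a
minimal sketch. It is not DMTCP's actual implementation: the helper names
forked_checkpoint() and wait_for_checkpoint() are hypothetical, and the sketch
saves only one caller-supplied buffer rather than a whole address space.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: write `len` bytes starting at `buf` to `path` from a
 * forked child. The parent gets the child's pid back immediately (or -1 if
 * fork() failed) and can keep running while the child writes. */
pid_t forked_checkpoint(const void *buf, size_t len, const char *path)
{
    pid_t pid = fork();
    if (pid != 0)
        return pid;                 /* parent, or fork() failure (-1) */

    /* Child: thanks to copy-on-write it sees memory as of the fork(),
     * no matter what the parent modifies afterwards. */
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        _exit(1);

    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;           /* interrupted: retry the write */
            _exit(1);
        }
        p += n;
        len -= (size_t)n;
    }
    close(fd);
    _exit(0);                       /* never return into the parent's code */
}

/* Hypothetical helper: reap the child; 0 means the image was fully written. */
int wait_for_checkpoint(pid_t pid)
{
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

The parent is paused only for the duration of fork() itself; the disk write
happens in the child, and pages are physically copied only if the parent
touches them before the child has finished.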

> So a checkpoint will typically capture the state of e.g. a VNC server (X
> session) and the applications (xterm, win-manager etc), and the dbus daemon,
> and all their open files, and sockets etc.

This is a good example of distinct approaches when starting from Kernel C/R or
user-space C/R. We currently checkpoint VNC servers in a way similar to Linux
C/R. However, in the next few months, we want to directly checkpoint a single
X-windows application without the X11-server. The approach is easily understood
by analogy. Currently libc.so talks to the kernel. At checkpoint time, we
interrogate the kernel state and then "break" the connection to the kernel and
checkpoint. Similarly, libX11.so (or libX11-xcb.so) talks to the X11-server. At
checkpoint time, we will interrogate the state of the X11-server and then break
the connection and checkpoint.

> DMTCP is indeed a very cool project. ...  It is not my intention to bash
> their great work, but it's important to understand its limitations, so just a
> few examples:

Thanks very much for bringing up these implementation questions. It's wonderful
to have someone interested in the low-level technology to talk to. We would
like to share with you our current solutions and our plans for the future. We
will also add some of our questions about Linux C/R inline. Thanks in advance
for the answers.

> required to link against their library, or modify the binary;

We currently use LD_PRELOAD to transparently preload our library. The user
doesn't see this. If the application is statically linked, then this doesn't
work. Until now, we haven't seen user requests to support statically linked
applications. If we do, there are other techniques to modify the call sites or
entry points for libc routines within the user binary.

> They overload some signals (so the application can't use them)

By default, DMTCP uses SIGUSR2. At process startup, the user can change the
DMTCP signal by specifying:
    dmtcp_checkpoint --mtcp-checkpoint-signal <signum> a.out
As an additional point that we found interesting, libc has a similar policy of
hardwiring several signals:
    #define SIGCANCEL	__SIGRTMIN
    #define SIGTIMER	SIGCANCEL
    #define SIGSETXID	(__SIGRTMIN + 1)
So there is a precedent for this approach.

> Completeness: many real resources are not supported, e.g. eventpoll, ipc,
> pending signals, etc.

IPC and pending signals are supported. We know how to do eventpoll but haven't
encountered a use case from our userbase and so haven't added it yet.

> * Complexity: they technically implement a virtual pid-namespace in userspace
> by intercepting calls to clone(). I wonder if they consider e.g. pid's saved
> on file owners or in afunix creds ? I'll just say it's nearly impossible with
> their 20K lines of code - I know because I did it in a kernel module ...

We do wrap clone and create a table from original PID/TID to current PID/TID
just as you say. To our knowledge, we have wrappers for all system calls
involving a PID/TID except fcntl. We are guessing that either Linux C/R also
keeps a translation table or else restores the original PID/TID. Which do you
do? In the latter case what do you do if a PID/TID is already used by another
process/thread?
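
To make the translation-table idea concrete, here is a minimal sketch of such a
table. It is an illustration under assumptions, not DMTCP's code: the names
pidmap_set(), pidmap_to_current(), pidmap_to_original() and the fixed capacity
PIDMAP_MAX are all hypothetical, and a real implementation would also need
locking and entry removal.

```c
#include <stddef.h>
#include <sys/types.h>

#define PIDMAP_MAX 1024             /* hypothetical capacity for the sketch */

struct pid_pair {
    pid_t original;                 /* pid the application saw before checkpoint */
    pid_t current;                  /* pid the kernel assigned after restart */
};

static struct pid_pair pidmap[PIDMAP_MAX];
static size_t pidmap_len;

/* Record (or update) the current pid standing in for `original`.
 * Returns 0 on success, -1 if the table is full. */
int pidmap_set(pid_t original, pid_t current)
{
    for (size_t i = 0; i < pidmap_len; i++) {
        if (pidmap[i].original == original) {
            pidmap[i].current = current;
            return 0;
        }
    }
    if (pidmap_len == PIDMAP_MAX)
        return -1;
    pidmap[pidmap_len].original = original;
    pidmap[pidmap_len].current = current;
    pidmap_len++;
    return 0;
}

/* Application-visible pid -> real pid, used before entering the kernel.
 * Pids with no entry are passed through unchanged. */
pid_t pidmap_to_current(pid_t original)
{
    for (size_t i = 0; i < pidmap_len; i++)
        if (pidmap[i].original == original)
            return pidmap[i].current;
    return original;
}

/* Real pid -> application-visible pid, used on values the kernel returns. */
pid_t pidmap_to_original(pid_t current)
{
    for (size_t i = 0; i < pidmap_len; i++)
        if (pidmap[i].current == current)
            return pidmap[i].original;
    return current;
}
```

A wrapper for a call such as kill() would pass its pid argument through
pidmap_to_current() before invoking the real call, while a wrapper for getpid()
or waitpid() would pass the kernel's result through pidmap_to_original().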

> * Efficiency: from userspace it can't tell which mapped pages are dirty and
> which aren't, not to mention doing incremental checkpoints.

One of the DMTCP team, Artem Polyakov, has developed incremental checkpointing
for DMTCP and for BLCR. We are still evaluating it. It's at:
    http://sourceforge.net/projects/hbict

> * Usefulness: can they live-migrate mysql server between two hosts prior to a
> kernel upgrade ?

We have not experimented with live-migration. Live-migration in user space is an
interesting topic but will take us into deep discussion outside of the current
scope. Of course VMware and others already do it. We would enjoy talking further
with you offline. It's certainly a cool use case.

> can they checkpoint stopped processes which cannot cooperate ?

We haven't had a user request for checkpointing stopped processes so far.
However, one can use PTRACE (similar to doing a gdb attach on a stopped
process) to achieve this.

> can they checkpoint/restart postgresql ?

We don't know. We have succeeded on MySQL. We never tried postgresql. What are
the special issues there?

> In contrast, the kernel C/R is:
> ...
> * entirely transparent to applications (does not need their cooperation, can
> even do debugged tasks)

We are not sure what you are referring to by cooperation and debugged tasks. If
it helps, we can say that DMTCP can checkpoint an entire gdb session or just the
process being debugged by gdb, according to the requirements. Our support
for PTRACE is in the unstable branch.

> * is easier to maintain in the long run (because you don't need to cheat
> applications by intercepting their kernel calls from userspace!)

We have to agree to disagree on this one. We see almost no new bugs or issues
with kernel upgrades. The most recent case was the need to add the wrapper for
pipe2 (2.6.27) and accept4 (2.6.28) and each wrapper was about 20 new lines of
code.
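
As a rough illustration of how small such a wrapper can be, here is a sketch of
an interposed pipe2(). It is not the DMTCP wrapper: the bookkeeping helpers
track_fd() and tracked_fd_count() and the capacity MAX_TRACKED are hypothetical
stand-ins for whatever descriptor table a checkpointer keeps, and the sketch
reaches the kernel through a raw syscall() rather than looking up the libc
symbol.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

#define MAX_TRACKED 256             /* hypothetical capacity for the sketch */

static int tracked_fds[MAX_TRACKED];
static int tracked_count;

/* Hypothetical bookkeeping: remember a descriptor so that it could be
 * saved at checkpoint time and recreated at restart. */
static void track_fd(int fd)
{
    if (tracked_count < MAX_TRACKED)
        tracked_fds[tracked_count++] = fd;
}

int tracked_fd_count(void)
{
    return tracked_count;
}

/* Same name and signature as the libc function, so that when this file is
 * built into a preloaded shared object the application's calls land here. */
int pipe2(int fds[2], int flags)
{
    /* Invoke the kernel directly; errno is set by syscall() on failure. */
    int ret = (int)syscall(SYS_pipe2, fds, flags);
    if (ret == 0) {
        track_fd(fds[0]);
        track_fd(fds[1]);
    }
    return ret;
}
```

Assuming the file is saved as wrap.c (a hypothetical name), it could be built
with "gcc -shared -fPIC -o libwrap.so wrap.c" and activated with
"LD_PRELOAD=./libwrap.so a.out".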

> * flexible to allow smart userspace to also be c/r aware, if they so wish

DMTCP also has a dmtcpaware facility by which applications can request
checkpoints for themselves or other processes. It also supports user hook
functions for checkpoint, resume, and restart.

> * can provide a guarantee that a checkpoint is self-contained and can be
> later restarted

Could you tell us more about what you mean by guarantee and self-contained?

> In fact, DMTCP will be much more useful if it builds on linux-cr as its
> chekcpoint-restart engine ;)

Your suggestion is an interesting one. One of our team members, Jason Ansel, has
made the same suggestion with respect to BLCR. This would be a great experiment
to try and we would be glad to work with you to get an initial version of DMTCP
on top of Linux C/R. DMTCP has a higher layer dmtcphijack.so and a lower layer
libmtcp.so (MTCP) which can be replaced by a modified single process
checkpointer with hooks for dmtcphijack.so. Unfortunately, our group doesn't
have the resources to maintain and develop two branches: DMTCP/MTCP and
DMTCP/Linux C/R. Nevertheless, if you were interested in going forward on the
DMTCP/Linux C/R branch, we could share code and ideas.

> Actually, because of the huge optimization potential that exists only in
> kernel based C/R, the HPC applications are likely to benefit tremendously too
> from it. Think about things like incremental checkpoint, pre-copy to minimize
> downtime (like live-migration), using COW to defer disk IO until after the
> application can resume execution, and more. None of these is possible with
> userspace C/R.

BLCR is a kernel-based C/R package, and appears to be the current standard for
HPC. Are you saying that BLCR should be replaced by Linux C/R? If so, why?

Concerning user space C/R, please see our comments above.

> I know of several places that do not use C/R because they can't stop their
> long running processes for longer than a few milliseconds.  I know how to
> solve their problems with linux-cr. I doubt if any userspace mechanism can
> get there.

DMTCP supports forked checkpointing as a configure option. A child is forked
using COW and it writes its memory to disk at leisure.


                                                 Thanks,
                                                 Gene Cooperman and Kapil Arya