linux-kernel - Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?

Message-ID: <49BDEB28.7080302@cs.columbia.edu>
Date:	Mon, 16 Mar 2009 02:01:12 -0400
From:	Oren Laadan <orenl@...columbia.edu>
To:	Ingo Molnar <mingo@...e.hu>
CC:	Alexey Dobriyan <adobriyan@...il.com>, linux-api@...r.kernel.org,
	containers@...ts.linux-foundation.org, mpm@...enic.com,
	linux-kernel@...r.kernel.org,
	Dave Hansen <dave@...ux.vnet.ibm.com>, linux-mm@...ck.org,
	viro@...iv.linux.org.uk, hpa@...or.com,
	Andrew Morton <akpm@...ux-foundation.org>,
	Sukadev Bhattiprolu <sukadev@...ux.vnet.ibm.com>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	tglx@...utronix.de, xemul@...nvz.org
Subject: Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ
 do?

Ingo Molnar wrote:
> * Alexey Dobriyan <adobriyan@...il.com> wrote:
> 
>> On Fri, Mar 13, 2009 at 02:01:50PM -0700, Linus Torvalds wrote:
>>>
>>> On Fri, 13 Mar 2009, Alexey Dobriyan wrote:
>>>>> Let's face it, we're not going to _ever_ checkpoint any 
>>>>> kind of general case process. Just TCP makes that 
>>>>> fundamentally impossible in the general case, and there 
>>>>> are lots and lots of other cases too (just something as 
>>>>> totally _trivial_ as all the files in the filesystem 
>>>>> that don't get rolled back).
>>>> What do you mean here? Unlinked files?
>>> Or modified files, or anything else. "External state" is a 
>>> pretty damn wide net. It's not just TCP sequence numbers and 
>>> another machine.
>> I think (I think) you're seriously underestimating what's 
>> doable with kernel C/R and what's already done.
>>
>> I was told (haven't seen it myself) that Oracle installations 
>> and Counter Strike servers were moved between boxes just fine.
>>
>> They were run in specially prepared environment of course, but 
>> still.
> 
> That's the kind of stuff i'd like to see happen.
> 
> Right now the main 'enterprise' approach to do 
> migration/consolidation of server contexts is based on hardware 
> virtualization - but that pushes runtime overhead to the native 
> kernel and slows down the guest context as well - massively so.
> 
> Before we've blinked twice it will be a 'required' enterprise 
> feature and enterprise people will measure/benchmark Linux 
> server performance in guest context primarily and we'll have a 
> deep performance pit to dig ourselves out of.
> 
> We can ignore that trend as uninteresting (it is uninteresting 
> in a number of ways because it is partly driven by stupidity), 
> or we can do something about it while still advancing the 
> kernel.
> 
> With containers+checkpointing the code is a lot scarier (we 
> basically do system call virtualization), the environment 
> interactions are a lot wider and thus they are a lot more 
> difficult to handle - but it's all a lot faster as well, and 
> conceptually so. All the runtime overhead is pushed to the 
> checkpointing step - (with some minimal amount of data structure 
> isolation overhead).

It's worthwhile to make the distinction between virtualization and
checkpoint/restart (c/r). Virtualization is about decoupling
applications from the underlying operating system by providing a
private, virtual namespace, that is, containers. Checkpoint/restart
is the ability to save the state of a container so that it can be
restarted later from that point.

The point is that virtualization is *already* part of the kernel
through namespaces (pid, ipc, mounts, etc). This considerable body
of work was eventually merged and is mostly complete, covering most
of the environment interactions. The runtime overhead is negligible.

Seeing that namespaces are now part of the kernel, we now build on
the existing virtualization to allow checkpoint/restart. The code is
not at all scary: record the state on checkpoint, and restore it on
restart. Checkpoint adds no runtime overhead other than the downtime
an application incurs while it is frozen for the duration of the
checkpoint.
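
As a rough illustration of "record the state on checkpoint, and restore
it on restart", here is a toy user-space sketch in Python. It is not the
patchset's code or any kernel interface: ToyTask, checkpoint() and
restart() are hypothetical names, and the "state" is a single counter
standing in for everything a real task carries.

```python
import json


class ToyTask:
    """Hypothetical stand-in for a process: an ID plus some private state."""

    def __init__(self, pid, counter=0):
        self.pid = pid
        self.counter = counter

    def step(self):
        # One unit of "work" that changes the task's state.
        self.counter += 1


def checkpoint(tasks):
    """Record the state of every task as a JSON image.

    The tasks are assumed to be frozen (not running) while this executes.
    """
    return json.dumps([{"pid": t.pid, "counter": t.counter} for t in tasks])


def restart(image):
    """Rebuild the tasks from a checkpoint image."""
    return [ToyTask(r["pid"], r["counter"]) for r in json.loads(image)]


tasks = [ToyTask(1), ToyTask(2)]
for t in tasks:
    t.step()

image = checkpoint(tasks)
tasks[0].step()            # work done after the checkpoint is not in the image
restored = restart(image)
print([(t.pid, t.counter) for t in restored])
```

The only cost the tasks pay is the time spent inside checkpoint(); nothing
is added to step(), which is the sense in which there is no runtime
overhead.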

> 
> I see three conceptual levels of virtualization:
> 
>  - hardware based virtualization, for 'unaware OSs'
> 
>  - system call based virtualization, for 'unaware software'
> 
>  - no virtualization kernel help is needed _at all_ to 
>    checkpoint 'aware' software. We have libraries to checkpoint 
>    'aware' user-space just fine - and had them for a decade.

Checkpoint/restart is almost orthogonal to virtualization (c/r only
needs a way to request a specific resource identifier for resources
that it creates). Therefore, the effort required to allow c/r of
'aware' software is nearly the same as for 'unaware' software.
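
To make "request a specific resource identifier" concrete, here is a
toy sketch in Python. ToyPidNamespace and its alloc() method are
hypothetical and only loosely modelled on a pid namespace; the real
kernel interface is not shown here. The assumption it illustrates is
that a restarted container gets a fresh, private namespace, so the
identifiers saved at checkpoint time are free to be requested again.

```python
class ToyPidNamespace:
    """Hypothetical private ID space, loosely modelled on a pid namespace."""

    def __init__(self):
        self._used = set()

    def alloc(self, requested=None):
        """Return a free ID; if `requested` is given, return exactly that ID."""
        if requested is None:
            # Normal operation: hand out the lowest free ID.
            pid = 1
            while pid in self._used:
                pid += 1
        elif requested in self._used:
            raise ValueError(f"id {requested} is already in use")
        else:
            # Restart path: the caller insists on a specific ID.
            pid = requested
        self._used.add(pid)
        return pid


# Original run: three tasks are created, then the one with ID 2 exits.
original = ToyPidNamespace()
ids = [original.alloc() for _ in range(3)]
saved_ids = [i for i in ids if i != 2]       # what a checkpoint would record

# Restart: a fresh namespace, so the saved IDs cannot collide with anything.
fresh = ToyPidNamespace()
restored_ids = [fresh.alloc(requested=i) for i in saved_ids]
print(restored_ids)
```

Nothing in alloc() depends on whether the software being restarted is
'aware' or 'unaware' of c/r, which is why the two cases cost about the
same.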

IMHO this is the natural next step: make c/r useful and attractive
by making it transparent (support 'unaware' software), complete (cover
nearly all features) and efficient (with low application downtime).

And this is precisely what we aim for with the current patchset.

Oren.
