Message-ID: <49C1435B.1090809@google.com>
Date: Wed, 18 Mar 2009 11:54:19 -0700
From: Mike Waychison <mikew@...gle.com>
To: Oren Laadan <orenl@...columbia.edu>
CC: Linus Torvalds <torvalds@...ux-foundation.org>,
Andrew Morton <akpm@...ux-foundation.org>,
linux-api@...r.kernel.org, containers@...ts.linux-foundation.org,
hpa@...or.com, linux-kernel@...r.kernel.org,
Dave Hansen <dave@...ux.vnet.ibm.com>, linux-mm@...ck.org,
viro@...iv.linux.org.uk, mingo@...e.hu, mpm@...enic.com,
tglx@...utronix.de,
Sukadev Bhattiprolu <sukadev@...ux.vnet.ibm.com>,
Alexey Dobriyan <adobriyan@...il.com>, xemul@...nvz.org
Subject: Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ
do?
Oren Laadan wrote:
>
> Mike Waychison wrote:
>> Linus Torvalds wrote:
>>> On Thu, 12 Mar 2009, Sukadev Bhattiprolu wrote:
>>>
>>>> Ying Han [yinghan@...gle.com] wrote:
>>>> | Hi Serge:
>>>> | I made a patch based on Oren's tree recently which implement a new
>>>> | syscall clone_with_pid. I tested with checkpoint/restart process tree
>>>> | and it works as expected.
>>>>
>>>> Yes, I think we had a version of clone() with pid a while ago.
>>> Are people _at_all_ thinking about security?
>>>
>>> Obviously not.
>>>
>>> There's no way we can do anything like this. Sure, it's trivial to do
>>> inside the kernel. But it also sounds like a _wonderful_ attack vector
>>> against badly written user-land software that sends signals and has small
>>> races.
>> I'm not really sure how, from a security point of view, this is different
>> from a malicious app going off and spawning thousands of threads in an
>> attempt to hit a target pid. Sure, it makes it easier, but it's not like
>> there is anything in place to close that attack vector.
>>
>>> Quite frankly, from having followed the discussion(s) over the last few
>>> weeks about checkpoint/restart in various forms, my reaction to just about
>>> _all_ of this is that people pushing this are pretty damn borderline.
>>>
>>> I think you guys are working on all the wrong problems.
>>>
>>> Let's face it, we're not going to _ever_ checkpoint any kind of general
>>> case process. Just TCP makes that fundamentally impossible in the general
>>> case, and there are lots and lots of other cases too (just something as
>>> totally _trivial_ as all the files in the filesystem that don't get rolled
>>> back).
>> In some instances such as ours, TCP is probably the easiest thing to
>> migrate. In an RPC-based cluster application, TCP is nothing more than
>> an RPC channel and applications already have to handle RPC channel
>> failure and re-establishment.
>>
>> I agree, however, that this is not the 'general case' you mention above.
>> This is the bit that sorta bothers me about the way the
>> implementation has been going so far on this list. The implementation
>> that folks are building on top of Oren's patchset tries to be everything
>> to everybody. For our purposes, we need to have the flexibility of
>> choosing *how* we checkpoint. The line seems to be arbitrarily drawn at
>> the kernel being responsible for checkpointing and restoring all
>> resources associated with a task, and leaving userland with nothing more
>> than transporting filesystem bits. This approach isn't flexible enough:
>> Consider the case where we want to stub out most of the TCP file
>> descriptors with ECONNRESETed sockets because we know that they are RPC
>> sockets and can re-establish themselves, but we want to use some other
>> mechanism for TCP sockets we don't know much about. The current
>> monolithic approach has zero flexibility for doing anything like this,
>> and I can't see how we could even fit anything like this in.
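
To make the ECONNRESET idea above concrete, here is a rough sketch (purely
illustrative, not against any tree) of how user space could manufacture a
TCP fd that already looks reset, using nothing but the standard sockets
API; the restarted RPC client would simply see a dead channel and re-dial
it. Error handling is omitted for brevity:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static int make_reset_tcp_fd(void)
{
	struct sockaddr_in addr;
	socklen_t alen = sizeof(addr);
	struct linger lg = { .l_onoff = 1, .l_linger = 0 };
	int lsn, peer, fd;

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	addr.sin_port = 0;			/* any free port */

	lsn = socket(AF_INET, SOCK_STREAM, 0);
	bind(lsn, (struct sockaddr *)&addr, sizeof(addr));
	listen(lsn, 1);
	getsockname(lsn, (struct sockaddr *)&addr, &alen);

	fd = socket(AF_INET, SOCK_STREAM, 0);
	connect(fd, (struct sockaddr *)&addr, sizeof(addr));
	peer = accept(lsn, NULL, NULL);

	/* Zero-timeout SO_LINGER makes close() send an RST, so the
	 * surviving end fails with ECONNRESET on its next read. */
	setsockopt(peer, SOL_SOCKET, SO_LINGER, &lg, sizeof(lg));
	close(peer);
	close(lsn);

	return fd;
}
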
>
> The flexibility exists, but wasn't spelled out, so here it is:
>
> 1) Similar to madvise(), I envision a cradvice() that could tell the c/r
> something about specific resources, e.g.:
> * cradvice(CR_ADV_MEM, ptr, len) -> don't save that memory, it's scratch
> * cradvice(CR_ADV_SOCK, fd, CR_ADV_SOCK_RESET) -> reset connection on restart
> etc .. (nevermind the exact interface right now)
>
> 2) Tasks can ask to be notified (e.g. register for a signal) when a checkpoint
> or a restart completes successfully. At that time they can do their private
> house-keeping if they know better.
>
> 3) If restoring some resource is significantly easier in user space (e.g. a
> file-descriptor of some special device which user space knows how to
> re-initialize), then the restarting task can prepare it ahead of time,
> and call:
> * cradvice(CR_ADV_USERFD, fd, 0) -> use the fd in place instead of trying
> to restore it yourself.
This would be called by the embryo process (mktree.c?) before calling
sys_restart?
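
Purely to illustrate the flow I have in mind, and assuming for the sake of
argument that cradvice() existed with the shape sketched above (it doesn't,
and the sys_restart() prototype below is only a stand-in for whatever the
patchset actually provides), the restarting side could look roughly like:

/*
 * Hypothetical sketch only: cradvice(), the CR_ADV_USERFD value and this
 * sys_restart() prototype are assumptions made up for illustration, not
 * real interfaces.
 */
extern long cradvice(int advice, long arg1, long arg2);
extern long sys_restart(int crid, int fd, unsigned long flags);

#define CR_ADV_USERFD	3	/* made-up value */

static long restart_with_user_fd(int crid, int checkpoint_fd)
{
	/*
	 * Re-create the tricky resource in user space first, e.g. the
	 * reset TCP fd from the sketch earlier in this mail, or a
	 * re-opened special device.
	 */
	int fd = make_reset_tcp_fd();

	/*
	 * Advise the restart code to splice this fd in as-is instead of
	 * trying to reconstruct the original socket from the image.
	 */
	cradvice(CR_ADV_USERFD, fd, 0);

	/* Then let the kernel restore everything else. */
	return sys_restart(crid, checkpoint_fd, 0);
}

That kind of split, kernel for the common state and user space for the bits
it knows better, would cover our RPC-socket case.
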
>
> Method #3 is what I used in Zap to implement distributed checkpoints, where
> it is so much easier to recreate all network connections in user space than
> to put that logic into the kernel.
>
> Now, on the other hand, doing the c/r from userland is much less flexible
> than in the kernel (e.g. epollfd, futex state and much more) and requires
> exposing a tremendous amount of in-kernel data to user space. And we all
> know that exposing internals is always a one-way ticket :(
>
> [...]
>
> Oren.
>
>