[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4A6F306A.40303@librato.com>
Date: Tue, 28 Jul 2009 13:07:54 -0400
From: Oren Laadan <orenl@...rato.com>
To: Dan Smith <danms@...ibm.com>
CC: John Dykstra <john.dykstra1@...il.com>, containers@...ts.osdl.org,
netdev@...r.kernel.org, Alexey Dobriyan <adobriyan@...il.com>
Subject: Re: [PATCH 2/2] c/r: Add AF_INET support (v3)
Dan Smith wrote:
> JD> I don't know whether this would affect palatibility (sic) for the
> JD> mainline, but it bothers me... socket.h (and sock.h) are included
> JD> all over the place in the kernel, and this definition is only
> JD> needed in very limited locations. Can it be placed in a .h used
> JD> only by checkpoint code?
>
> This was moved here based on previous comments on the patch.
> Personally, I think socket.h is a little too wide. While iterating on
> this patch locally, I've discovered that socket.h won't really work
> because in6.h includes it early and thus I'm unable to include some of
> the address structures as a result. I think going back to a
> checkpoint-specific header would be helpful.
>
> JD> There's an interesting design choice re TIME_WAIT and similar
> JD> states. In a migration scenario, do those sks migrate? If the
> JD> tree isn't going to be restored for a while., you want the
> JD> original host to continue to respond to those four-tuples If the
> JD> IP address of the original is immediately migrated to another
> JD> machine, you want the TIME_WAIT sks to migrate too.
>
> Well, as far as I'm concerned, the only sane migration scenario is one
> where you migrate the IP address and virtual interface with the
> task. When you destroy the virtual interface after (or during) the
> migration, the TIME_WAIT socket will disappear from the sending host.
Note that such sockets are unlikely to be referenced by _any_ process
because they will have been closed already. So the checkpoint code
will need to find such sockets not through file descriptors.
>
> I think that we need to increment timers like this on restore anyway,
> which should make sure that the TIME_WAIT timers run for the same
> amount of wall-clock time regardless of how much time was spent in the
> process of migration, right?
>
> Since checkpoint is not aware of a potential migration, a regular
> (i.e. not intended for migration) checkpoint operation on a task
> running in the init netns will leave the TIME_WAIT socket in place
> until the timer expires.
>
>>> +static int sock_inet_restart(struct ckpt_ctx *ctx,
>>> + struct ckpt_hdr_socket *h,
>>> + struct socket *socket)
>>> +{
>>> + int ret;
>>> + struct ckpt_hdr_socket_inet *in;
>>> + struct sockaddr_in *l = (struct sockaddr_in *)&h->laddr;
>>> +
>>> + in = ckpt_read_obj_type(ctx, sizeof(*in), CKPT_HDR_SOCKET_INET);
>>> + if (IS_ERR(in))
>>> + return PTR_ERR(in);
>>> +
>>> + /* Listening sockets and those that are closed but have a local
>>> + * address need to call bind()
>>> + */
>>> + if ((h->sock.state == TCP_LISTEN) ||
>>> + ((h->sock.state == TCP_CLOSE) && (h->laddr_len > 0))) {
>>> + socket->sk->sk_reuse = 2;
>>> + inet_sk(socket->sk)->freebind = 1;
>>> + ret = socket->ops->bind(socket,
>>> + (struct sockaddr *)l,
>>> + h->laddr_len);
>>> + ckpt_debug("inet bind: %i", ret);
>>> + if (ret < 0)
>>> + goto out;
>>> +
>>> + if (h->sock.state == TCP_LISTEN) {
>>> + ret = socket->ops->listen(socket, h->sock.backlog);
>>> + ckpt_debug("inet listen: %i", ret);
>>> + if (ret < 0)
>>> + goto out;
>>> +
>>> + ret = sock_add_parent(ctx, socket->sk);
>>> + if (ret < 0)
>>> + goto out;
>
> JD> So this is just dummied off as a proof-of-concept for LISTEN?
>
> I'm not sure what you mean...
>
>>> + }
>>> + } else {
>>> + ret = sock_cptrst(ctx, socket->sk, h, CKPT_RST);
>>> + if (ret)
>>> + goto out;
>>> +
>>> + ret = sock_inet_cptrst(ctx, socket->sk, in, CKPT_RST);
>>> + if (ret)
>>> + goto out;
>
> JD> At a minimum, you'll want to start the TCP retransmit timer if there is
> JD> unacknowledged data outstanding. And other timers for other states, as
> JD> they're supported.
>
> JD> And you probably do want to do slow-start again--disregard my babbling
> JD> from yesterday.
>
> Heh, okay :)
>
> JD> This is part of your AF_UNIX patch:
>
> JD> struct socket *do_sock_file_restore(struct ckpt_ctx *ctx,
> JD> struct ckpt_hdr_socket *h)
> JD> {
> JD> struct socket *socket;
> JD> int ret;
>
> JD> ret = sock_create(h->sock_common.family, h->sock.type, 0, &socket);
> JD> if (ret < 0)
> JD> return ERR_PTR(ret);
>
>
> JD> You _really_ want to pass the actual protocol number to sock_create().
> JD> The size of the sk it creates depends on this. You'll quickly be in
> JD> memory corruption hell without it.
>
> You mean I need to verify that the protocol is one of IPPROTO_TCP,
> IPPROTO_UDP, or PF_UNIX, right?
>
> JD> Also from the AF_UNIX patch:
>
> JD> static int obj_sock_users(void *ptr)
> JD> {
> JD> return atomic_read(&((struct sock *) ptr)->sk_refcnt);
> JD> }
>
> JD> Network sockets also use sk->sk_wmem_alloc to track references to
> JD> the sock from egress skb's
>
> IIUC, this function is part of the framework's leak detection. That
> code looks to make sure objects don't gain any additional users
> between the start and end of the checkpoint operation. I think the
> sk_refcnt satisfies that requirement here, no?
As I commented on patch 5/5 AF_UNIX -- IMO socket objects don't require
leak-detection, since they are 1-to-1 with files, which are already
accounted for. sockets can't be otherwise cloned/inherited/transferred
between processes.
>
> JD> And:
>
> JD> static int sock_copy_buffers(struct sk_buff_head *from, struct
> JD> sk_buff_head *to)
> JD> {
> JD> int count = 0;
> JD> struct sk_buff *skb;
>
> JD> skb_queue_walk(from, skb) {
> JD> struct sk_buff *tmp;
>
> JD> tmp = dev_alloc_skb(skb->len);
> JD> if (!tmp)
> JD> return -ENOMEM;
>
> JD> spin_lock(&from->lock);
> JD> skb_morph(tmp, skb);
> JD> spin_unlock(&from->lock);
>
> JD> Why not just clone the skb?
>
> Okay, good call.
>
> Thanks!
>
Oren.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Powered by blists - more mailing lists