[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4B92D574.1090006@cs.columbia.edu>
Date: Sat, 06 Mar 2010 17:21:40 -0500
From: Oren Laadan <orenl@...columbia.edu>
To: Dan Smith <danms@...ibm.com>
CC: containers@...ts.osdl.org, den@...nvz.org, netdev@...r.kernel.org,
davem@...emloft.net, ebiederm@...ssion.com, benjamin.thery@...l.net
Subject: Re: [PATCH 2/6] C/R: Basic support for network namespaces and devices
(v5)
Dan Smith wrote:
> OL> What about leak detection ?
> OL> Aren't we missing {netns,netdev}_users()?
>
> This is something I need to give more thought to, but it's not as easy
> as it sounds. Network devices aren't released at the last put() like
> a lot of other things, and my initial attempts to reconcile the
> refcount after a checkpoint operation have not been successful.
>
> However, I'm not sure that it's as important here, because AFAIK, a
> network device can only exist in one network namespace at a time. If
> we're checkpointing a netdev, it's because we are checkpointing the
> namespace that it lives in. Making sure the netns isn't leaked out of
> the process tree would be much easier and just as effective, no?
We should guarantee that neither netns nor netdev leaks outside
the container; currently none is. If a netdev can only belong to
a single netns, then it suffices to only care about netns.
>
>>> +config CHECKPOINT_NETNS
>>> + bool
>>> + default y if NET && NET_NS && CHECKPOINT
>>> +
>
> OL> Did you mean this to be visible (settable) by the user ?
>
> No, it was specifically supposed to enable itself when those other
> items are enabled, but not be a user adjustable toggle. I had a
> discussion with Serge about it and we came to this as a solution,
> although I don't remember what the problem we started with was. I'll
> dig through my IRC logs to see if I can figure it out.
Duh.. my bad, I misinterpreted the code. That's fine.
BTW, there is a similar SYSVIPC_CHECKPOINT - we should decide
if we do X_CHECKPOINT or CHECKPOINT_X for a subsystem X, and
stick to that convention. I prefer the latter - what you did...
>
>>> + retry:
>>> + if (++pages > 4) {
>>> + addrs = -E2BIG;
>>> + goto out;
>>> + }
>
> OL> Why 4 ?
>
> It's just a sanity limit.
Hmm... let me be more explicit: why not keep trying until it
realloc fails ? or switch to vmalloc() at some point ?
>
> OL> Do we really need this special case ? I'd be happy with a ckpt_err()
> OL> for any error - and the actual error number would be useful to tell
> OL> which case it was.
>
> Unless I'm missing something, you asked for this specifically:
>
> https://lists.linux-foundation.org/pipermail/containers/2010-February/022844.html
Lol .. that was me :o But looking at the code it feels wrong,
because the errno already reveals the type of the problem.
I'm thinking - wouldn't it make sense to do error reporting
in checkpoint_netdev() if the call to ->ndo_checkpoint() fails ?
>
> OL> Isn't this check redundant ? I expect it to fail promptly in
> OL> checkpoint_netdev() above.
>
> No, as I've said a couple of times previously, this isn't the only way
> we can arrive at a netdev for checkpointing. This case is the one
> where we're marching through the netns and find a netdev that is not
> supported. The other is where we arrive at a device as a peer of
> another device, so the other check may come into play at times where
> this one doesn't and vice versa.
I'm confused: in checkpoint_ns() inside the for_each_netdev()
loop you first test for dev->netdev_ops->ndo_checkpoint and
then call checkpoint_obj(... CKPT_OBJ_NETDEV) - which in turn
will call checkpoint_netdev(), which will again test for
dev->netdev_ops->ndo_checkpoint ... am I reading it wrongly ?
>
> OL> This may be a bit simpler if you move the first deferqueue_add()
> OL> forward to just before the other one. Or better: change dq_netdev
> OL> to have two pointers, dev and peer (if any is null, the cleanup
> OL> function will skip).
>
> The reason it is this messy is because of the way network devices are
> deallocated. Since they don't destroy themselves on the final put(),
> we have to explicitly call unregister_netdev() on them when we know
> they're no longer used (else we block). Once we've added them to the
> deferqueue, we can no longer destroy them here because a reference is
> held and the deferqueue will run afterwards.
>
> The ordering of this is a result of me injecting failures at each step
> and working it out until I got it to not block on unregistering either
> of the devices in all of the error paths. That's not to say it's the
> best way, but there is a reason it's ordered the way it is.
>
How about this - to me it feels simpler:
dev = rtnl_newlink(veth_new_link_msg, &veth, this_name);
if (IS_ERR(dev))
return dev;
peer = dev_get_by_name(current->nsproxy->net_ns, peer_name);
if (!peer) {
ret = -EINVAL;
goto err_dev;
}
ret = ckpt_obj_insert(ctx, peer, h->veth.peer_ref,
CKPT_OBJ_NETDEV);
if (ret < 0)
goto err_peer;
dev_put(peer);
dq.dev = dev;
dq.peer = peer;
ret = deferqueue_add(ctx->deferqueue, &dq, sizeof(dq),
netdev_noop, netdev_cleanup);
if (ret)
goto err_peer;
(yes, you need to adjust struct dq_netdev and netdev_cleanup).
BTW, the variable "didreg" should disappear from restore_veth().
Oren.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Powered by blists - more mailing lists