Message-ID: <13316c2d-7984-01cd-f4a8-0cf93bd192e1@virtuozzo.com>
Date: Sun, 18 Mar 2018 00:13:00 +0300
From: Kirill Tkhai <ktkhai@...tuozzo.com>
To: Sowmini Varadhan <sowmini.varadhan@...cle.com>
Cc: santosh.shilimkar@...cle.com, davem@...emloft.net,
netdev@...r.kernel.org, linux-rdma@...r.kernel.org,
rds-devel@....oracle.com, edumazet@...gle.com
Subject: Re: [PATCH RFC RFC] rds: Use NETDEV_UNREGISTER in rds_tcp_dev_event()
(then kill NETDEV_UNREGISTER_FINAL)
On 17.03.2018 17:15, Sowmini Varadhan wrote:
>
> I spent a long time staring at both v1 and v2 of your patch.
Thanks for your time!
> I understand the overall goal, but I am afraid to say that these
> patches are complete hacks.
I don't agree with you; see the explanations below.
> I was trying to understand why patchv1 blows with a null rtn in
> rds_tcp_init_net, but v2 does not, and the analysis is ugly.
>
> I'm going to put down the analysis here, and others can
> decide if this sort of hack is a healthy solution for a scaling
> issue (IMHO it is not- we should get the real fix for the
> scaling instead of using duck-tape-and-chewing-gum solutions)
>
> What is happening in v1 is this:
>
> 1. When I do "modprobe rds_tcp" in init_net, we end up doing the
> following in rds_tcp_init
>     register_pernet_device(&rds_tcp_dev_ops);
>     register_pernet_device(&rds_tcp_net_ops);
> Where rds_tcp_dev_ops has
>     .id = &rds_tcp_netid,
>     .size = sizeof(struct rds_tcp_net),
> and rds_tcp_net_ops has 0 values for both of these.
>
> 2. So now pernet_list has &rds_tcp_net_ops as the first member of the
> pernet_list.
>
> 3. Now I do "ip netns create blue". As part of setup_net(), we walk
> the pernet_list and call the ->init of each member (through ops_init()).
> So we'd hit rds_tcp_net_ops first. Since the id/size are 0, we'd
> skip the struct rds_tcp_net allocation, so rds_tcp_init_net would
> find a null return from net_generic() and bomb.
>
> The way I view it (and ymmv) the hack here is to call both
> register_pernet_device and register_pernet_subsys: the kernel only
> guarantees that calling *one* of register_pernet_* will ensure
> that you can safely call net_generic() afterwards.
>
> The v2 patch "works around" this by reordering the registration.
> So this time, init_net will set up the rds_tcp_net_ops as the second
> member, and the first member will be the pernet_operations struct
> that has non-zero id and size.
>
> But then the unregistration (necessarily) works in the opposite order:
> you have to unregister_pernet_device first (so that interfaces are
> quiesced) and then unregister_pernet_subsys() so that sockets can
> be safely quiesced.
>
> I don't think this type of hack makes the code cleaner, it just
> makes things much harder to understand, and completely brittle
> for subsequent changes.
It's not a hack, it's just a way to fix the problem, the same way other
pernet_operations already do it. It's OK for pernet_operations to share a
net_generic() id. The only thing you need is to request the id in the
pernet_operations which comes first in the pernet_list. There are a lot of
examples in the kernel (a minimal generic sketch follows after the two
examples below):
1) sunrpc_net_ops
   rpcsec_gss_net_ops
These pernet_operations share sunrpc_net_id, which is requested in sunrpc_net_ops:
static struct pernet_operations sunrpc_net_ops = {
        .init = sunrpc_init_net,
        .exit = sunrpc_exit_net,
        .id = &sunrpc_net_id,
        .size = sizeof(struct sunrpc_net),
};
and it's also used by rpcsec_gss_net_ops:
static struct pernet_operations rpcsec_gss_net_ops = {
        .init = rpcsec_gss_init_net,
        .exit = rpcsec_gss_exit_net,
};
rpcsec_gss_init_net()->gss_svc_init_net()->rsc_cache_create_net(), where:
static int rsc_cache_create_net(struct net *net)
{
        struct sunrpc_net *sn = net_generic(net, sunrpc_net_id);
        ...                     ^^^here^^^
The only requirement is that sunrpc_net_ops must be registered before
rpcsec_gss_net_ops, and the rpc code guarantees that.
2) ipvs_core_ops
   ipvs_core_dev_ops
   ip_vs_ftp_ops
   ip_vs_lblc_ops
   ip_vs_lblcr_ops
These pernet_operations (5!) share ip_vs_net_id, which is requested in ipvs_core_ops:
static struct pernet_operations ipvs_core_ops = {
        .init = __ip_vs_init,
        .exit = __ip_vs_cleanup,
        .id = &ip_vs_net_id,
        .size = sizeof(struct netns_ipvs),
};

static int __net_init __ip_vs_init(struct net *net)
{
        struct netns_ipvs *ipvs;
        ...
        ipvs = net_generic(net, ip_vs_net_id);
        net->ipvs = ipvs;
        ...
}

static struct pernet_operations ipvs_core_dev_ops = {
        .exit = __ip_vs_dev_cleanup,
};

static void __net_exit __ip_vs_dev_cleanup(struct net *net)
{
        struct netns_ipvs *ipvs = net_ipvs(net);
                                  ^^^requested in ipvs_core_ops
        ...
}
Look at the example above. It solves the same problem rds has: some actions
need to happen at pernet_device exit time. ipvs_core_dev_ops was added exactly
for this, since ipvs_core_ops is called at pernet_subsys time. See ip_vs_init():
static int __init ip_vs_init(void)
{
        ...
        ret = register_pernet_subsys(&ipvs_core_ops);   /* Alloc ip_vs struct */
        if (ret < 0)
                goto cleanup_conn;
        ...
        ret = register_pernet_device(&ipvs_core_dev_ops);
That's all. They use the pernet_device exit method, which may be called in
parallel with anything, and which doesn't take rtnl_lock(). There is no reason
for rds_tcp_net_ops to go its own way, different from all other
pernet_operations, and to take the exclusive rtnl_lock().
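To make the pattern in the two examples above explicit, here is a minimal
generic sketch; the foo_* names are purely illustrative and are not taken
from sunrpc, ipvs or rds:

#include <net/net_namespace.h>
#include <net/netns/generic.h>

static unsigned int foo_net_id;

struct foo_net {
        int dummy;                      /* per-netns state */
};

static int __net_init foo_init_net(struct net *net)
{
        /* ops_init() has already allocated struct foo_net here,
         * because foo_subsys_ops has .id and .size set. */
        struct foo_net *fn = net_generic(net, foo_net_id);

        fn->dummy = 0;
        return 0;
}

/* Registered as a subsys: requests the id and the per-net allocation. */
static struct pernet_operations foo_subsys_ops = {
        .init = foo_init_net,
        .id   = &foo_net_id,
        .size = sizeof(struct foo_net),
};

static void __net_exit foo_dev_exit_net(struct net *net)
{
        /* Device exit methods run before subsys exit methods for a
         * dying netns, so the shared per-net struct is still alive. */
        struct foo_net *fn = net_generic(net, foo_net_id);

        fn->dummy = -1;                 /* quiesce devices/sockets here */
}

/* No .id/.size here: this one just reuses foo_net_id requested above. */
static struct pernet_operations foo_dev_ops = {
        .exit = foo_dev_exit_net,
};

static int __init foo_init(void)
{
        int ret;

        ret = register_pernet_subsys(&foo_subsys_ops);
        if (ret)
                return ret;

        ret = register_pernet_device(&foo_dev_ops);
        if (ret)
                unregister_pernet_subsys(&foo_subsys_ops);
        return ret;
}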
> To solve the scaling problem why not just have a well-defined
> callback to modules when devices are quiesced, instead of
> overloading the pernet_device registration in this obscure way?
We already have them. These callbacks are the pernet_operations exit methods.
These methods can execute in parallel and scale nicely.
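For rds_tcp that would mean something like the sketch below, assuming
rds_tcp_net_ops carries .id = &rds_tcp_netid and
.size = sizeof(struct rds_tcp_net); the exit function and the
rds_tcp_kill_sockets() helper are illustrative names only:

static void __net_exit rds_tcp_dev_exit_net(struct net *net)
{
        struct rds_tcp_net *rtn = net_generic(net, rds_tcp_netid);

        /* Quiesce the per-netns listen socket and connections here.
         * This runs without rtnl_lock() and in parallel with the
         * teardown of other namespaces. */
        rds_tcp_kill_sockets(net, rtn);
}

static struct pernet_operations rds_tcp_dev_ops = {
        .exit = rds_tcp_dev_exit_net,
};

static int __init rds_tcp_init(void)
{
        int ret;

        ret = register_pernet_subsys(&rds_tcp_net_ops);
        if (ret)
                return ret;

        ret = register_pernet_device(&rds_tcp_dev_ops);
        if (ret) {
                unregister_pernet_subsys(&rds_tcp_net_ops);
                return ret;
        }
        ...
}

No netdevice notifier and no rtnl_lock() are involved in this scheme; the
netns core invokes the device-level exit for every dying namespace.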
Kirill