netdev - Re: [PATCH RFC RFC] rds: Use NETDEV_UNREGISTER in rds_tcp_dev_event() (then kill NETDEV_UNREGISTER

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Date:   Sat, 17 Mar 2018 10:15:07 -0400
From:   Sowmini Varadhan <sowmini.varadhan@...cle.com>
To:     Kirill Tkhai <ktkhai@...tuozzo.com>
Cc:     santosh.shilimkar@...cle.com, davem@...emloft.net,
        netdev@...r.kernel.org, linux-rdma@...r.kernel.org,
        rds-devel@....oracle.com, edumazet@...gle.com
Subject: Re: [PATCH RFC RFC] rds: Use NETDEV_UNREGISTER in
 rds_tcp_dev_event() (then kill NETDEV_UNREGISTER_FINAL)

I spent a long time staring at both v1 and v2 of your patch.

I understand the overall goal, but I am afraid to say that these
patches are complete hacks.

I was trying to understand why patchv1 blows with a null rtn in 
rds_tcp_init_net, but v2 does not, and the analysis is ugly.  

I'm going to put down the analysis here, and others can
decide if this sort of hack is a healthy solution for a scaling
issue (IMHO  it is not- we should get the real fix for the
scaling instead of using duck-tape-and-chewing-gum solutions)

What is happening in v1 is this:

1. Wnen I do "modprobe rds_tcp" in init_net, we end up doing the
   following in rds_tcp_init
       register_pernet_device(&rds_tcp_dev_ops);
       register_pernet_device(&rds_tcp_net_ops);
   Where rds_tcp_dev_ops has 
        .id = &rds_tcp_netid,
        .size = sizeof(struct rds_tcp_net),
   and rds_tcp_net_ops has 0 values for both of these.

2. So now pernet_list has &rds_tcp_net_ops as the first member of the
   pernet_list.

3. Now I do "ip netns create blue". As part of setup_net(), we walk
   the pernet_list and call the ->init of each member (through ops_init()). 
   So we'd hit rds_tcp_net_ops first. Since the id/size are 0, we'd
   skip the struct rds_tcp_net allocation, so rds_tcp_init_net would 
   find a null return from net_generic() and bomb.

The way I view it (and ymmv) the hack here is to call both
register_pernet_device and register_pernet_subsys: the kernel only
guarantees that calling *one* of register_pernet_* will ensure
that you can safely call net_generic() afterwards.

The v2 patch "works around" this by reordering the registration. 
So this time, init_net will set up the rds_tcp_net_ops as the second 
member, and the first memeber will be the pernet_operations struct 
that has non-zero id and size.

But then the unregistration (necessarily) works in the opposite order
you have to unregister_pernet_device first (so that interfaces are
quiesced) and then unregister_pernet_subsys() so that sockets can
be safely quiesced.

I dont think this type of hack makes the code cleaner, it just
make things much harder to understand, and completely brittle
for subsequent changes.

To solve the scaling problem why not just have a well-defined 
callback to modules when devices are quiesced, instead of 
overloading the pernet_device registration in this obscure way?