netdev - Re: [PATCH] net: limit a number of namespaces which can be cleaned up concurrently

Message-ID: <87k2db39zf.fsf@x220.int.ebiederm.org>
Date:   Thu, 13 Oct 2016 22:06:28 -0500
From:   ebiederm@...ssion.com (Eric W. Biederman)
To:     Andrei Vagin <avagin@...tuozzo.com>
Cc:     Andrei Vagin <avagin@...nvz.org>, <netdev@...r.kernel.org>,
        <containers@...ts.linux-foundation.org>,
        "David S. Miller" <davem@...emloft.net>
Subject: Re: [PATCH] net: limit a number of namespaces which can be cleaned up concurrently

Andrei Vagin <avagin@...tuozzo.com> writes:

> On Thu, Oct 13, 2016 at 10:49:38AM -0500, Eric W. Biederman wrote:
>> Andrei Vagin <avagin@...nvz.org> writes:
>> 
>> > From: Andrey Vagin <avagin@...nvz.org>
>> >
>> > The operation of destroying netns is heavy and it is executed under
>> > net_mutex. If many namespaces are destroyed concurrently, net_mutex can
>> > be locked for a long time. It is impossible to create a new netns during
>> > this period of time.
>> 
>> This may be the right approach or at least the right approach to bound
>> net_mutex hold times but I have to take exception to calling network
>> namespace cleanup heavy.
>> 
>> The only particularly time-consuming operations I have ever found are
>> calls to synchronize_rcu/synchronize_sched/synchronize_net.
>
> I booted the kernel with maxcpus=1. In this case these functions are
> very fast, and the problem is still there anyway.
>
> According to perf, we spend a lot of time in kobject_uevent:
>
> -   99.96%     0.00%  kworker/u4:1     [kernel.kallsyms]  [k] unregister_netdevice_many
>    - unregister_netdevice_many
>       - 99.95% rollback_registered_many
>          - 99.64% netdev_unregister_kobject
>             - 33.43% netdev_queue_update_kobjects
>                - 33.40% kobject_put
>                   - kobject_release
>                      + 33.37% kobject_uevent
>                      + 0.03% kobject_del
>                + 0.03% sysfs_remove_group
>             - 33.13% net_rx_queue_update_kobjects
>                - kobject_put
>                - kobject_release
>                   + 33.11% kobject_uevent
>                   + 0.01% kobject_del
>                     0.00% rx_queue_release
>             - 33.08% device_del
>                + 32.75% kobject_uevent
>                + 0.17% device_remove_attrs
>                + 0.07% dpm_sysfs_remove
>                + 0.04% device_remove_class_symlinks
>                + 0.01% kobject_del
>                + 0.01% device_pm_remove
>                + 0.01% sysfs_remove_file_ns
>                + 0.00% klist_del
>                + 0.00% driver_deferred_probe_del
>                  0.00% cleanup_glue_dir.isra.14.part.15
>                  0.00% to_acpi_device_node
>                  0.00% sysfs_remove_group
>               0.00% klist_del
>               0.00% device_remove_attrs
>          + 0.26% call_netdevice_notifiers_info
>          + 0.04% rtmsg_ifinfo_build_skb
>          + 0.01% rtmsg_ifinfo_send
>         0.00% dev_uc_flush
>         0.00% netif_reset_xps_queues_gt
>
> Someone may be listening for these uevents, so we can't stop sending
> them without breaking backward compatibility. We can try to optimize
> kobject_uevent...

Oh, that is a surprise.  We can definitely skip generating uevents for
network namespaces that are exiting because by definition no one can see
those network namespaces.  If a socket existed that could see those
uevents it would hold a reference to the network namespace and as such
the network namespace could not exit.
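
As a toy illustration of that reference-counting argument, here is a
small userspace Python model. It is not kernel code: the Net and
UeventSocket classes and broadcast_uevent() are names invented for this
sketch, and it assumes only what is stated above, namely that a
listening socket holds a reference on its namespace and that a
namespace exits only once the last reference is gone.

```python
class UeventSocket:
    """Hypothetical stand-in for a uevent listener socket."""

    def __init__(self):
        self.inbox = []


class Net:
    """Hypothetical stand-in for a network namespace."""

    def __init__(self, name):
        self.name = name
        self.refs = 1        # reference held by the creating task
        self.listeners = []  # uevent sockets opened inside this namespace

    def open_uevent_socket(self):
        sock = UeventSocket()
        self.refs += 1       # a socket pins the namespace it lives in
        self.listeners.append(sock)
        return sock

    def close_uevent_socket(self, sock):
        self.listeners.remove(sock)
        self.refs -= 1

    def put(self):
        """Drop the creating task's reference."""
        self.refs -= 1

    @property
    def exiting(self):
        # The namespace can only exit once nothing references it.
        return self.refs == 0


def broadcast_uevent(net, event):
    """Deliver event to the namespace's listeners; return the delivery count."""
    if net.exiting:
        # No references means no sockets, so there is nobody to notify.
        return 0
    for sock in net.listeners:
        sock.inbox.append(event)
    return len(net.listeners)
```

In this model a namespace with an open listener never reports itself as
exiting, and by the time it does, its listener list is already empty, so
the early return cannot drop an event that anyone would have received.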

That sounds like it is worth investigating a little more deeply.

I am surprised that allocation and freeing are so heavy that we are
spending lots of time doing that.  On the other hand, kobj_bcast_filter
is very dumb and runs very late, so I expect something can be moved
earlier to make that code cheaper with the tiniest bit of work.
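
To make the "late filter" cost concrete, here is a toy Python
comparison. It is a sketch under a stated assumption, not a description
of the real code path: it assumes the sender prepares one message per
registered uevent socket and only afterwards checks whether that
socket's namespace matches. The function names are invented for the
example.

```python
def send_late_filter(event_ns, socket_namespaces):
    """Build a message for every socket, then filter by namespace."""
    built = delivered = 0
    for sock_ns in socket_namespaces:
        message = ("uevent", event_ns)  # stands in for an allocation and copy
        built += 1
        if sock_ns == event_ns:         # the check happens after the work
            delivered += 1
        del message                     # stands in for freeing the buffer
    return built, delivered


def send_early_filter(event_ns, socket_namespaces):
    """Check the namespace first and build a message only on a match."""
    built = delivered = 0
    for sock_ns in socket_namespaces:
        if sock_ns != event_ns:
            continue                    # skip before allocating anything
        message = ("uevent", event_ns)
        built += 1
        delivered += 1
        del message
    return built, delivered
```

With many namespaces and an event that belongs to only one of them, both
variants deliver the same messages, but the late filter builds and
discards one message per socket while the early filter builds only the
ones that are delivered.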

Eric