Message-ID: <87wp2r33pz.fsf@xmission.com>
Date:   Wed, 15 Nov 2017 10:29:28 -0600
From:   ebiederm@...ssion.com (Eric W. Biederman)
To:     Kirill Tkhai <ktkhai@...tuozzo.com>
Cc:     davem@...emloft.net, vyasevic@...hat.com,
        kstewart@...uxfoundation.org, pombredanne@...b.com,
        vyasevich@...il.com, mark.rutland@....com,
        gregkh@...uxfoundation.org, adobriyan@...il.com, fw@...len.de,
        nicolas.dichtel@...nd.com, xiyou.wangcong@...il.com,
        roman.kapl@...go.com, paul@...l-moore.com, dsahern@...il.com,
        daniel@...earbox.net, lucien.xin@...il.com,
        mschiffer@...verse-factory.net, rshearma@...cade.com,
        linux-kernel@...r.kernel.org, netdev@...r.kernel.org,
        avagin@...tuozzo.com, gorcunov@...tuozzo.com
Subject: Re: [PATCH] net: Convert net_mutex into rw_semaphore and down read it on net->init/->exit

Kirill Tkhai <ktkhai@...tuozzo.com> writes:

> On 15.11.2017 09:25, Eric W. Biederman wrote:
>> Kirill Tkhai <ktkhai@...tuozzo.com> writes:
>> 
>>> Currently, a mutex is used to protect the pernet operations list. It
>>> makes cleanup_net() execute the ->exit methods of the same set of
>>> operations that was in use at the time of ->init, even after the net
>>> namespace is unlinked from net_namespace_list.
>>>
>>> But the problem is that we need to synchronize_rcu() after the net is
>>> removed from net_namespace_list():
>>>
>>> Destroy net_ns:
>>> cleanup_net()
>>>   mutex_lock(&net_mutex)
>>>   list_del_rcu(&net->list)
>>>   synchronize_rcu()                                  <--- Sleep there for ages
>>>   list_for_each_entry_reverse(ops, &pernet_list, list)
>>>     ops_exit_list(ops, &net_exit_list)
>>>   list_for_each_entry_reverse(ops, &pernet_list, list)
>>>     ops_free_list(ops, &net_exit_list)
>>>   mutex_unlock(&net_mutex)
>>>
>>> This primitive is not fast, especially on systems with many processors
>>> and/or when preemptible RCU is enabled in the config. So, all the while
>>> cleanup_net() is waiting for the RCU grace period, creation of new net
>>> namespaces is not possible; the tasks that attempt it sleep on the same
>>> mutex:
>>>
>>> Create net_ns:
>>> copy_net_ns()
>>>   mutex_lock_killable(&net_mutex)                    <--- Sleep there for ages
>>>
>>> The solution is to convert net_mutex to an rw_semaphore. Then,
>>> pernet_operations::init/::exit methods, modifying the net-related data,
>>> will require down_read() locking only, while down_write() will be used
>>> for changing pernet_list.
>>>
>>> This gives a significant performance increase, as you may see below.
>>> What is measured is sequential net namespace creation in a loop, in a
>>> single thread, with no other tasks running (single user mode):
>>>
>>> 1) Sequential unshare() in a loop:
>>>
>>> #define _GNU_SOURCE
>>> #include <sched.h>
>>> #include <stdio.h>
>>> #include <stdlib.h>
>>>
>>> int main(int argc, char *argv[])
>>> {
>>>         unsigned nr;
>>>         if (argc < 2) {
>>>                 fprintf(stderr, "Provide nr iterations arg\n");
>>>                 return 1;
>>>         }
>>>         nr = atoi(argv[1]);
>>>         while (nr-- > 0) {
>>>                 if (unshare(CLONE_NEWNET)) {
>>>                         perror("Can't unshare");
>>>                         return 1;
>>>                 }
>>>         }
>>>         return 0;
>>> }
>>>
>>> Origin, 100000 unshare():
>>> 0.03user 23.14system 1:39.85elapsed 23%CPU
>>>
>>> Patched, 100000 unshare():
>>> 0.03user 67.49system 1:08.34elapsed 98%CPU
>>>
>>> 2) for i in {1..10000}; do unshare -n bash -c exit; done
>>>
>>> Origin:
>>> real 1m24,190s
>>> user 0m6,225s
>>> sys 0m15,132s
>>>
>>> Patched:
>>> real 0m18,235s   (4.6 times faster)
>>> user 0m4,544s
>>> sys 0m13,796s
>>>
>>> This patch requires commit 76f8507f7a64 "locking/rwsem: Add down_read_killable()"
>>> from Linus tree (not in net-next yet).
>> 
>> Using a rwsem to protect the list of operations makes sense.
>> 
>> That should allow removing the single threaded workqueue.
>> 
>> I am not wild about taking the rwsem down_write in
>> rtnl_link_unregister and net_ns_barrier.  I think that works, but it
>> goes from being a mild hack to being a pretty bad hack, and something
>> else that can kill the parallelism you are seeking to add.
>> 
>> There are about 204 instances of struct pernet_operations.  That is a
>> lot of code to have carefully audited to ensure it can all run in
>> parallel at once.  The existence of the exit_batch method, net_ns_barrier,
>> for_each_net and taking of net_mutex in rtnl_link_unregister all testify
>> to the fact that there are data structures accessed by multiple network
>> namespaces.
>> 
>> My preference would be to:
>> 
>> - Add the net_sem in addition to net_mutex with down_write only held in
>>   register and unregister, and maybe net_ns_barrier and
>>   rtnl_link_unregister.
>> 
>> - Factor struct pernet_ops out of struct pernet_operations, with
>>   struct pernet_ops not having the exit_batch method, and with pernet_ops
>>   being embedded as an anonymous member of the old struct pernet_operations.
>> 
>> - Add [un]register_pernet_{sys,dev} functions that take a struct
>>   pernet_ops, that don't take net_mutex.  Have them order the
>>   pernet_list as:
>> 
>>   pernet_sys
>>   pernet_subsys
>>   pernet_device
>>   pernet_dev
>> 
>>   With the chunk in the middle taking the net_mutex.
>
> I think this approach will work. Thanks for the suggestion. Some more
> thoughts on the plan below.
>
> The only difficult thing will be choosing the right order in which
> to move ops from pernet_subsys to pernet_sys, and from pernet_device
> to pernet_dev, one by one.
>
> This is rather easy in the case of tristate drivers, as modules may be
> loaded at any time, and the only important order is the dependencies
> between them. So, it's possible to start from a module that has no
> dependencies, move it to pernet_sys, and then continue with modules
> that have no remaining dependencies in pernet_subsys. For pernet_device
> it's vice versa.
>
> In the case of bool drivers, the situation may be worse, because
> the order is not so clear there. Initcalls of the same priority
> (for example, those registered via core_initcall()) may require
> a certain order, driven by linking order. I know one example from
> the device mapper code, which lives in drivers/md/Makefile.
> This problem is also solvable, even if such places do not contain
> comments about linking order: we just need to respect the Makefile
> order when choosing a new candidate to move.
>  
>>   I think I would enforce the ordering with a failure to register
>>   if a subsystem or a device tried to register out of order.  
>> 
>> - Disable use of the single threaded workqueue if nothing needs the
>>   net_mutex.
>
> We may use per-cpu workqueues in the future. The idea of refusing to
> use a workqueue doesn't seem good to me, because asynchronous net
> destruction looks very useful.

per-cpu workqueues are fine, and definitely what I am expecting.  If we
are doing this I want to get us off the single threaded workqueue that
serializes all of the cleanup.  That has a huge potential for
simplifying things and reducing maintenance if running everything in
parallel is actually safe.

I forget how the modern per-cpu workqueues work with respect to sleeps
and locking (I don't remember whether, when a piece of work sleeps for a
long time, we spin up another per-cpu workqueue thread, and thus avoid
priority inversion problems).

If locks between workqueues are not a problem we could start the
transition off of the single-threaded serializing workqueue sooner
rather than later.

>> - Add a test mode that deliberately spawns threads on multiple
>>   processors and deliberately creates multiple network namespaces
>>   at the same time.
>> 
>> - Add a test mode that deliberately spawns threads on multiple
>>   processors and deliberately destroys multiple network namespaces
>>   at the same time.
>> 
>> - Convert the code to unlocked operation one pernet_operations at a
>>   time, being careful with the loopback device: its order in the list
>>   strongly matters.
>> 
>> - Finally remove the unnecessary code.
>> 
>> 
>> At the end of the day because all of the operations for one network
>> namespace will run in parallel with all of the operations for another
>> network namespace all of the sophistication that goes into batching the
>> cleanup of multiple network namespaces can be removed.  As different
>> tasks (not sharing a lock) can wait in synchronize_rcu at the same time
>> without slowing each other down.
>> 
>> I think we can remove the batching but I am afraid that will lead us into
>> rtnl_lock contention.
>
> I've looked into this lock. It is used in many places for many reasons.
> It's a little strange that nobody has tried to break it up into several
> smaller locks.

It gets tricky.  At this point getting net_mutex is enough to start
with. Mostly the rtnl_lock covers the slow path which tends to keep it
from rising to the top of the priority list.

>> I am definitely not comfortable with changing the locking on so much
>> code without an explanation at all why it is safe in the commit comments
>> in all 204 instances.  Which combined equal most of the well maintained
>> and regularly used part of the networking stack.
>
> David,
>
> could you please check the plan above and say whether 1) it's OK with you,
> and 2) if so, whether you expect all the changes in one big 200+ patch
> set, or whether we may go sequentially (50 patches at a time, for example)?

I think I would start with something covering the core networking
pieces so the improvements can be seen.


Eric

