netdev - Re: [PATCH] net: Convert net_mutex into rw_semaphore and down read it on net->init/->exit

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <d8b9e4d3-b4e1-a2ee-4504-46dbe46154fe@virtuozzo.com>
Date:   Fri, 17 Nov 2017 19:46:55 +0300
From:   Kirill Tkhai <ktkhai@...tuozzo.com>
To:     "Eric W. Biederman" <ebiederm@...ssion.com>
Cc:     davem@...emloft.net, vyasevic@...hat.com,
        kstewart@...uxfoundation.org, pombredanne@...b.com,
        vyasevich@...il.com, mark.rutland@....com,
        gregkh@...uxfoundation.org, adobriyan@...il.com, fw@...len.de,
        nicolas.dichtel@...nd.com, xiyou.wangcong@...il.com,
        roman.kapl@...go.com, paul@...l-moore.com, dsahern@...il.com,
        daniel@...earbox.net, lucien.xin@...il.com,
        mschiffer@...verse-factory.net, rshearma@...cade.com,
        linux-kernel@...r.kernel.org, netdev@...r.kernel.org,
        avagin@...tuozzo.com, gorcunov@...tuozzo.com
Subject: Re: [PATCH] net: Convert net_mutex into rw_semaphore and down read it
 on net->init/->exit

On 15.11.2017 19:29, Eric W. Biederman wrote:
> Kirill Tkhai <ktkhai@...tuozzo.com> writes:
> 
>> On 15.11.2017 09:25, Eric W. Biederman wrote:
>>> Kirill Tkhai <ktkhai@...tuozzo.com> writes:
>>>
>>>> Curently mutex is used to protect pernet operations list. It makes
>>>> cleanup_net() to execute ->exit methods of the same operations set,
>>>> which was used on the time of ->init, even after net namespace is
>>>> unlinked from net_namespace_list.
>>>>
>>>> But the problem is it's need to synchronize_rcu() after net is removed
>>>> from net_namespace_list():
>>>>
>>>> Destroy net_ns:
>>>> cleanup_net()
>>>>   mutex_lock(&net_mutex)
>>>>   list_del_rcu(&net->list)
>>>>   synchronize_rcu()                                  <--- Sleep there for ages
>>>>   list_for_each_entry_reverse(ops, &pernet_list, list)
>>>>     ops_exit_list(ops, &net_exit_list)
>>>>   list_for_each_entry_reverse(ops, &pernet_list, list)
>>>>     ops_free_list(ops, &net_exit_list)
>>>>   mutex_unlock(&net_mutex)
>>>>
>>>> This primitive is not fast, especially on the systems with many processors
>>>> and/or when preemptible RCU is enabled in config. So, all the time, while
>>>> cleanup_net() is waiting for RCU grace period, creation of new net namespaces
>>>> is not possible, the tasks, who makes it, are sleeping on the same mutex:
>>>>
>>>> Create net_ns:
>>>> copy_net_ns()
>>>>   mutex_lock_killable(&net_mutex)                    <--- Sleep there for ages
>>>>
>>>> The solution is to convert net_mutex to the rw_semaphore. Then,
>>>> pernet_operations::init/::exit methods, modifying the net-related data,
>>>> will require down_read() locking only, while down_write() will be used
>>>> for changing pernet_list.
>>>>
>>>> This gives signify performance increase, like you may see below. There
>>>> is measured sequential net namespace creation in a cycle, in single
>>>> thread, without other tasks (single user mode):
>>>>
>>>> 1)int main(int argc, char *argv[])
>>>> {
>>>>         unsigned nr;
>>>>         if (argc < 2) {
>>>>                 fprintf(stderr, "Provide nr iterations arg\n");
>>>>                 return 1;
>>>>         }
>>>>         nr = atoi(argv[1]);
>>>>         while (nr-- > 0) {
>>>>                 if (unshare(CLONE_NEWNET)) {
>>>>                         perror("Can't unshare");
>>>>                         return 1;
>>>>                 }
>>>>         }
>>>>         return 0;
>>>> }
>>>>
>>>> Origin, 100000 unshare():
>>>> 0.03user 23.14system 1:39.85elapsed 23%CPU
>>>>
>>>> Patched, 100000 unshare():
>>>> 0.03user 67.49system 1:08.34elapsed 98%CPU
>>>>
>>>> 2)for i in {1..10000}; do unshare -n bash -c exit; done
>>>>
>>>> Origin:
>>>> real 1m24,190s
>>>> user 0m6,225s
>>>> sys 0m15,132s
>>>>
>>>> Patched:
>>>> real 0m18,235s   (4.6 times faster)
>>>> user 0m4,544s
>>>> sys 0m13,796s
>>>>
>>>> This patch requires commit 76f8507f7a64 "locking/rwsem: Add down_read_killable()"
>>>> from Linus tree (not in net-next yet).
>>>
>>> Using a rwsem to protect the list of operations makes sense.
>>>
>>> That should allow removing the sing
>>>
>>> I am not wild about taking a the rwsem down_write in
>>> rtnl_link_unregister, and net_ns_barrier.  I think that works but it
>>> goes from being a mild hack to being a pretty bad hack and something
>>> else that can kill the parallelism you are seeking it add.
>>>
>>> There are about 204 instances of struct pernet_operations.  That is a
>>> lot of code to have carefully audited to ensure it can in parallel all
>>> at once.  The existence of the exit_batch method, net_ns_barrier,
>>> for_each_net and taking of net_mutex in rtnl_link_unregister all testify
>>> to the fact that there are data structures accessed by multiple network
>>> namespaces.
>>>
>>> My preference would be to:
>>>
>>> - Add the net_sem in addition to net_mutex with down_write only held in
>>>   register and unregister, and maybe net_ns_barrier and
>>>   rtnl_link_unregister.
>>>
>>> - Factor out struct pernet_ops out of struct pernet_operations.  With
>>>   struct pernet_ops not having the exit_batch method.  With pernet_ops
>>>   being embedded an anonymous member of the old struct pernet_operations.
>>>
>>> - Add [un]register_pernet_{sys,dev} functions that take a struct
>>>   pernet_ops, that don't take net_mutex.  Have them order the
>>>   pernet_list as:
>>>
>>>   pernet_sys
>>>   pernet_subsys
>>>   pernet_device
>>>   pernet_dev
>>>
>>>   With the chunk in the middle taking the net_mutex.
>>
>> I think this approach will work. Thanks for the suggestion. Some more
>> thoughts to the plan below.
>>
>> The only difficult thing there will be to choose the right order
>> to move ops from pernet_subsys to pernet_sys and from pernet_device
>> to pernet_dev one by one.
>>
>> This is rather easy in case of tristate drivers, as modules may be loaded
>> at any time, and the only important order is dependences between them.
>> So, it's possible to start from a module, who has no dependences,
>> and move it to pernet_sys, and then continue with modules,
>> who have no more dependences in pernet_subsys. For pernet_device
>> it's vise versa.
>>
>> In case of bool drivers, the situation may be worse, because
>> the order is not so clear there. The same priority initcalls
>> (for example, initcalls, registered via core_initcall()) may require
>> the certain order, driven by linking order. I know one example from
>> device mapper code, which lives here: drivers/md/Makefile.
>> This problem is also solvable, even if such places do not contain
>> comments about linking order. It's just need to respect Makefile
>> order, when choosing a new candidate to move.
>>  
>>>   I think I would enforce the ordering with a failure to register
>>>   if a subsystem or a device tried to register out of order.  
>>>
>>> - Disable use of the single threaded workqueue if nothing needs the
>>>   net_mutex.
>>
>> We may use per-cpu worqueues in the future. The idea to refuse using
>> worqueue doesn't seem good for me, because asynchronous net destroying
>> looks very useful.
> 
> per-cpu workqueues are fine, and definitely what I am expecting.  If we
> are doing this I want to get us off the single threaded workqueue that
> serializes all of the cleanup.  That has a huge potential for
> simplifying things and reducing maintenance if running everything in
> parallel is actually safe.
> 
> I forget how the modern per-cpu workqueues work with respect to sleeps
> and locking (I don't remember if when piece of work sleeps for a long
> time we spin up another thread per-cpu workqueue thread, and thus avoid
> priority inversion problems).
> 
> If locks between workqueues are not a problem we could start the
> transition off of the single-threaded serializing workqueue sooner
> rather than later.
> 
>>> - Add a test mode that deliberartely spawns threads on multiple
>>>   processors and deliberately creates multiple network namespaces
>>>   at the same time.
>>>
>>> - Add a test mode that deliberately spawns threads on multiple
>>>   processors and delibearate destrosy multiple network namespaces
>>>   at the same time.>
>>> - Convert the code to unlocked operation one pernet_operations to at a
>>>   time.  Being careful with the loopback device it's order in the list
>>>   strongly matters.
>>>
>>> - Finally remove the unnecessary code.
>>>
>>>
>>> At the end of the day because all of the operations for one network
>>> namespace will run in parallel with all of the operations for another
>>> network namespace all of the sophistication that goes into batching the
>>> cleanup of multiple network namespaces can be removed.  As different
>>> tasks (not sharing a lock) can wait in syncrhonize_rcu at the same time
>>> without slowing each other down.
>>>
>>> I think we can remove the batching but I am afraid that will lead us into
>>> rtnl_lock contention.
>>
>> I've looked into this lock. It used in many places for many reasons.
>> It's a little strange, nobody tried to break it up in several small locks..
> 
> It gets tricky.  At this point getting net_mutex is enough to start
> with. Mostly the rtnl_lock covers the slow path which tends to keep it
> from rising to the top of the priority list.
> 
>>> I am definitely not comfortable with changing the locking on so much
>>> code without an explanation at all why it is safe in the commit comments
>>> in all 204 instances.  Which combined equal most of the well maintained
>>> and regularly used part of the networking stack.
>>
>> David,
>>
>> could you please check the plan above and say whether 1)it's OK for you,
>> and 2)if so, will you expect all the changes are made in one big 200+ patch set
>> or we may go sequentially(50 patches a time, for example)?
> 
> I think I would start with the something covering the core networking
> pieces so the improvements can be seen.

Since the main problem is mutex held during synchronize_rcu(), the performance
improvements may become seen only after it's completely removed.