linux-kernel - Re: [PATCH] net: Convert net_mutex into rw_semaphore and down read it on net->init/->exit

Message-ID: <f54357a9-cfb9-ea2e-5176-fbce69754b57@virtuozzo.com>
Date:   Tue, 14 Nov 2017 21:04:06 +0300
From:   Kirill Tkhai <ktkhai@...tuozzo.com>
To:     Andrei Vagin <avagin@...tuozzo.com>
Cc:     davem@...emloft.net, vyasevic@...hat.com,
        kstewart@...uxfoundation.org, pombredanne@...b.com,
        vyasevich@...il.com, mark.rutland@....com,
        gregkh@...uxfoundation.org, adobriyan@...il.com, fw@...len.de,
        nicolas.dichtel@...nd.com, xiyou.wangcong@...il.com,
        roman.kapl@...go.com, paul@...l-moore.com, dsahern@...il.com,
        daniel@...earbox.net, lucien.xin@...il.com,
        mschiffer@...verse-factory.net, rshearma@...cade.com,
        linux-kernel@...r.kernel.org, netdev@...r.kernel.org,
        ebiederm@...ssion.com, gorcunov@...tuozzo.com
Subject: Re: [PATCH] net: Convert net_mutex into rw_semaphore and down read it
 on net->init/->exit

On 14.11.2017 20:44, Andrei Vagin wrote:
> On Tue, Nov 14, 2017 at 04:53:33PM +0300, Kirill Tkhai wrote:
>> Currently a mutex is used to protect the pernet operations list. It makes
>> cleanup_net() execute the ->exit methods of the same operations set
>> that was used at the time of ->init, even after the net namespace is
>> unlinked from net_namespace_list.
>>
>> But the problem is that synchronize_rcu() is needed after net is removed
>> from net_namespace_list:
>>
>> Destroy net_ns:
>> cleanup_net()
>>   mutex_lock(&net_mutex)
>>   list_del_rcu(&net->list)
>>   synchronize_rcu()                                  <--- Sleep there for ages
>>   list_for_each_entry_reverse(ops, &pernet_list, list)
>>     ops_exit_list(ops, &net_exit_list)
>>   list_for_each_entry_reverse(ops, &pernet_list, list)
>>     ops_free_list(ops, &net_exit_list)
>>   mutex_unlock(&net_mutex)
>>
>> This primitive is not fast, especially on systems with many processors
>> and/or when preemptible RCU is enabled in the config. So, for the whole time
>> cleanup_net() is waiting for the RCU grace period, new net namespaces
>> cannot be created; the tasks that try to create them sleep on the same mutex:
>>
>> Create net_ns:
>> copy_net_ns()
>>   mutex_lock_killable(&net_mutex)                    <--- Sleep there for ages
>>
>> The solution is to convert net_mutex to an rw_semaphore. Then the
>> pernet_operations::init/::exit methods, which modify net-related data,
>> will require down_read() locking only, while down_write() will be used
>> for changing pernet_list.
>>
>> This gives a significant performance increase, as you can see below.
>> Sequential net namespace creation is measured in a loop, in a single
>> thread, without other tasks (single user mode):
>>
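>> 1)
>> #define _GNU_SOURCE     /* needed for unshare() and CLONE_NEWNET */
>> #include <sched.h>
>> #include <stdio.h>
>> #include <stdlib.h>
>>
>> int main(int argc, char *argv[])
>> {
>>         unsigned nr;
>>         if (argc < 2) {
>>                 fprintf(stderr, "Provide nr iterations arg\n");
>>                 return 1;
>>         }
>>         nr = atoi(argv[1]);
>>         /* Each unshare() creates a new net namespace and drops the old one. */
>>         while (nr-- > 0) {
>>                 if (unshare(CLONE_NEWNET)) {
>>                         perror("Can't unshare");
>>                         return 1;
>>                 }
>>         }
>>         return 0;
>> }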
>>
>> Original, 100000 unshare():
>> 0.03user 23.14system 1:39.85elapsed 23%CPU
>>
>> Patched, 100000 unshare():
>> 0.03user 67.49system 1:08.34elapsed 98%CPU
>>
>> 2) for i in {1..10000}; do unshare -n bash -c exit; done
> 
> Hi Kirill,
> 
> This mutex has another role. You know that net namespaces are destroyed
> asynchronously, and the net mutex guarantees that the backlog will not be
> big. If we have something in the backlog, we know that it will be handled
> before creating a new net ns.
> 
> As far as I remember net namespaces are created much faster than
> they are destroyed, so with these changes we can create a really big
> backlog, can't we?

I don't think limiting is a good goal, or a goal for the mutex at all,
because it's very easy to create many net namespaces even while
the mutex exists. You may open /proc/[pid]/ns/net like a file,
and the net_ns counter will increment. Then do unshare(), and
the mutex has no way to protect against that. In any case, a mutex
can't limit the number of something in general; I've never seen
a (good) example in the kernel.

As I see it, the real limit is enforced in inc_net_namespaces(); the
counter is decremented after the RCU grace period in cleanup_net(),
and that has not changed.
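
To illustrate why a counter, rather than a lock, is what bounds the number
of objects, here is a minimal userspace sketch in C. It is not kernel code:
NS_MAX, ns_count, ns_charge() and ns_uncharge() are hypothetical names made
up for this example, and the limit value is arbitrary. The only point is
that the charge fails immediately when the limit is reached, and that the
uncharge may happen arbitrarily later (in the kernel, after the grace period).

```c
#include <stdatomic.h>
#include <stdbool.h>

#define NS_MAX 4                /* hypothetical limit, chosen for the example */

static atomic_int ns_count;     /* number of namespaces currently charged */

/* Charge one namespace; fail instead of sleeping when the limit is hit. */
static bool ns_charge(void)
{
        int old = atomic_load(&ns_count);

        do {
                if (old >= NS_MAX)
                        return false;
                /* On failure, 'old' is reloaded and the limit is rechecked. */
        } while (!atomic_compare_exchange_weak(&ns_count, &old, old + 1));

        return true;
}

/* Uncharge one namespace; the caller decides when this is safe to do. */
static void ns_uncharge(void)
{
        atomic_fetch_sub(&ns_count, 1);
}
```

With such a scheme the creation path never waits for the destruction path;
it only fails once too many namespaces are still charged.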

> There was a discussion a few months ago:
> https://lists.onap.org/pipermail/containers/2016-October/037509.html
> 
> 
>>
>> Original:
>> real 1m24,190s
>> user 0m6,225s
>> sys 0m15,132s
> 
> Here you measure time of creating and destroying net namespaces.
> 
>>
>> Patched:
>> real 0m18,235s   (4.6 times faster)
>> user 0m4,544s
>> sys 0m13,796s
> 
> But here you measure the time of creating namespaces, and you know nothing
> about when they will be destroyed.

You're right, and I predict that the total time spent on the CPU will remain
the same, but the thing is that creation and destruction may now be executed
in parallel.