netdev - Re: [PATCH] net: Convert net_mutex into rw_semaphore and down read it on net->init/->exit

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <fa6642c0-cd83-87b9-989c-8f87f3131ee2@virtuozzo.com>
Date:   Tue, 14 Nov 2017 23:43:28 +0300
From:   Kirill Tkhai <ktkhai@...tuozzo.com>
To:     Andrei Vagin <avagin@...tuozzo.com>
Cc:     davem@...emloft.net, vyasevic@...hat.com,
        kstewart@...uxfoundation.org, pombredanne@...b.com,
        vyasevich@...il.com, mark.rutland@....com,
        gregkh@...uxfoundation.org, adobriyan@...il.com, fw@...len.de,
        nicolas.dichtel@...nd.com, xiyou.wangcong@...il.com,
        roman.kapl@...go.com, paul@...l-moore.com, dsahern@...il.com,
        daniel@...earbox.net, lucien.xin@...il.com,
        mschiffer@...verse-factory.net, rshearma@...cade.com,
        linux-kernel@...r.kernel.org, netdev@...r.kernel.org,
        ebiederm@...ssion.com, gorcunov@...tuozzo.com
Subject: Re: [PATCH] net: Convert net_mutex into rw_semaphore and down read it
 on net->init/->exit

On 14.11.2017 21:38, Andrei Vagin wrote:
> On Tue, Nov 14, 2017 at 09:04:06PM +0300, Kirill Tkhai wrote:
>> On 14.11.2017 20:44, Andrei Vagin wrote:
>>> On Tue, Nov 14, 2017 at 04:53:33PM +0300, Kirill Tkhai wrote:
>>>> Curently mutex is used to protect pernet operations list. It makes
>>>> cleanup_net() to execute ->exit methods of the same operations set,
>>>> which was used on the time of ->init, even after net namespace is
>>>> unlinked from net_namespace_list.
>>>>
>>>> But the problem is it's need to synchronize_rcu() after net is removed
>>>> from net_namespace_list():
>>>>
>>>> Destroy net_ns:
>>>> cleanup_net()
>>>>   mutex_lock(&net_mutex)
>>>>   list_del_rcu(&net->list)
>>>>   synchronize_rcu()                                  <--- Sleep there for ages
>>>>   list_for_each_entry_reverse(ops, &pernet_list, list)
>>>>     ops_exit_list(ops, &net_exit_list)
>>>>   list_for_each_entry_reverse(ops, &pernet_list, list)
>>>>     ops_free_list(ops, &net_exit_list)
>>>>   mutex_unlock(&net_mutex)
>>>>
>>>> This primitive is not fast, especially on the systems with many processors
>>>> and/or when preemptible RCU is enabled in config. So, all the time, while
>>>> cleanup_net() is waiting for RCU grace period, creation of new net namespaces
>>>> is not possible, the tasks, who makes it, are sleeping on the same mutex:
>>>>
>>>> Create net_ns:
>>>> copy_net_ns()
>>>>   mutex_lock_killable(&net_mutex)                    <--- Sleep there for ages
>>>>
>>>> The solution is to convert net_mutex to the rw_semaphore. Then,
>>>> pernet_operations::init/::exit methods, modifying the net-related data,
>>>> will require down_read() locking only, while down_write() will be used
>>>> for changing pernet_list.
>>>>
>>>> This gives signify performance increase, like you may see below. There
>>>> is measured sequential net namespace creation in a cycle, in single
>>>> thread, without other tasks (single user mode):
>>>>
>>>> 1)int main(int argc, char *argv[])
>>>> {
>>>>         unsigned nr;
>>>>         if (argc < 2) {
>>>>                 fprintf(stderr, "Provide nr iterations arg\n");
>>>>                 return 1;
>>>>         }
>>>>         nr = atoi(argv[1]);
>>>>         while (nr-- > 0) {
>>>>                 if (unshare(CLONE_NEWNET)) {
>>>>                         perror("Can't unshare");
>>>>                         return 1;
>>>>                 }
>>>>         }
>>>>         return 0;
>>>> }
>>>>
>>>> Origin, 100000 unshare():
>>>> 0.03user 23.14system 1:39.85elapsed 23%CPU
>>>>
>>>> Patched, 100000 unshare():
>>>> 0.03user 67.49system 1:08.34elapsed 98%CPU
>>>>
>>>> 2)for i in {1..10000}; do unshare -n bash -c exit; done
>>>
>>> Hi Kirill,
>>>
>>> This mutex has another role. You know that net namespaces are destroyed
>>> asynchronously, and the net mutex gurantees that a backlog will be not
>>> big. If we have something in backlog, we know that it will be handled
>>> before creating a new net ns.
>>>
>>> As far as I remember net namespaces are created much faster than
>>> they are destroyed, so with this changes we can create a really big
>>> backlog, can't we?
>>
>> I don't think limitation is a good goal or a gool for the mutex,
>> because it's very easy to create many net namespaces in case of
>> the mutex exists. You may open /proc/[pid]/ns/net like a file,
>> and net_ns counter will increment. Then, do unshare(), and
>> the mutex has no a way to protect against that.
> 
> You are right, but with the mutex a user can not support a big backlog
> for a long time, it is shrunk to zero periodically. With these changes
> he can support a big backlog for a long time.
> 
> A big backlog affects other users. If someone creates namespaces, he
> probably expects that they will be destroyed for a reasonable time.

Here 2 different topics:
1)Limitation.
This problem exists in current code, and the patch does not make it worse.
If someone wants to limit number of net namespaces per user_ns/cgroup,
he should solve it in another way. Mutex does not help there.

2)Speed-up struct net destruction.
Many structures in kernel are released via rcu. The only difference
of struct net is it has more delayed actions, then task_struct for example.
User is not interested in the time, when task_struct is completely destroyed.
But I may imagine it's more important for struct net, if (for example) some
delayed connections are destroyed from ops->exit(). Are they?!
But like in the first paragraph, I think this is not related to the patch,
because in the current code you may generate dozen thousands of net namespaces
and then destroy them at once. And they will be destroyed for the same long time,
and there is additional worse thing, that it won't be possible to create a new
one for a long period. So, it's completely unrelated to the patch.
 
> But currently someone else can increase a destroy time to a really big
> values. This problem was before your patches, but they may do this
> problem worse. The question here is: Should we think about this problem
> in the context of these patches?

I think we shouldn't, because the main thing in this patch is parallel
execution of ops->init() and ->exit() and their safety. I hope people
will look at this and nothing will remain hidden.

The problem of long time destruction of net namespaces may be solved as next step.
As we will be able to destroy them in parallel, we'll implement percpu destroy
works and will destroy them much time faster then now.
 
>> Anyway, mutex
>> can't limit a number of something in general, I've never seen
>> a (good) example in kernel.
> 
> I'm agree with you here.
> 
> 
>>
>> As I see, the real limitation happen in inc_net_namespaces(),
>> which is decremented after RCU grace period in cleanup_net(),
>> and it has not changed.
> 
> ucount limits are to big to handle this problem.
> 
> 
>>
>>> There was a discussion a few month ago:
>>> https://lists.onap.org/pipermail/containers/2016-October/037509.html
>>>
>>>
>>>>
>>>> Origin:
>>>> real 1m24,190s
>>>> user 0m6,225s
>>>> sys 0m15,132s
>>>
>>> Here you measure time of creating and destroying net namespaces.
>>>
>>>>
>>>> Patched:
>>>> real 0m18,235s   (4.6 times faster)
>>>> user 0m4,544s
>>>> sys 0m13,796s
>>>
>>> But here you measure time of crearing namespaces and you know nothing
>>> when they will be destroyed.
>>
>> You're right, and I predict, the sum time, spent on cpu, will remain the same,
>> but the think is that now creation and destroying may be executed in parallel.