Date:   Wed, 7 Mar 2018 12:32:49 +0300
From:   Kirill Tkhai <ktkhai@...tuozzo.com>
To:     Eric Dumazet <eric.dumazet@...il.com>, davem@...emloft.net,
        yoshfuji@...ux-ipv6.org, netdev@...r.kernel.org,
        Yuval Mintz <yuvalm@...lanox.com>
Subject: Re: [PATCH RESEND net-next] net: Do synchronize_rcu() in
 ip6mr_sk_done() only if this is needed

On 07.03.2018 12:22, Kirill Tkhai wrote:
> On 06.03.2018 19:50, Eric Dumazet wrote:
>> On Tue, 2018-03-06 at 19:24 +0300, Kirill Tkhai wrote:
>>> After unshare test kworker hangs for ages:
>>>
>>>     $ while :; do unshare -n true; done &
>>>
>>>     $ perf report <kworker>
>>>     -   88,82%     0,00%  kworker/u16:0  [kernel.vmlinux]  [k] cleanup_net
>>>          cleanup_net
>>>        - ops_exit_list.isra.9
>>>           - 85,68% igmp6_net_exit
>>>              - 53,31% sock_release
>>>                 - inet_release
>>>                    - 25,33% rawv6_close
>>>                       - ip6mr_sk_done
>>>                          + 23,38% synchronize_rcu
>>>
>>> Keep in mind, this perf report shows the time a function was executing;
>>> it does not show the time it was sleeping. But it's easy to imagine how
>>> much it sleeps, if synchronize_rcu() takes most of the execution time.
>>> Top shows the kworker's running (R) time is less than 1%.
>>>
>>> This happens because ip6mr_sk_done() calls synchronize_rcu() too many
>>> times, even for sockets that are not referenced in mr_table and do not
>>> need it. So the kworker's progress becomes very slow.
>>>
>>> The patch introduces the obvious solution: it makes ip6mr_sk_done()
>>> skip synchronize_rcu() for sockets that do not need it. After the
>>> patch, the kworker is able to warm the cpu up as expected.
>>>
>>> Signed-off-by: Kirill Tkhai <ktkhai@...tuozzo.com>
>>> ---
>>>  net/ipv6/ip6mr.c |    4 +++-
>>>  1 file changed, 3 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c
>>> index 2a38f9de45d3..290a8d0d5eac 100644
>>> --- a/net/ipv6/ip6mr.c
>>> +++ b/net/ipv6/ip6mr.c
>>> @@ -1485,7 +1485,9 @@ int ip6mr_sk_done(struct sock *sk)
>>>  		}
>>>  	}
>>>  	rtnl_unlock();
>>> -	synchronize_rcu();
>>> +
>>> +	if (!err)
>>> +		synchronize_rcu();
>>>  
>>
>>
>> But... what is this synchronize_rcu() doing exactly ?
>>
>> This was added in 8571ab479a6e1ef46ead5ebee567e128a422767c
>>
>> ("ip6mr: Make mroute_sk rcu-based")
>>
>> Typically on a delete, the synchronize_rcu() would be needed before
>> freeing the deleted object.
>>
>> But nowadays we have better way : SOCK_RCU_FREE
> 
> Hm. I agree with you. This is a hot path, and mroute sockets created from userspace
> will delay userspace tasks' close() and exit(). Since there may be many such sockets,
> we may get a zombie task which can't be reaped for ages. This slows down the system
> directly.
> 
> The fix for pernet_operations works, but we need a generic solution instead.
> 
> The commit "8571ab479a6e1ef46ead5ebee567e128a422767c" says:
> 
>     ip6mr: Make mroute_sk rcu-based
>     
>     In ipmr the mr_table socket is handled under RCU. Introduce the same
>     for ip6mr.
> 
> It does not point to any improvement it introduces, or to a problem it solves. The
> description looks like a cleanup. That is too expensive a cleanup, if it worsens
> performance a hundred times.
> 
> Can't we simply revert it?!
> 
> Yuval, do you have ideas to fix that (maybe, via SOCK_RCU_FREE suggested by Eric)?
> 
> We actually use rcu_dereference() only in ip6mr_cache_report(). The only user of the
> dereferenced socket is sock_queue_rcv_skb(). A superficial analysis shows we purge
> the queue in inet_sock_destruct().

+ So this change should be safe.

> Thanks,
> Kirill
> 
