Message-ID: <71411e21-0efa-0e41-836d-0301b1c1dbcb@virtuozzo.com>
Date:   Wed, 7 Mar 2018 12:22:08 +0300
From:   Kirill Tkhai <ktkhai@...tuozzo.com>
To:     Eric Dumazet <eric.dumazet@...il.com>, davem@...emloft.net,
        yoshfuji@...ux-ipv6.org, netdev@...r.kernel.org,
        Yuval Mintz <yuvalm@...lanox.com>
Subject: Re: [PATCH RESEND net-next] net: Do synchronize_rcu() in
 ip6mr_sk_done() only if this is needed

On 06.03.2018 19:50, Eric Dumazet wrote:
> On Tue, 2018-03-06 at 19:24 +0300, Kirill Tkhai wrote:
>> After unshare test kworker hangs for ages:
>>
>>     $ while :; do unshare -n true; done &
>>
>>     $ perf report <kworker>
>>     -   88,82%     0,00%  kworker/u16:0  [kernel.vmlinux]  [k] cleanup_net
>>          cleanup_net
>>        - ops_exit_list.isra.9
>>           - 85,68% igmp6_net_exit
>>              - 53,31% sock_release
>>                 - inet_release
>>                    - 25,33% rawv6_close
>>                       - ip6mr_sk_done
>>                          + 23,38% synchronize_rcu
>>
>> Keep in mind that this perf report shows only the time a function was
>> executing; it does not show the time it was sleeping. But it is easy
>> to imagine how much it sleeps if synchronize_rcu() execution takes
>> most of the time. Top shows the kworker R time is less than 1%.
>>
>> This happens because ip6mr_sk_done() calls synchronize_rcu() too
>> often, even for sockets that are not referenced in mr_table and do
>> not need it. So the kworker makes very slow progress.
>>
>> The patch introduces the obvious solution: it makes ip6mr_sk_done()
>> skip synchronize_rcu() for sockets that do not need it. After the
>> patch, the kworker is able to warm the CPU up as expected.
>>
>> Signed-off-by: Kirill Tkhai <ktkhai@...tuozzo.com>
>> ---
>>  net/ipv6/ip6mr.c |    4 +++-
>>  1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c
>> index 2a38f9de45d3..290a8d0d5eac 100644
>> --- a/net/ipv6/ip6mr.c
>> +++ b/net/ipv6/ip6mr.c
>> @@ -1485,7 +1485,9 @@ int ip6mr_sk_done(struct sock *sk)
>>  		}
>>  	}
>>  	rtnl_unlock();
>> -	synchronize_rcu();
>> +
>> +	if (!err)
>> +		synchronize_rcu();
>>  
> 
> 
> But... what is this synchronize_rcu() doing exactly ?
> 
> This was added in 8571ab479a6e1ef46ead5ebee567e128a422767c
> 
> ("ip6mr: Make mroute_sk rcu-based")
> 
> Typically on a delete, the synchronize_rcu() would be needed before
> freeing the deleted object.
> 
> But nowadays we have a better way: SOCK_RCU_FREE

Hm, I agree with you. This is a hot path, and mroute sockets created from userspace
will delay the close() and exit() of userspace tasks. Since there may be many such sockets,
we may get a zombie task which can't be reaped for ages. This slows down the system
directly.

The fix for pernet_operations works, but we need a generic solution instead.
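
To illustrate the cost argument, here is a minimal userspace model in C. It is not kernel
code: all names (model_sock, model_table, model_sk_done(), model_synchronize_rcu(),
model_close_many()) are made up for this sketch, and the stand-in for synchronize_rcu()
only counts calls instead of waiting for a grace period. It assumes a single table with
one registered mroute socket, and it only mirrors the shape of the check in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only: none of these names exist in the kernel. */
#define MODEL_MAX_SOCKS 1024

struct model_sock { int unused; };
struct model_table { struct model_sock *mroute_sk; };

/* Number of grace periods a closing task would have had to sleep through. */
unsigned long grace_periods_waited;

/* Stand-in for synchronize_rcu(): counts the call instead of sleeping. */
void model_synchronize_rcu(void)
{
	grace_periods_waited++;
}

/*
 * Shaped like ip6mr_sk_done(): returns 0 if sk was the registered mroute
 * socket of some table (and unregisters it), nonzero otherwise.
 * With only_if_needed == false it waits unconditionally (before the patch),
 * with only_if_needed == true it waits only when err == 0 (after the patch).
 */
int model_sk_done(struct model_table *tables, size_t n_tables,
		  struct model_sock *sk, bool only_if_needed)
{
	int err = -1;

	for (size_t i = 0; i < n_tables; i++) {
		if (tables[i].mroute_sk == sk) {
			tables[i].mroute_sk = NULL;
			err = 0;
			break;
		}
	}

	if (!only_if_needed || !err)
		model_synchronize_rcu();

	return err;
}

/* Close n_socks sockets, of which only the first is registered in the table. */
unsigned long model_close_many(size_t n_socks, bool only_if_needed)
{
	struct model_sock socks[MODEL_MAX_SOCKS];
	struct model_table table = { .mroute_sk = NULL };

	assert(n_socks >= 1 && n_socks <= MODEL_MAX_SOCKS);

	table.mroute_sk = &socks[0];
	grace_periods_waited = 0;

	for (size_t i = 0; i < n_socks; i++)
		model_sk_done(&table, 1, &socks[i], only_if_needed);

	return grace_periods_waited;
}
```

In this model, model_close_many(1000, false) counts 1000 waits, while
model_close_many(1000, true) counts 1. The conditional check removes the waits for
unrelated sockets, but the one registered socket still blocks its closer.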

Commit 8571ab479a6e1ef46ead5ebee567e128a422767c says:

    ip6mr: Make mroute_sk rcu-based
    
    In ipmr the mr_table socket is handled under RCU. Introduce the same
    for ip6mr.

It does not point to any improvement it brings or to the problem it solves. The description
looks like a cleanup. That is too expensive a cleanup if it makes performance a hundred
times worse.

Can't we simply revert it?!

Yuval, do you have ideas on how to fix that (maybe via SOCK_RCU_FREE, as Eric suggested)?

We actually use rcu_dereference() in ip6mr_cache_report() only. The only user of the
dereferenced pointer is sock_queue_rcv_skb(). A superficial analysis shows we purge the
queue in inet_sock_destruct().
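
For comparison, the idea behind a deferred free (as I understand SOCK_RCU_FREE: the
socket's destruction is handed to an RCU callback instead of the closer sleeping) can be
modelled in userspace like this. Again, this is only a sketch under assumptions: every
name below is hypothetical, the rcu_free field merely plays the role of the flag, and
model_grace_period_end() stands in for the end of a grace period.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Userspace model only: all names here are hypothetical. */
struct model_sock {
	bool rcu_free;			/* plays the role of SOCK_RCU_FREE */
	struct model_sock *next;	/* link in the deferred-free list */
};

static struct model_sock *deferred_head;

unsigned long model_blocking_waits;	/* grace periods a closer slept through */
unsigned long model_freed;		/* sockets actually released */

struct model_sock *model_sock_alloc(bool rcu_free)
{
	struct model_sock *sk = calloc(1, sizeof(*sk));

	assert(sk != NULL);
	sk->rcu_free = rcu_free;
	return sk;
}

/* Stand-in for the end of a grace period: run all queued frees. */
void model_grace_period_end(void)
{
	while (deferred_head) {
		struct model_sock *sk = deferred_head;

		deferred_head = sk->next;
		free(sk);
		model_freed++;
	}
}

void model_sock_close(struct model_sock *sk)
{
	if (sk->rcu_free) {
		/* Deferred path: queue the free and return immediately. */
		sk->next = deferred_head;
		deferred_head = sk;
		return;
	}

	/* Blocking path: the closer waits one grace period, then frees. */
	model_blocking_waits++;
	free(sk);
	model_freed++;
}
```

Closing any number of sockets allocated with rcu_free set leaves model_blocking_waits at
0; they are all released by the next model_grace_period_end(). Whether this is safe for
mroute_sk depends on the reader side described above, which the model does not cover.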

Thanks,
Kirill