netdev - Re: [PATCH RESEND net-next] net: Do synchronize_rcu() in ip6mr_sk

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <1520432537.109662.39.camel@gmail.com>
Date:   Wed, 07 Mar 2018 06:22:17 -0800
From:   Eric Dumazet <eric.dumazet@...il.com>
To:     Yuval Mintz <yuvalm@...lanox.com>,
        Kirill Tkhai <ktkhai@...tuozzo.com>
Cc:     "davem@...emloft.net" <davem@...emloft.net>,
        "yoshfuji@...ux-ipv6.org" <yoshfuji@...ux-ipv6.org>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: Re: [PATCH RESEND net-next] net: Do synchronize_rcu() in
 ip6mr_sk_done() only if this is needed

On Wed, 2018-03-07 at 13:51 +0000, Yuval Mintz wrote:
> > > > > After unshare test kworker hangs for ages:
> > > > > 
> > > > >     $ while :; do unshare -n true; done &
> > > > > 
> > > > >     $ perf report <kworker>
> > > > >     -   88,82%     0,00%  kworker/u16:0  [kernel.vmlinux]  [k
> > > > > ]
> > > > > cleanup_net
> > > > >          cleanup_net
> > > > >        - ops_exit_list.isra.9
> > > > >           - 85,68% igmp6_net_exit
> > > > >              - 53,31% sock_release
> > > > >                 - inet_release
> > > > >                    - 25,33% rawv6_close
> > > > >                       - ip6mr_sk_done
> > > > >                          + 23,38% synchronize_rcu
> > > > > 
> > > > > Keep in mind, this perf report shows the time a function was
> > > > > executing, and
> > > > > it does not show the time, it was sleeping. But it's easy to
> > > > > imagine,
> > > > > how
> > > > > much it is sleeping, if synchronize_rcu() execution takes the
> > > > > most
> > > > > time.
> > > > > Top shows the kworker R time is less then 1%.
> > > > > 
> > > > > This happen, because of in ip6mr_sk_done() we do too many
> > > > > synchronize_rcu(),
> > > > > even for the sockets, that are not referenced in mr_table,
> > > > > and which
> > > > > are not
> > > > > need it. So, the progress of kworker becomes very slow.
> > > > > 
> > > > > The patch introduces apparent solution, and it makes
> > > > > ip6mr_sk_done()
> > > > > to skip
> > > > > synchronize_rcu() for sockets, that are not need that. After
> > > > > the
> > > > > patch,
> > > > > kworker becomes able to warm the cpu up as expected.
> > > > > 
> > > > > Signed-off-by: Kirill Tkhai <ktkhai@...tuozzo.com>
> > > > > ---
> > > > >  net/ipv6/ip6mr.c |    4 +++-
> > > > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c
> > > > > index 2a38f9de45d3..290a8d0d5eac 100644
> > > > > --- a/net/ipv6/ip6mr.c
> > > > > +++ b/net/ipv6/ip6mr.c
> > > > > @@ -1485,7 +1485,9 @@ int ip6mr_sk_done(struct sock *sk)
> > > > >  		}
> > > > >  	}
> > > > >  	rtnl_unlock();
> > > > > -	synchronize_rcu();
> > > > > +
> > > > > +	if (!err)
> > > > > +		synchronize_rcu();
> > > > > 
> > > > 
> > > > 
> > > > But... what is this synchronize_rcu() doing exactly ?
> > > > 
> > > > This was added in 8571ab479a6e1ef46ead5ebee567e128a422767c
> > > > 
> > > > ("ip6mr: Make mroute_sk rcu-based")
> > > > 
> > > > Typically on a delete, the synchronize_rcu() would be needed
> > > > before
> > > > freeing the deleted object.
> > > > 
> > > > But nowadays we have better way : SOCK_RCU_FREE
> > > 
> > > Hm. I'm agree with you. This is hot path, and mroute sockets
> > > created from
> > 
> > userspace
> > > will delay userspace tasks close() and exit(). Since there may be
> > > many such
> > 
> > sockets,
> > > we may get a zombie task, which can't be reaped for ages. This
> > > slows down
> > 
> > the system
> > > directly.
> > > 
> > > Fix for pernet_operations works, but we need generic solution
> > > instead.
> > > 
> > > The commit "8571ab479a6e1ef46ead5ebee567e128a422767c" says:
> > > 
> > >     ip6mr: Make mroute_sk rcu-based
> > > 
> > >     In ipmr the mr_table socket is handled under RCU. Introduce
> > > the same
> > >     for ip6mr.
> > > 
> > > There is no pointing to improvements it invents, or to the
> > > problem it solves.
> > 
> > The description
> > > looks like a cleanup. It's too expensive cleanup, if it worsens
> > > the
> > 
> > performance a hundred
> > > times.
> > > 
> > > Can't we simply revert it?!
> > > 
> > > Yuval, do you have ideas to fix that (maybe, via SOCK_RCU_FREE
> > > suggested
> > 
> > by Eric)?
> 
> Sorry, failed to notice ip6mr_sk_done() is called unconditionally
> from
> rawv6_close(). But even with your suggested fix it should be
> ~resolved
> [as only sockets used for ip6mr would reach the sync]

Yes but some user setups could still hit this synchronize_rcu() quite
hard.

New synchronize_rcu() should not be added in possibly critical paths
unless absolutely needed (if really there is no other ways)

We have to be very careful here, since this kind of mistake can have a
100x impact.

> .
> Or are you claiming there's still some performance hit even with your
> suggested change?
> 
> > > 
> > > We actually use rcu_dereference() in ip6mr_cache_report() only.
> > > The only
> > 
> > user of dereference
> > > is sock_queue_rcv_skb(). Superficial analysis shows we purge the
> > > queue in
> > 
> > inet_sock_destruct().
> > 
> > + So this change should be safe.
> 
> I might have misunderstood the comment from
> commit 4c9687098f24 ("ipmr: RCU conversion of mroute_sk") when
> writing
> this; Thought comment regarding ip_ra_destroy() meant that for the
> IPv6 case
> we DO have to make sure there's a grace-period before destroying the
> socket.

Yeah but since 2010 things have changed a lot in the kernel ;)

And we now have SOCK_RCU_FREE for the few sockets that need RCU grace
period at dismantle.

Note that TCP sockets are RCU protected but do not request
SOCK_RCU_FREE in general (only listeners use this)

SOCK_RCU_FREE will not require a synchronize_rcu() stall, and instead
use the asynchronous RCU infrastructure ( call_rcu() )

So all we need now is to make sure these (RAW / ICMPV6) IPv6 sockets
have this flag set, then we can remove this problematic
synchronize_rcu() from ip6mr_sk_done()

I will test the following fix.

diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c
index 2a38f9de45d399fafc9f4bcc662b44be17279e51..7345bd6c4b7dda39c0d73d542e9ca9a5366542ff 100644
--- a/net/ipv6/ip6mr.c
+++ b/net/ipv6/ip6mr.c
@@ -1443,6 +1443,7 @@ static int ip6mr_sk_init(struct mr_table *mrt, struct sock *sk)
 		err = -EADDRINUSE;
 	} else {
 		rcu_assign_pointer(mrt->mroute_sk, sk);
+		sock_set_flag(sk, SOCK_RCU_FREE);
 		net->ipv6.devconf_all->mc_forwarding++;
 	}
 	write_unlock_bh(&mrt_lock);
@@ -1472,6 +1473,10 @@ int ip6mr_sk_done(struct sock *sk)
 		if (sk == rtnl_dereference(mrt->mroute_sk)) {
 			write_lock_bh(&mrt_lock);
 			RCU_INIT_POINTER(mrt->mroute_sk, NULL);
+			/* Note that mroute_sk had SOCK_RCU_FREE set,
+			 * so the RCU grace period before sk freeing
+			 * is guaranteed by sk_destruct()
+			 */
 			net->ipv6.devconf_all->mc_forwarding--;
 			write_unlock_bh(&mrt_lock);
 			inet6_netconf_notify_devconf(net, RTM_NEWNETCONF,
@@ -1485,7 +1490,6 @@ int ip6mr_sk_done(struct sock *sk)
 		}
 	}
 	rtnl_unlock();
-	synchronize_rcu();
 
 	return err;
 }