[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20220830014947.27510-1-kuniyu@amazon.com>
Date: Mon, 29 Aug 2022 18:49:47 -0700
From: Kuniyuki Iwashima <kuniyu@...zon.com>
To: <kuniyu@...zon.com>
CC: <davem@...emloft.net>, <edumazet@...gle.com>, <kuba@...nel.org>,
<kuni1840@...il.com>, <netdev@...r.kernel.org>, <pabeni@...hat.com>
Subject: Re: [PATCH v2 net-next 4/5] tcp: Save unnecessary inet_twsk_purge() calls.
From: Kuniyuki Iwashima <kuniyu@...zon.com>
Date: Mon, 29 Aug 2022 16:34:53 -0700
> From: Eric Dumazet <edumazet@...gle.com>
> Date: Mon, 29 Aug 2022 16:11:57 -0700
> > On Mon, Aug 29, 2022 at 9:21 AM Kuniyuki Iwashima <kuniyu@...zon.com> wrote:
> > >
> > > While destroying netns, we call inet_twsk_purge() in tcp_sk_exit_batch()
> > > and tcpv6_net_exit_batch() for AF_INET and AF_INET6. These commands
> > > trigger the kernel to walk through the potentially big ehash twice even
> > > though the netns has no TIME_WAIT sockets.
> > >
> > > # ip netns add test
> > > # ip netns del test
> > >
> > > AF_INET6 uses module_init() to be loaded after AF_INET which uses
> > > fs_initcall(), so tcpv6_net_ops is always registered after tcp_sk_ops.
> > > Also, we clean up netns in the reverse order, so tcpv6_net_exit_batch()
> > > is always called before tcp_sk_exit_batch().
> > >
> > > The characteristic enables us to save such unneeded iterations. This
> > > change eliminates the tax by the additional unshare() described in the
> > > next patch to guarantee the per-netns ehash size.
> >
> > Patch seems wrong to me, or not complete at least...
> >
> > It seems you missed an existing check in inet_twsk_purge():
> >
> > if ((tw->tw_family != family) ||
> > refcount_read(&twsk_net(tw)->ns.count))
> > continue;
> >
> > To get rid of both IPv6 and IPv6 tw sockets, we currently need to call
> > inet_twsk_purge() twice,
> > with AF_INET and AF_INET6.
>
> For the first call of AF_INET6, if tw_refcount is 1, the count is what
> is set by tcp_sk_init() and there is no tw socket at least in the netns.
>
> And same for the AF_INET, if tw_refcount is 0 and death_row is NULL,
> there is no tw socket in the netns.
>
> If all netns in net_exit_list satify the condition, we need not call
> inet_twsk_purge().
>
> However, one of netns in the list doesn't satisfy it, we have to call
> inet_twsk_purge() twice. Then we need to check the ns.count because
...because the netns has tw socket in the global ehash and other
netns not in the list (not freed) may have tw sockets there.
> another netns in the list may be freed in tcp_sk_exit().
Please ignore this part, I seemed to be confused :)
So, we still need to call inet_twsk_purge() if there are tw sockets
in any netns in the batch list, but need not if there's no tw sockets
to get rid of in all netns.
>
> So, the optimisation works only when all netns in the batch list do not
> have tw sockets.
>
>
> > >
> > > Signed-off-by: Kuniyuki Iwashima <kuniyu@...zon.com>
> > > ---
> > > include/net/tcp.h | 1 +
> > > net/ipv4/tcp_ipv4.c | 6 ++++--
> > > net/ipv4/tcp_minisocks.c | 24 ++++++++++++++++++++++++
> > > net/ipv6/tcp_ipv6.c | 2 +-
> > > 4 files changed, 30 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/include/net/tcp.h b/include/net/tcp.h
> > > index d10962b9f0d0..f60996c1d7b3 100644
> > > --- a/include/net/tcp.h
> > > +++ b/include/net/tcp.h
> > > @@ -346,6 +346,7 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb);
> > > void tcp_rcv_space_adjust(struct sock *sk);
> > > int tcp_twsk_unique(struct sock *sk, struct sock *sktw, void *twp);
> > > void tcp_twsk_destructor(struct sock *sk);
> > > +void tcp_twsk_purge(struct list_head *net_exit_list, int family);
> > > ssize_t tcp_splice_read(struct socket *sk, loff_t *ppos,
> > > struct pipe_inode_info *pipe, size_t len,
> > > unsigned int flags);
> > > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > > index b07930643b11..f4a502d57d45 100644
> > > --- a/net/ipv4/tcp_ipv4.c
> > > +++ b/net/ipv4/tcp_ipv4.c
> > > @@ -3109,8 +3109,10 @@ static void __net_exit tcp_sk_exit(struct net *net)
> > > if (net->ipv4.tcp_congestion_control)
> > > bpf_module_put(net->ipv4.tcp_congestion_control,
> > > net->ipv4.tcp_congestion_control->owner);
> > > - if (refcount_dec_and_test(&tcp_death_row->tw_refcount))
> > > + if (refcount_dec_and_test(&tcp_death_row->tw_refcount)) {
> > > kfree(tcp_death_row);
> > > + net->ipv4.tcp_death_row = NULL;
> > > + }
> > > }
> > >
> > > static int __net_init tcp_sk_init(struct net *net)
> > > @@ -3210,7 +3212,7 @@ static void __net_exit tcp_sk_exit_batch(struct list_head *net_exit_list)
> > > {
> > > struct net *net;
> > >
> > > - inet_twsk_purge(&tcp_hashinfo, AF_INET);
> > > + tcp_twsk_purge(net_exit_list, AF_INET);
> > >
> > > list_for_each_entry(net, net_exit_list, exit_list)
> > > tcp_fastopen_ctx_destroy(net);
> > > diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> > > index 361aad67c6d6..9168c5a33344 100644
> > > --- a/net/ipv4/tcp_minisocks.c
> > > +++ b/net/ipv4/tcp_minisocks.c
> > > @@ -347,6 +347,30 @@ void tcp_twsk_destructor(struct sock *sk)
> > > }
> > > EXPORT_SYMBOL_GPL(tcp_twsk_destructor);
> > >
> > > +void tcp_twsk_purge(struct list_head *net_exit_list, int family)
> > > +{
> > > + struct net *net;
> > > +
> > > + list_for_each_entry(net, net_exit_list, exit_list) {
> > > + if (!net->ipv4.tcp_death_row)
> > > + continue;
> > > +
> > > + /* AF_INET6 using module_init() is always registered after
> > > + * AF_INET using fs_initcall() and cleaned up in the reverse
> > > + * order.
> > > + *
> > > + * The last refcount is decremented later in tcp_sk_exit().
> > > + */
> > > + if (IS_ENABLED(CONFIG_IPV6) && family == AF_INET6 &&
> > > + refcount_read(&net->ipv4.tcp_death_row->tw_refcount) == 1)
> > > + continue;
> > > +
> > > + inet_twsk_purge(&tcp_hashinfo, family);
> > > + break;
> > > + }
> > > +}
> > > +EXPORT_SYMBOL_GPL(tcp_twsk_purge);
> > > +
> > > /* Warning : This function is called without sk_listener being locked.
> > > * Be sure to read socket fields once, as their value could change under us.
> > > */
> > > diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> > > index 27b2fd98a2c4..9cbc7f0d7149 100644
> > > --- a/net/ipv6/tcp_ipv6.c
> > > +++ b/net/ipv6/tcp_ipv6.c
> > > @@ -2229,7 +2229,7 @@ static void __net_exit tcpv6_net_exit(struct net *net)
> > >
> > > static void __net_exit tcpv6_net_exit_batch(struct list_head *net_exit_list)
> > > {
> > > - inet_twsk_purge(&tcp_hashinfo, AF_INET6);
> > > + tcp_twsk_purge(net_exit_list, AF_INET6);
> > > }
> > >
> > > static struct pernet_operations tcpv6_net_ops = {
> > > --
> > > 2.30.2
> > >
Powered by blists - more mailing lists