Message-ID: <20220901212520.11421-1-kuniyu@amazon.com>
Date: Thu, 1 Sep 2022 14:25:20 -0700
From: Kuniyuki Iwashima <kuniyu@...zon.com>
To: <pabeni@...hat.com>, <edumazet@...gle.com>
CC: <davem@...emloft.net>, <kuba@...nel.org>, <kuni1840@...il.com>,
<kuniyu@...zon.com>, <netdev@...r.kernel.org>
Subject: Re: [PATCH v3 net-next 3/5] tcp: Access &tcp_hashinfo via net.
From: Paolo Abeni <pabeni@...hat.com>
Date: Thu, 01 Sep 2022 12:57:33 +0200
> On Tue, 2022-08-30 at 12:15 -0700, Kuniyuki Iwashima wrote:
> > We will soon introduce an optional per-netns ehash.
> >
> > This means we cannot use tcp_hashinfo directly in most places.
> >
> > Instead, access it via net->ipv4.tcp_death_row->hashinfo.
> >
> > The access will be valid only while initialising tcp_hashinfo
> > itself and creating/destroying each netns.
> >
> > Signed-off-by: Kuniyuki Iwashima <kuniyu@...zon.com>
> > ---
> > .../chelsio/inline_crypto/chtls/chtls_cm.c | 5 +-
> > net/core/filter.c | 5 +-
> > net/ipv4/esp4.c | 3 +-
> > net/ipv4/inet_hashtables.c | 2 +-
> > net/ipv4/netfilter/nf_socket_ipv4.c | 2 +-
> > net/ipv4/netfilter/nf_tproxy_ipv4.c | 17 +++--
> > net/ipv4/tcp_diag.c | 18 +++++-
> > net/ipv4/tcp_ipv4.c | 63 +++++++++++--------
> > net/ipv4/tcp_minisocks.c | 2 +-
> > net/ipv6/esp6.c | 3 +-
> > net/ipv6/inet6_hashtables.c | 4 +-
> > net/ipv6/netfilter/nf_socket_ipv6.c | 2 +-
> > net/ipv6/netfilter/nf_tproxy_ipv6.c | 5 +-
> > net/ipv6/tcp_ipv6.c | 16 ++---
> > net/mptcp/mptcp_diag.c | 7 ++-
> > 15 files changed, 92 insertions(+), 62 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c b/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c
> > index ddfe9208529a..f90bfba4b303 100644
> > --- a/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c
> > +++ b/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c
> > @@ -1069,8 +1069,7 @@ static void chtls_pass_accept_rpl(struct sk_buff *skb,
> > cxgb4_l2t_send(csk->egress_dev, skb, csk->l2t_entry);
> > }
> >
> > -static void inet_inherit_port(struct inet_hashinfo *hash_info,
> > - struct sock *lsk, struct sock *newsk)
> > +static void inet_inherit_port(struct sock *lsk, struct sock *newsk)
> > {
> > local_bh_disable();
> > __inet_inherit_port(lsk, newsk);
> > @@ -1240,7 +1239,7 @@ static struct sock *chtls_recv_sock(struct sock *lsk,
> > ipv4.sysctl_tcp_window_scaling),
> > tp->window_clamp);
> > neigh_release(n);
> > - inet_inherit_port(&tcp_hashinfo, lsk, newsk);
> > + inet_inherit_port(lsk, newsk);
> > csk_set_flag(csk, CSK_CONN_INLINE);
> > bh_unlock_sock(newsk); /* tcp_create_openreq_child ->sk_clone_lock */
>
> It looks to me that the above chunks are functionally a no-op, and I
> think that omitting the 2 drivers from v2:
>
> https://lore.kernel.org/netdev/20220829161920.99409-4-kuniyu@amazon.com/
>
> should break mlx5/nfp inside a netns. I don't understand why include
> the above and skip the latter?!? I guess this is a question mostly for
> Eric :)
My best guess is that it's OK as long as the driver does not touch the
TCP stack deeply, and if it does, the driver developer must catch up with
the core changes so as not to burden the maintainers...?
If so, I understand that take. OTOH, I also don't want to break anything
when we know the change would do so.
So, I'm fine either staying as is or adding the change back in v4.
>
> > @@ -1728,6 +1728,7 @@ EXPORT_SYMBOL(tcp_v4_do_rcv);
> >
> > int tcp_v4_early_demux(struct sk_buff *skb)
> > {
> > + struct net *net = dev_net(skb->dev);
> > const struct iphdr *iph;
> > const struct tcphdr *th;
> > struct sock *sk;
> > @@ -1744,7 +1745,7 @@ int tcp_v4_early_demux(struct sk_buff *skb)
> > if (th->doff < sizeof(struct tcphdr) / 4)
> > return 0;
> >
> > - sk = __inet_lookup_established(dev_net(skb->dev), &tcp_hashinfo,
> > + sk = __inet_lookup_established(net, net->ipv4.tcp_death_row->hashinfo,
> > iph->saddr, th->source,
> > iph->daddr, ntohs(th->dest),
> > skb->skb_iif, inet_sdif(skb));
>
> /Me is thinking aloud...
>
> I'm wondering if the above has some measurable negative effect on
> large deployments using only the main netns?
>
> Specifically, are net->ipv4.tcp_death_row and
> net->ipv4.tcp_death_row->hashinfo already in the working set data for
> established sockets? Would the above increase the WSS by 2 cache-lines?
Currently, the death_row and hashinfo are touched around tw sockets or
connect(). If connections on the deployment are short-lived or frequently
initiated by the host itself, they would be hot and included in the WSS.
If the workload is a server with no active-close sockets, or the
connections are long-lived, then they might not be included in the WSS.
But I think that is less likely than the former if the deployment is
large enough.
If this change had a large impact, then we could revert fbb8295248e1,
which converted net->ipv4.tcp_death_row into a pointer for 0dad4087a86a,
which tried to fire a TW timer after the netns was freed, but
0dad4087a86a has already been reverted.
>
> Thanks!
>
> Paolo