lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20220901212520.11421-1-kuniyu@amazon.com>
Date:   Thu, 1 Sep 2022 14:25:20 -0700
From:   Kuniyuki Iwashima <kuniyu@...zon.com>
To:     <pabeni@...hat.com>, <edumazet@...gle.com>
CC:     <davem@...emloft.net>, <kuba@...nel.org>, <kuni1840@...il.com>,
        <kuniyu@...zon.com>, <netdev@...r.kernel.org>
Subject: Re: [PATCH v3 net-next 3/5] tcp: Access &tcp_hashinfo via net.

From:   Paolo Abeni <pabeni@...hat.com>
Date:   Thu, 01 Sep 2022 12:57:33 +0200
> On Tue, 2022-08-30 at 12:15 -0700, Kuniyuki Iwashima wrote:
> > We will soon introduce an optional per-netns ehash.
> > 
> > This means we cannot use tcp_hashinfo directly in most places.
> > 
> > Instead, access it via net->ipv4.tcp_death_row->hashinfo.
> > 
> > The access will be valid only while initialising tcp_hashinfo
> > itself and creating/destroying each netns.
> > 
> > Signed-off-by: Kuniyuki Iwashima <kuniyu@...zon.com>
> > ---
> >  .../chelsio/inline_crypto/chtls/chtls_cm.c    |  5 +-
> >  net/core/filter.c                             |  5 +-
> >  net/ipv4/esp4.c                               |  3 +-
> >  net/ipv4/inet_hashtables.c                    |  2 +-
> >  net/ipv4/netfilter/nf_socket_ipv4.c           |  2 +-
> >  net/ipv4/netfilter/nf_tproxy_ipv4.c           | 17 +++--
> >  net/ipv4/tcp_diag.c                           | 18 +++++-
> >  net/ipv4/tcp_ipv4.c                           | 63 +++++++++++--------
> >  net/ipv4/tcp_minisocks.c                      |  2 +-
> >  net/ipv6/esp6.c                               |  3 +-
> >  net/ipv6/inet6_hashtables.c                   |  4 +-
> >  net/ipv6/netfilter/nf_socket_ipv6.c           |  2 +-
> >  net/ipv6/netfilter/nf_tproxy_ipv6.c           |  5 +-
> >  net/ipv6/tcp_ipv6.c                           | 16 ++---
> >  net/mptcp/mptcp_diag.c                        |  7 ++-
> >  15 files changed, 92 insertions(+), 62 deletions(-)
> > 
> > diff --git a/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c b/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c
> > index ddfe9208529a..f90bfba4b303 100644
> > --- a/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c
> > +++ b/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c
> > @@ -1069,8 +1069,7 @@ static void chtls_pass_accept_rpl(struct sk_buff *skb,
> >  	cxgb4_l2t_send(csk->egress_dev, skb, csk->l2t_entry);
> >  }
> >  
> > -static void inet_inherit_port(struct inet_hashinfo *hash_info,
> > -			      struct sock *lsk, struct sock *newsk)
> > +static void inet_inherit_port(struct sock *lsk, struct sock *newsk)
> >  {
> >  	local_bh_disable();
> >  	__inet_inherit_port(lsk, newsk);
> > @@ -1240,7 +1239,7 @@ static struct sock *chtls_recv_sock(struct sock *lsk,
> >  						     ipv4.sysctl_tcp_window_scaling),
> >  					   tp->window_clamp);
> >  	neigh_release(n);
> > -	inet_inherit_port(&tcp_hashinfo, lsk, newsk);
> > +	inet_inherit_port(lsk, newsk);
> >  	csk_set_flag(csk, CSK_CONN_INLINE);
> >  	bh_unlock_sock(newsk); /* tcp_create_openreq_child ->sk_clone_lock */
> 
> I looks to me that the above chunks are functionally a no-op and I
> think that omitting the 2 drivers from the v2:
> 
> https://lore.kernel.org/netdev/20220829161920.99409-4-kuniyu@amazon.com/
> 
> should break mlx5/nfp inside a netns. I don't understand why including
> the above and skipping the latters?!? I guess is a question mostly for
> Eric :)

My best guess is that it's ok unless it does not touch TCP stack deeply
and if it does, the driver developer must catch up with the core changes
not to burden maintainers...?

If so, I understand that take.  OTOH, I also don't want to break anything
when we know the change would do.

So, I'm fine to either stay as is or add the change in v4 again.


> 
> > @@ -1728,6 +1728,7 @@ EXPORT_SYMBOL(tcp_v4_do_rcv);
> >  
> >  int tcp_v4_early_demux(struct sk_buff *skb)
> >  {
> > +	struct net *net = dev_net(skb->dev);
> >  	const struct iphdr *iph;
> >  	const struct tcphdr *th;
> >  	struct sock *sk;
> > @@ -1744,7 +1745,7 @@ int tcp_v4_early_demux(struct sk_buff *skb)
> >  	if (th->doff < sizeof(struct tcphdr) / 4)
> >  		return 0;
> >  
> > -	sk = __inet_lookup_established(dev_net(skb->dev), &tcp_hashinfo,
> > +	sk = __inet_lookup_established(net, net->ipv4.tcp_death_row->hashinfo,
> >  				       iph->saddr, th->source,
> >  				       iph->daddr, ntohs(th->dest),
> >  				       skb->skb_iif, inet_sdif(skb));
> 
> /Me is thinking aloud...
> 
> I'm wondering if the above has some measurable negative effect for
> large deployments using only the main netns?
> 
> Specifically, are net->ipv4.tcp_death_row and net->ipv4.tcp_death_row-
> >hashinfo already into the working set data for established socket?
> Would the above increase the WSS by 2 cache-lines?

Currently, the death_row and hashinfo are touched around tw sockets or
connect().  If connections on the deployment are short-lived or frequently
initiated by itself, that would be host and included in WSS.

If the workload is server and there's no active-close() socket or
connections are long-lived, then it might not be included in WSS.
But I think it's not likely than the former if the deployment is
large enough.

If this change had large impact, then we could revert fbb8295248e1
which converted net->ipv4.tcp_death_row into pointer for 0dad4087a86a
that tried to fire a TW timer after netns is freed, but 0dad4087a86a
has already reverted.


> 
> Thanks!
> 
> Paolo

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ