Message-ID: <CANn89i+pfVeH0Gs4tFPcZstnfxjz-Vp2D86H5AQsdsR_+p_3qQ@mail.gmail.com>
Date: Fri, 26 Aug 2022 08:17:25 -0700
From: Eric Dumazet <edumazet@...gle.com>
To: Kuniyuki Iwashima <kuniyu@...zon.com>
Cc: "David S. Miller" <davem@...emloft.net>,
Jakub Kicinski <kuba@...nel.org>,
Paolo Abeni <pabeni@...hat.com>,
Jeff Layton <jlayton@...nel.org>,
Chuck Lever <chuck.lever@...cle.com>,
Luis Chamberlain <mcgrof@...nel.org>,
Kees Cook <keescook@...omium.org>,
Iurii Zaikin <yzaikin@...gle.com>,
Kuniyuki Iwashima <kuni1840@...il.com>,
netdev <netdev@...r.kernel.org>,
linux-fsdevel <linux-fsdevel@...r.kernel.org>
Subject: Re: [PATCH v1 net-next 00/13] tcp/udp: Introduce optional per-netns
hash table.
On Thu, Aug 25, 2022 at 5:05 PM Kuniyuki Iwashima <kuniyu@...zon.com> wrote:
>
> The more sockets we have in the hash table, the more time we spend
> looking up the socket. While running a number of small workloads on
> the same host, they penalise each other and cause performance degradation.
>
> Also, the root cause might be a single workload that consumes much more
> resources than the others. It often happens on a cloud service where
> different workloads share the same computing resource.
>
> On an EC2 c5.24xlarge instance (196 GiB memory and 524288 (1Mi / 2) ehash
> entries), after running iperf3 in a different netns, creating 24Mi sockets
> without data transfer in the root netns causes about a 10% performance
> regression for the iperf3 connection.
>
>   thash_entries  sockets  length  Gbps
>          524288        1       1  50.7
>                     24Mi      48  45.1
>
> The regression is basically related to the length of the list in each
> hash bucket. To see how performance drops as the list grows, I set
> thash_entries to 131072 (1Mi / 8), and here's the result.
>
>   thash_entries  sockets  length  Gbps
>          131072        1       1  50.7
>                      1Mi       8  49.9
>                      2Mi      16  48.9
>                      4Mi      32  47.3
>                      8Mi      64  44.6
>                     16Mi     128  40.6
>                     24Mi     192  36.3
>                     32Mi     256  32.5
>                     40Mi     320  27.0
>                     48Mi     384  25.0
>
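As a sanity check on the tables above, the length column appears to be the socket count divided by thash_entries, rounded up, and the throughput loss can be read as a percentage of the 50.7 Gbps baseline. A minimal Python sketch of that arithmetic follows; the helper names are hypothetical and nothing here comes from the series itself.

```python
MI = 1 << 20  # "Mi" as used in the tables: 2**20


def chain_length(sockets, buckets):
    """Average bucket list length, rounded up (assumes an even spread)."""
    return -(-sockets // buckets)


def relative_drop(baseline_gbps, gbps):
    """Throughput loss as a percentage of the baseline."""
    return (baseline_gbps - gbps) / baseline_gbps * 100


# First table: 24Mi sockets over 524288 buckets.
first_table_length = chain_length(24 * MI, 524288)   # 48
first_table_drop = relative_drop(50.7, 45.1)          # roughly 11%

# Second table: 48Mi sockets over 131072 buckets.
second_table_length = chain_length(48 * MI, 131072)  # 384
```

The even-spread assumption is an idealisation; real bucket occupancy depends on the hash function and the connection tuples.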
> To resolve the socket lookup degradation, we introduce an optional
> per-netns hash table for TCP and UDP. With a smaller hash table, we
> can look up sockets faster and isolate noisy neighbours. Also, we can
> reduce lock contention.
>
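The isolation argument can be illustrated with a toy chained hash table. This is only a sketch under simplifying assumptions (integer keys, `key % buckets` as the hash, appending to the tail of a chain); the class and function names are hypothetical and it is not a model of the kernel's ehash code.

```python
# Toy model only: a chained hash table with integer keys, not kernel code.
class ChainedTable:
    def __init__(self, num_buckets):
        self.buckets = [[] for _ in range(num_buckets)]

    def insert(self, key):
        # Idealised hash: key modulo the bucket count.
        self.buckets[key % len(self.buckets)].append(key)

    def lookup_cost(self, key):
        # Number of entries compared before `key` is found in its chain.
        chain = self.buckets[key % len(self.buckets)]
        return chain.index(key) + 1


def cost_in_shared_table(noisy_sockets, num_buckets, our_key):
    """Lookup cost for our socket when a neighbour fills the same table.

    our_key must be >= noisy_sockets so that it does not collide with
    one of the neighbour's keys.
    """
    table = ChainedTable(num_buckets)
    for key in range(noisy_sockets):
        table.insert(key)
    table.insert(our_key)
    return table.lookup_cost(our_key)


def cost_in_own_table(num_buckets, our_key):
    """Lookup cost for our socket in a small table of its own."""
    table = ChainedTable(num_buckets)
    table.insert(our_key)
    return table.lookup_cost(our_key)
```

With 64 neighbour sockets in an 8-bucket shared table, `cost_in_shared_table(64, 8, 64)` walks 9 entries, while `cost_in_own_table(2, 64)` walks 1 even though the private table is smaller.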
> We can control and check the hash size via sysctl knobs. It requires
> some tuning based on workloads, so the per-netns hash table is disabled
> by default.
>
> # dmesg | cut -d ' ' -f 5- | grep "established hash"
> TCP established hash table entries: 524288 (order: 10, 4194304 bytes, vmalloc hugepage)
>
> # sysctl net.ipv4.tcp_ehash_entries
> net.ipv4.tcp_ehash_entries = 524288 # can be changed by thash_entries
>
> # sysctl net.ipv4.tcp_child_ehash_entries
> net.ipv4.tcp_child_ehash_entries = 0 # disabled by default
>
> # ip netns add test1
> # ip netns exec test1 sysctl net.ipv4.tcp_ehash_entries
> net.ipv4.tcp_ehash_entries = -524288 # share the global ehash
>
> # sysctl -w net.ipv4.tcp_child_ehash_entries=100
> net.ipv4.tcp_child_ehash_entries = 100
>
> # sysctl net.ipv4.tcp_child_ehash_entries
> net.ipv4.tcp_child_ehash_entries = 128 # rounded up to 2^n
>
> # ip netns add test2
> # ip netns exec test2 sysctl net.ipv4.tcp_ehash_entries
> net.ipv4.tcp_ehash_entries = 128 # own per-netns ehash
>
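The session above shows two conventions: a requested size of 100 is rounded up to 128, and a negative tcp_ehash_entries value means the netns shares the global table. A small Python sketch of both follows, assuming the rounding is a plain round-up to the next power of two; the function names are hypothetical and the return values are just an illustration.

```python
def roundup_pow_of_two(n):
    """Smallest power of two >= n, for n >= 1."""
    if n < 1:
        raise ValueError("n must be >= 1")
    return 1 << (n - 1).bit_length()


def describe_ehash_entries(value):
    """Interpret a tcp_ehash_entries reading as shown in the session above.

    Negative: the netns shares the global ehash of abs(value) entries.
    Positive: the table belongs to this netns (in the root netns, that is
    the global table itself).
    Zero does not appear in the examples above, so it is rejected here.
    """
    if value < 0:
        return ("shared", -value)
    if value > 0:
        return ("own", value)
    raise ValueError("0 is not shown as a tcp_ehash_entries value")
```

For the values in the session, `roundup_pow_of_two(100)` gives 128, `describe_ehash_entries(-524288)` gives `("shared", 524288)`, and `describe_ehash_entries(128)` gives `("own", 128)`.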
> [ UDP has the same interface as udp_hash_entries and
> udp_child_hash_entries. ]
>
> When creating per-netns hash tables concurrently with different sizes,
> we can guarantee the size in one of the following ways.
>
> 1) Share the global hash table and create per-netns one
>
> First, unshare() with tcp_child_ehash_entries==0. It creates dedicated
> netns sysctl knobs where we can safely change tcp_child_ehash_entries
> and clone()/unshare() to create a per-netns hash table.
>
> 2) Lock the sysctl knob
>
This is orthogonal.
Your series really should have been split into three.
I do not want to discuss the merits of reinstating LOCK_MAND :/
> We can use flock(LOCK_MAND) or BPF_PROG_TYPE_CGROUP_SYSCTL to allow/deny
> read/write on sysctl knobs.
>
> For details, please see each patch.
>
> patch 1 - 3: mandatory lock support for sysctl (fs stuff)
> patch 4 - 7: prep patch for per-netns TCP ehash
> patch 8: add per-netns TCP ehash
> patch 9 - 12: prep patch for per-netns UDP hash table
> patch 13: add per-netns UDP hash table
>
>
> Kuniyuki Iwashima (13):
> fs/lock: Revive LOCK_MAND.
> sysctl: Support LOCK_MAND for read/write.
> selftest: sysctl: Add test for flock(LOCK_MAND).
> net: Introduce init2() for pernet_operations.
> tcp: Clean up some functions.
> tcp: Set NULL to sk->sk_prot->h.hashinfo.
> tcp: Access &tcp_hashinfo via net.
> tcp: Introduce optional per-netns ehash.
> udp: Clean up some functions.
> udp: Set NULL to sk->sk_prot->h.udp_table.
> udp: Set NULL to udp_seq_afinfo.udp_table.
> udp: Access &udp_table via net.
> udp: Introduce optional per-netns hash table.
>
> Documentation/networking/ip-sysctl.rst | 40 +++++
> .../chelsio/inline_crypto/chtls/chtls_cm.c | 5 +-
> .../mellanox/mlx5/core/en_accel/ktls_rx.c | 5 +-
> .../net/ethernet/netronome/nfp/crypto/tls.c | 5 +-
> fs/locks.c | 83 ++++++---
> fs/proc/proc_sysctl.c | 25 ++-
> include/linux/fs.h | 1 +
> include/net/inet_hashtables.h | 16 ++
> include/net/net_namespace.h | 3 +
> include/net/netns/ipv4.h | 4 +
> include/uapi/asm-generic/fcntl.h | 5 -
> net/core/filter.c | 9 +-
> net/core/net_namespace.c | 18 +-
> net/dccp/proto.c | 2 +
> net/ipv4/af_inet.c | 2 +-
> net/ipv4/esp4.c | 3 +-
> net/ipv4/inet_connection_sock.c | 25 ++-
> net/ipv4/inet_hashtables.c | 102 ++++++++---
> net/ipv4/inet_timewait_sock.c | 4 +-
> net/ipv4/netfilter/nf_socket_ipv4.c | 2 +-
> net/ipv4/netfilter/nf_tproxy_ipv4.c | 17 +-
> net/ipv4/sysctl_net_ipv4.c | 113 ++++++++++++
> net/ipv4/tcp.c | 1 +
> net/ipv4/tcp_diag.c | 18 +-
> net/ipv4/tcp_ipv4.c | 122 +++++++++----
> net/ipv4/tcp_minisocks.c | 2 +-
> net/ipv4/udp.c | 164 ++++++++++++++----
> net/ipv4/udp_diag.c | 6 +-
> net/ipv4/udp_offload.c | 5 +-
> net/ipv6/esp6.c | 3 +-
> net/ipv6/inet6_hashtables.c | 4 +-
> net/ipv6/netfilter/nf_socket_ipv6.c | 2 +-
> net/ipv6/netfilter/nf_tproxy_ipv6.c | 5 +-
> net/ipv6/tcp_ipv6.c | 30 +++-
> net/ipv6/udp.c | 31 ++--
> net/ipv6/udp_offload.c | 5 +-
> net/mptcp/mptcp_diag.c | 7 +-
> tools/testing/selftests/sysctl/.gitignore | 2 +
> tools/testing/selftests/sysctl/Makefile | 9 +-
> tools/testing/selftests/sysctl/sysctl_flock.c | 157 +++++++++++++++++
> 40 files changed, 854 insertions(+), 208 deletions(-)
> create mode 100644 tools/testing/selftests/sysctl/.gitignore
> create mode 100644 tools/testing/selftests/sysctl/sysctl_flock.c
>
> --
> 2.30.2
>