[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <77aa882a0ac72cf434ecb44590f83ab9ece3b2f8.camel@redhat.com>
Date: Mon, 07 Nov 2022 11:03:38 +0100
From: Paolo Abeni <pabeni@...hat.com>
To: Kuniyuki Iwashima <kuniyu@...zon.com>,
"David S. Miller" <davem@...emloft.net>,
Eric Dumazet <edumazet@...gle.com>,
Jakub Kicinski <kuba@...nel.org>
Cc: Kuniyuki Iwashima <kuni1840@...il.com>, netdev@...r.kernel.org
Subject: Re: [PATCH v1 net-next 6/6] udp: Introduce optional per-netns hash
table.
On Fri, 2022-11-04 at 12:06 -0700, Kuniyuki Iwashima wrote:
> The maximum hash table size is 64K due to the nature of the protocol. [0]
> It's smaller than TCP, and fewer sockets can cause a performance drop.
>
> On an EC2 c5.24xlarge instance (192 GiB memory), after running iperf3 in
> different netns, creating 32Mi sockets without data transfer in the root
> netns causes regression for the iperf3's connection.
>
> uhash_entries sockets length Gbps
> 64K 1 1 5.69
> 1Mi 16 5.27
> 2Mi 32 4.90
> 4Mi 64 4.09
> 8Mi 128 2.96
> 16Mi 256 2.06
> 32Mi 512 1.12
>
> The per-netns hash table breaks the lengthy lists into shorter ones. It is
> useful on a multi-tenant system with thousands of netns. With smaller hash
> tables, we can look up sockets faster, isolate noisy neighbours, and reduce
> lock contention.
>
> The max size of the per-netns table is 64K as well. This is because the
> possible hash range by udp_hashfn() always fits in 64K within the same
> netns and we cannot make full use of the whole buckets larger than 64K.
>
> /* 0 < num < 64K -> X < hash < X + 64K */
> (num + net_hash_mix(net)) & mask;
>
> The sysctl usage is the same with TCP:
>
> $ dmesg | cut -d ' ' -f 6- | grep "UDP hash"
> UDP hash table entries: 65536 (order: 9, 2097152 bytes, vmalloc)
>
> # sysctl net.ipv4.udp_hash_entries
> net.ipv4.udp_hash_entries = 65536 # can be changed by uhash_entries
>
> # sysctl net.ipv4.udp_child_hash_entries
> net.ipv4.udp_child_hash_entries = 0 # disabled by default
>
> # ip netns add test1
> # ip netns exec test1 sysctl net.ipv4.udp_hash_entries
> net.ipv4.udp_hash_entries = -65536 # share the global table
>
> # sysctl -w net.ipv4.udp_child_hash_entries=100
> net.ipv4.udp_child_hash_entries = 100
>
> # ip netns add test2
> # ip netns exec test2 sysctl net.ipv4.udp_hash_entries
> net.ipv4.udp_hash_entries = 128 # own a per-netns table with 2^n buckets
>
> We could optimise the hash table lookup/iteration further by removing
> the netns comparison for the per-netns one in the future. Also, we
> could optimise the sparse udp_hslot layout by putting it in udp_table.
>
> [0]: https://lore.kernel.org/netdev/4ACC2815.7010101@gmail.com/
>
> Signed-off-by: Kuniyuki Iwashima <kuniyu@...zon.com>
> ---
> Documentation/networking/ip-sysctl.rst | 27 ++++++
> include/linux/udp.h | 2 +
> include/net/netns/ipv4.h | 2 +
> net/ipv4/sysctl_net_ipv4.c | 38 ++++++++
> net/ipv4/udp.c | 119 ++++++++++++++++++++++---
> 5 files changed, 178 insertions(+), 10 deletions(-)
>
> diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
> index 815efc89ad73..ea788ef4def0 100644
> --- a/Documentation/networking/ip-sysctl.rst
> +++ b/Documentation/networking/ip-sysctl.rst
> @@ -1177,6 +1177,33 @@ udp_rmem_min - INTEGER
> udp_wmem_min - INTEGER
> UDP does not have tx memory accounting and this tunable has no effect.
>
> +udp_hash_entries - INTEGER
> + Show the number of hash buckets for UDP sockets in the current
> + networking namespace.
> +
> + A negative value means the networking namespace does not own its
> + hash buckets and shares the initial networking namespace's one.
> +
> +udp_child_ehash_entries - INTEGER
> + Control the number of hash buckets for UDP sockets in the child
> + networking namespace, which must be set before clone() or unshare().
> +
> + If the value is not 0, the kernel uses a value rounded up to 2^n
> + as the actual hash bucket size. 0 is a special value, meaning
> + the child networking namespace will share the initial networking
> + namespace's hash buckets.
> +
> + Note that the child will use the global one in case the kernel
> + fails to allocate enough memory. In addition, the global hash
> + buckets are spread over available NUMA nodes, but the allocation
> + of the child hash table depends on the current process's NUMA
> + policy, which could result in performance differences.
> +
> + Possible values: 0, 2^n (n: 0 - 16 (64K))
If you constraint the non-zero minum size to UDP_HTABLE_SIZE_MIN - not
sure if that makes sense - than you could avoid dynamically allocating
the bitmap in udp_lib_get_port().
Cheers,
Paolo
Powered by blists - more mailing lists