lists.openwall.net | lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC | |
Open Source and information security mailing list archives
| ||
|
Date: Mon, 7 Nov 2022 10:41:29 -0800 From: Kuniyuki Iwashima <kuniyu@...zon.com> To: <pabeni@...hat.com> CC: <davem@...emloft.net>, <edumazet@...gle.com>, <kuba@...nel.org>, <kuni1840@...il.com>, <kuniyu@...zon.com>, <netdev@...r.kernel.org> Subject: Re: [PATCH v1 net-next 6/6] udp: Introduce optional per-netns hash table. From: Paolo Abeni <pabeni@...hat.com> Date: Mon, 07 Nov 2022 11:03:38 +0100 > On Fri, 2022-11-04 at 12:06 -0700, Kuniyuki Iwashima wrote: > > The maximum hash table size is 64K due to the nature of the protocol. [0] > > It's smaller than TCP, and fewer sockets can cause a performance drop. > > > > On an EC2 c5.24xlarge instance (192 GiB memory), after running iperf3 in > > different netns, creating 32Mi sockets without data transfer in the root > > netns causes regression for the iperf3's connection. > > > > uhash_entries sockets length Gbps > > 64K 1 1 5.69 > > 1Mi 16 5.27 > > 2Mi 32 4.90 > > 4Mi 64 4.09 > > 8Mi 128 2.96 > > 16Mi 256 2.06 > > 32Mi 512 1.12 > > > > The per-netns hash table breaks the lengthy lists into shorter ones. It is > > useful on a multi-tenant system with thousands of netns. With smaller hash > > tables, we can look up sockets faster, isolate noisy neighbours, and reduce > > lock contention. > > > > The max size of the per-netns table is 64K as well. This is because the > > possible hash range by udp_hashfn() always fits in 64K within the same > > netns and we cannot make full use of the whole buckets larger than 64K. > > > > /* 0 < num < 64K -> X < hash < X + 64K */ > > (num + net_hash_mix(net)) & mask; > > > > The sysctl usage is the same with TCP: > > > > $ dmesg | cut -d ' ' -f 6- | grep "UDP hash" > > UDP hash table entries: 65536 (order: 9, 2097152 bytes, vmalloc) > > > > # sysctl net.ipv4.udp_hash_entries > > net.ipv4.udp_hash_entries = 65536 # can be changed by uhash_entries > > > > # sysctl net.ipv4.udp_child_hash_entries > > net.ipv4.udp_child_hash_entries = 0 # disabled by default > > > > # ip netns add test1 > > # ip netns exec test1 sysctl net.ipv4.udp_hash_entries > > net.ipv4.udp_hash_entries = -65536 # share the global table > > > > # sysctl -w net.ipv4.udp_child_hash_entries=100 > > net.ipv4.udp_child_hash_entries = 100 > > > > # ip netns add test2 > > # ip netns exec test2 sysctl net.ipv4.udp_hash_entries > > net.ipv4.udp_hash_entries = 128 # own a per-netns table with 2^n buckets > > > > We could optimise the hash table lookup/iteration further by removing > > the netns comparison for the per-netns one in the future. Also, we > > could optimise the sparse udp_hslot layout by putting it in udp_table. > > > > [0]: https://lore.kernel.org/netdev/4ACC2815.7010101@gmail.com/ > > > > Signed-off-by: Kuniyuki Iwashima <kuniyu@...zon.com> > > --- > > Documentation/networking/ip-sysctl.rst | 27 ++++++ > > include/linux/udp.h | 2 + > > include/net/netns/ipv4.h | 2 + > > net/ipv4/sysctl_net_ipv4.c | 38 ++++++++ > > net/ipv4/udp.c | 119 ++++++++++++++++++++++--- > > 5 files changed, 178 insertions(+), 10 deletions(-) > > > > diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst > > index 815efc89ad73..ea788ef4def0 100644 > > --- a/Documentation/networking/ip-sysctl.rst > > +++ b/Documentation/networking/ip-sysctl.rst > > @@ -1177,6 +1177,33 @@ udp_rmem_min - INTEGER > > udp_wmem_min - INTEGER > > UDP does not have tx memory accounting and this tunable has no effect. > > > > +udp_hash_entries - INTEGER > > + Show the number of hash buckets for UDP sockets in the current > > + networking namespace. > > + > > + A negative value means the networking namespace does not own its > > + hash buckets and shares the initial networking namespace's one. > > + > > +udp_child_ehash_entries - INTEGER > > + Control the number of hash buckets for UDP sockets in the child > > + networking namespace, which must be set before clone() or unshare(). > > + > > + If the value is not 0, the kernel uses a value rounded up to 2^n > > + as the actual hash bucket size. 0 is a special value, meaning > > + the child networking namespace will share the initial networking > > + namespace's hash buckets. > > + > > + Note that the child will use the global one in case the kernel > > + fails to allocate enough memory. In addition, the global hash > > + buckets are spread over available NUMA nodes, but the allocation > > + of the child hash table depends on the current process's NUMA > > + policy, which could result in performance differences. > > + > > + Possible values: 0, 2^n (n: 0 - 16 (64K)) > > If you constraint the non-zero minum size to UDP_HTABLE_SIZE_MIN - not > sure if that makes sense - than you could avoid dynamically allocating > the bitmap in udp_lib_get_port(). Yes, but 256 didn't match for our case. Also, I was thinking like this not to affect the global table case. Which do you prefer ? uncompiled code: unsigned long *udp_get_bitmap(struct udp_table *udptable, unsigned long *bitmap_stack) { unsigned long *bitmap; if (udptable == &udp_table) return bitmap_stack; /* UDP_HTABLE_SIZE_MIN */ if (udptable->log >= 8) return bitmap_stack; bitmap = bitmap_alloc(udp_bitmap_size(udptable)); if (!bitmap) return bitmap_stack; return bitmap; } void udp_free_bitmap(unsigned long *bitmap, unsigned long *bitmap_stack) { if (bitmap != bitmap_stack) bitmap_free(bitmap); }
Powered by blists - more mailing lists