lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20221109190906.25023-1-kuniyu@amazon.com>
Date:   Wed, 9 Nov 2022 11:09:06 -0800
From:   Kuniyuki Iwashima <kuniyu@...zon.com>
To:     <pabeni@...hat.com>
CC:     <davem@...emloft.net>, <edumazet@...gle.com>, <kuba@...nel.org>,
        <kuni1840@...il.com>, <kuniyu@...zon.com>, <netdev@...r.kernel.org>
Subject: Re: [PATCH v1 net-next 6/6] udp: Introduce optional per-netns hash table.

From:   Paolo Abeni <pabeni@...hat.com>
Date:   Wed, 09 Nov 2022 10:02:49 +0100
> Hello,
> 
> On Mon, 2022-11-07 at 10:41 -0800, Kuniyuki Iwashima wrote:
> > From:   Paolo Abeni <pabeni@...hat.com>
> > Date:   Mon, 07 Nov 2022 11:03:38 +0100
> > > On Fri, 2022-11-04 at 12:06 -0700, Kuniyuki Iwashima wrote:
> > > > The maximum hash table size is 64K due to the nature of the protocol. [0]
> > > > It's smaller than TCP, and fewer sockets can cause a performance drop.
> > > > 
> > > > On an EC2 c5.24xlarge instance (192 GiB memory), after running iperf3 in
> > > > different netns, creating 32Mi sockets without data transfer in the root
> > > > netns causes regression for the iperf3's connection.
> > > > 
> > > >   uhash_entries		sockets		length		Gbps
> > > > 	    64K		      1		     1		5.69
> > > > 			    1Mi		    16		5.27
> > > > 			    2Mi		    32		4.90
> > > > 			    4Mi		    64		4.09
> > > > 			    8Mi		   128		2.96
> > > > 			   16Mi		   256		2.06
> > > > 			   32Mi		   512		1.12
> > > > 
> > > > The per-netns hash table breaks the lengthy lists into shorter ones.  It is
> > > > useful on a multi-tenant system with thousands of netns.  With smaller hash
> > > > tables, we can look up sockets faster, isolate noisy neighbours, and reduce
> > > > lock contention.
> > > > 
> > > > The max size of the per-netns table is 64K as well.  This is because the
> > > > possible hash range by udp_hashfn() always fits in 64K within the same
> > > > netns and we cannot make full use of the whole buckets larger than 64K.
> > > > 
> > > >   /* 0 < num < 64K  ->  X < hash < X + 64K */
> > > >   (num + net_hash_mix(net)) & mask;
> > > > 
> > > > The sysctl usage is the same with TCP:
> > > > 
> > > >   $ dmesg | cut -d ' ' -f 6- | grep "UDP hash"
> > > >   UDP hash table entries: 65536 (order: 9, 2097152 bytes, vmalloc)
> > > > 
> > > >   # sysctl net.ipv4.udp_hash_entries
> > > >   net.ipv4.udp_hash_entries = 65536  # can be changed by uhash_entries
> > > > 
> > > >   # sysctl net.ipv4.udp_child_hash_entries
> > > >   net.ipv4.udp_child_hash_entries = 0  # disabled by default
> > > > 
> > > >   # ip netns add test1
> > > >   # ip netns exec test1 sysctl net.ipv4.udp_hash_entries
> > > >   net.ipv4.udp_hash_entries = -65536  # share the global table
> > > > 
> > > >   # sysctl -w net.ipv4.udp_child_hash_entries=100
> > > >   net.ipv4.udp_child_hash_entries = 100
> > > > 
> > > >   # ip netns add test2
> > > >   # ip netns exec test2 sysctl net.ipv4.udp_hash_entries
> > > >   net.ipv4.udp_hash_entries = 128  # own a per-netns table with 2^n buckets
> > > > 
> > > > We could optimise the hash table lookup/iteration further by removing
> > > > the netns comparison for the per-netns one in the future.  Also, we
> > > > could optimise the sparse udp_hslot layout by putting it in udp_table.
> > > > 
> > > > [0]: https://lore.kernel.org/netdev/4ACC2815.7010101@gmail.com/
> > > > 
> > > > Signed-off-by: Kuniyuki Iwashima <kuniyu@...zon.com>
> > > > ---
> > > >  Documentation/networking/ip-sysctl.rst |  27 ++++++
> > > >  include/linux/udp.h                    |   2 +
> > > >  include/net/netns/ipv4.h               |   2 +
> > > >  net/ipv4/sysctl_net_ipv4.c             |  38 ++++++++
> > > >  net/ipv4/udp.c                         | 119 ++++++++++++++++++++++---
> > > >  5 files changed, 178 insertions(+), 10 deletions(-)
> > > > 
> > > > diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
> > > > index 815efc89ad73..ea788ef4def0 100644
> > > > --- a/Documentation/networking/ip-sysctl.rst
> > > > +++ b/Documentation/networking/ip-sysctl.rst
> > > > @@ -1177,6 +1177,33 @@ udp_rmem_min - INTEGER
> > > >  udp_wmem_min - INTEGER
> > > >  	UDP does not have tx memory accounting and this tunable has no effect.
> > > >  
> > > > +udp_hash_entries - INTEGER
> > > > +	Show the number of hash buckets for UDP sockets in the current
> > > > +	networking namespace.
> > > > +
> > > > +	A negative value means the networking namespace does not own its
> > > > +	hash buckets and shares the initial networking namespace's one.
> > > > +
> > > > +udp_child_ehash_entries - INTEGER
> > > > +	Control the number of hash buckets for UDP sockets in the child
> > > > +	networking namespace, which must be set before clone() or unshare().
> > > > +
> > > > +	If the value is not 0, the kernel uses a value rounded up to 2^n
> > > > +	as the actual hash bucket size.  0 is a special value, meaning
> > > > +	the child networking namespace will share the initial networking
> > > > +	namespace's hash buckets.
> > > > +
> > > > +	Note that the child will use the global one in case the kernel
> > > > +	fails to allocate enough memory.  In addition, the global hash
> > > > +	buckets are spread over available NUMA nodes, but the allocation
> > > > +	of the child hash table depends on the current process's NUMA
> > > > +	policy, which could result in performance differences.
> > > > +
> > > > +	Possible values: 0, 2^n (n: 0 - 16 (64K))
> > > 
> > > If you constraint the non-zero minum size to UDP_HTABLE_SIZE_MIN - not
> > > sure if that makes sense - than you could avoid dynamically allocating
> > > the bitmap in udp_lib_get_port(). 
> > 
> > Yes, but 256 didn't match for our case.  Also, I was thinking like
> > this not to affect the global table case.  Which do you prefer ?
> > 
> > uncompiled code:
> > 
> > unsigned long *udp_get_bitmap(struct udp_table *udptable,
> > 			      unsigned long *bitmap_stack)
> > {
> > 	unsigned long *bitmap;
> > 
> > 	if (udptable == &udp_table)
> > 		return bitmap_stack;
> > 
> > 	/* UDP_HTABLE_SIZE_MIN */
> > 	if (udptable->log >= 8)
> > 		return bitmap_stack;
> > 
> > 	bitmap = bitmap_alloc(udp_bitmap_size(udptable));
> > 	if (!bitmap)
> > 		return bitmap_stack;
> > 
> > 	return bitmap;
> > }
> > 
> > void udp_free_bitmap(unsigned long *bitmap,
> > 		     unsigned long *bitmap_stack)
> > {
> > 	if (bitmap != bitmap_stack)
> > 		bitmap_free(bitmap);
> > }
> 
> I'm sorry for the latency. The thing that makes me feel not so
> comfortable is the new allocation required in udp_lib_get_port().
> 
> The above will avoid such allocation for the common case, but not for
> every caller.
> 
> What about adding a 'unsigned long *bitmap' field to udp_table, setting
> it to the global bitmap for the main table, and allocating enough
> storage to include even the bitmap when allocating the per netns
> udp_table? e.g.

I like the idea, much cleaner!

> 
> L1_CACHE_ALIGN(hash_entries * 2 * sizeof(struct udp_hslot)) *
> BITS_TO_LONGS(udp_bitmap_size(udptable))

Maybe this ? as we don't use bitmap for hslot2.

L1_CACHE_ALIGN(hash_entries * 2 * sizeof(struct udp_hslot) +
               hash_entries * BITS_TO_LONGS(udp_bitmap_size(udptable)))

Thank you!

> 
> WDYT?
> 
> thanks!
> 
> Paolo

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ