Message-ID: <248395fc-7dd7-3c7d-affc-ced4145c5285@gmail.com>
Date:   Thu, 8 Sep 2022 11:13:07 -0700
From:   Eric Dumazet <eric.dumazet@...il.com>
To:     Kuniyuki Iwashima <kuniyu@...zon.com>,
        "David S. Miller" <davem@...emloft.net>,
        Eric Dumazet <edumazet@...gle.com>,
        Jakub Kicinski <kuba@...nel.org>,
        Paolo Abeni <pabeni@...hat.com>
Cc:     Kuniyuki Iwashima <kuni1840@...il.com>, netdev@...r.kernel.org
Subject: Re: [PATCH v6 net-next 0/6] tcp: Introduce optional per-netns ehash.


On 9/7/22 18:10, Kuniyuki Iwashima wrote:
> The more sockets we have in the hash table, the longer we spend looking
> up the socket.  While running a number of small workloads on the same
> host, they penalise each other and cause performance degradation.
>
> The root cause might be a single workload that consumes much more
> resources than the others.  It often happens on a cloud service where
> different workloads share the same computing resource.
>
> On an EC2 c5.24xlarge instance (196 GiB memory and 524288 (1Mi / 2) ehash
> entries), after running iperf3 in a different netns, creating 24Mi sockets
> without data transfer in the root netns causes about a 10% performance
> regression on the iperf3 connection.
>
>   thash_entries    sockets    length    Gbps
>          524288          1         1    50.7
>                       24Mi        48    45.1
>
> The drop is basically related to the length of the list in each hash
> bucket.  To see how performance degrades as the lists grow, I set
> thash_entries to 131072 (1Mi / 8); here is the result.
>
>   thash_entries    sockets    length    Gbps
>          131072          1         1    50.7
>                        1Mi         8    49.9
>                        2Mi        16    48.9
>                        4Mi        32    47.3
>                        8Mi        64    44.6
>                       16Mi       128    40.6
>                       24Mi       192    36.3
>                       32Mi       256    32.5
>                       40Mi       320    27.0
>                       48Mi       384    25.0
>
> To resolve the socket lookup degradation, we introduce an optional
> per-netns hash table for TCP, but it's just ehash, and we still share
> the global bhash, bhash2 and lhash2.
>
> With a smaller ehash, we can look up non-listener sockets faster and
> isolate such noisy neighbours.  Also, we can reduce lock contention.
>
> For details, please see the last patch.
>
>    patch 1 - 4: prep for per-netns ehash
>    patch     5: small optimisation for netns dismantle without TIME_WAIT sockets
>    patch     6: add per-netns ehash
>
> Many thanks to Eric Dumazet for reviewing and advising.
>
>
> Changes:
>    v6:
>      * Patch 6
>        * Use vmalloc_huge() in inet_pernet_hashinfo_alloc() and
>          update the changelog and doc about NUMA (Eric Dumazet)
>        * Use kmemdup() in inet_pernet_hashinfo_alloc() (Eric Dumazet)
>        * Use vfree() in inet_pernet_hashinfo_(alloc|free)()
>
>    v5: https://lore.kernel.org/netdev/20220907005534.72876-1-kuniyu@amazon.com/
>      * Patch 2
>        * Keep the tw_refcount base value at 1 (Eric Dumazet)
>        * Add WARN_ON_ONCE() for tw_refcount (Eric Dumazet)
>      * Patch 5
>        * Test tw_refcount against 1 in tcp_twsk_purge()
>
>    v4: https://lore.kernel.org/netdev/20220906162423.44410-1-kuniyu@amazon.com/
>      * Add Patch 2
>      * Patch 1
>        * Add cleanups in tcp_time_wait() and tcp_v[46]_connect()
>      * Patch 3
>        * /tcp_death_row/s/->/./
>      * Patch 4
>        * Add mellanox and netronome driver changes back (Paolo Abeni, Jakub Kicinski)
>        * /tcp_death_row/s/->/./
>      * Patch 5
>        * Simplify tcp_twsk_purge()
>      * Patch 6
>        * Move inet_pernet_hashinfo_free() into tcp_sk_exit_batch()
>
>    v3: https://lore.kernel.org/netdev/20220830191518.77083-1-kuniyu@amazon.com/
>      * Patch 3
>        * Drop mellanox and netronome driver changes (Eric Dumazet)
>      * Patch 4
>        * Add test results in the changelog
>      * Patch 5
>        * Use roundup_pow_of_two() in tcp_set_hashinfo() (Eric Dumazet)
>        * Remove proc_tcp_child_ehash_entries() and use proc_douintvec_minmax()
>
>    v2: https://lore.kernel.org/netdev/20220829161920.99409-1-kuniyu@amazon.com/
>      * Drop flock() and UDP stuff
>      * Patch 2
>        * Rename inet_get_hashinfo() to tcp_or_dccp_get_hashinfo() (Eric Dumazet)
>      * Patch 4
>        * Remove unnecessary inet_twsk_purge() calls for unshare()
>        * Factorise inet_twsk_purge() calls (Eric Dumazet)
>      * Patch 5
>        * Change max buckets size to 16Mi
>        * Use unsigned int for ehash size (Eric Dumazet)
>        * Use GFP_KERNEL_ACCOUNT for the per-netns ehash allocation (Eric Dumazet)
>        * Use current->nsproxy->net_ns for parent netns (Eric Dumazet)
>
>    v1: https://lore.kernel.org/netdev/20220826000445.46552-1-kuniyu@amazon.com/
>

SGTM, thanks.

For the whole series:

Reviewed-by: Eric Dumazet <edumazet@...gle.com>
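
For readers following the numbers in the quoted cover letter: the "length"
column appears to be the average number of sockets per ehash bucket, i.e.
sockets divided by thash_entries. The sketch below is only an illustration
under that assumption (uniform hashing, and Mi meaning 2^20). The helper name
avg_chain_length() is hypothetical and is not part of the series; the Gbps
figures are copied from the second table above.

```python
MI = 1 << 20  # "Mi" in the tables is taken to mean 2**20 (assumption)


def avg_chain_length(sockets: int, buckets: int) -> float:
    """Average sockets per hash bucket, assuming hashes spread uniformly.

    Hypothetical helper for illustration; not a kernel function.
    """
    return sockets / buckets


# (sockets in Mi, measured Gbps) copied from the thash_entries=131072 table.
ROWS = [(1, 49.9), (2, 48.9), (4, 47.3), (8, 44.6), (16, 40.6),
        (24, 36.3), (32, 32.5), (40, 27.0), (48, 25.0)]
BUCKETS = 131072  # 1Mi / 8

if __name__ == "__main__":
    for mi, gbps in ROWS:
        length = avg_chain_length(mi * MI, BUCKETS)
        print(f"{mi:>3}Mi sockets -> avg chain length {length:5.0f}, {gbps} Gbps")
```

With the default 524288 buckets, the same formula gives 24Mi / 524288 = 48,
which matches the first table.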

