netdev - Re: [PATCH v5 net-next 6/6] tcp: Introduce optional per-netns ehash.

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20220907214644.34344-1-kuniyu@amazon.com>
Date:   Wed, 7 Sep 2022 14:46:44 -0700
From:   Kuniyuki Iwashima <kuniyu@...zon.com>
To:     <edumazet@...gle.com>
CC:     <davem@...emloft.net>, <kuba@...nel.org>, <kuni1840@...il.com>,
        <kuniyu@...zon.com>, <netdev@...r.kernel.org>, <pabeni@...hat.com>
Subject: Re: [PATCH v5 net-next 6/6] tcp: Introduce optional per-netns ehash.

From:   Eric Dumazet <edumazet@...gle.com>
Date:   Wed, 7 Sep 2022 13:55:08 -0700
> On Tue, Sep 6, 2022 at 5:57 PM Kuniyuki Iwashima <kuniyu@...zon.com> wrote:
> >
> > The more sockets we have in the hash table, the longer we spend looking
> > up the socket.  While running a number of small workloads on the same
> > host, they penalise each other and cause performance degradation.
> >
> > The root cause might be a single workload that consumes much more
> > resources than the others.  It often happens on a cloud service where
> > different workloads share the same computing resource.
> >
> > On EC2 c5.24xlarge instance (196 GiB memory and 524288 (1Mi / 2) ehash
> > entries), after running iperf3 in different netns, creating 24Mi sockets
> > without data transfer in the root netns causes about 10% performance
> > regression for the iperf3's connection.
> >
> >  thash_entries          sockets         length          Gbps
> >         524288                1              1          50.7
> >                            24Mi             48          45.1
> >
> > It is basically related to the length of the list of each hash bucket.
> > For testing purposes to see how performance drops along the length,
> > I set 131072 (1Mi / 8) to thash_entries, and here's the result.
> >
> >  thash_entries          sockets         length          Gbps
> >         131072                1              1          50.7
> >                             1Mi              8          49.9
> >                             2Mi             16          48.9
> >                             4Mi             32          47.3
> >                             8Mi             64          44.6
> >                            16Mi            128          40.6
> >                            24Mi            192          36.3
> >                            32Mi            256          32.5
> >                            40Mi            320          27.0
> >                            48Mi            384          25.0
> >
> > To resolve the socket lookup degradation, we introduce an optional
> > per-netns hash table for TCP, but it's just ehash, and we still share
> > the global bhash, bhash2 and lhash2.
> >
> > With a smaller ehash, we can look up non-listener sockets faster and
> > isolate such noisy neighbours.  In addition, we can reduce lock contention.
> >
> > We can control the ehash size by a new sysctl knob.  However, depending
> > on workloads, it will require very sensitive tuning, so we disable the
> > feature by default (net.ipv4.tcp_child_ehash_entries == 0).  Moreover,
> > we can fall back to using the global ehash in case we fail to allocate
> > enough memory for a new ehash.  The maximum size is 16Mi, which is large
> > enough that even if we have 48Mi sockets, the average list length is 3,
> > and regression would be less than 1%.
> >
> > We can check the current ehash size by another read-only sysctl knob,
> > net.ipv4.tcp_ehash_entries.  A negative value means the netns shares
> > the global ehash (per-netns ehash is disabled or failed to allocate
> > memory).
> >
> >   # dmesg | cut -d ' ' -f 5- | grep "established hash"
> >   TCP established hash table entries: 524288 (order: 10, 4194304 bytes, vmalloc hugepage)
> >
> >   # sysctl net.ipv4.tcp_ehash_entries
> >   net.ipv4.tcp_ehash_entries = 524288  # can be changed by thash_entries
> >
> >   # sysctl net.ipv4.tcp_child_ehash_entries
> >   net.ipv4.tcp_child_ehash_entries = 0  # disabled by default
> >
> >   # ip netns add test1
> >   # ip netns exec test1 sysctl net.ipv4.tcp_ehash_entries
> >   net.ipv4.tcp_ehash_entries = -524288  # share the global ehash
> >
> >   # sysctl -w net.ipv4.tcp_child_ehash_entries=100
> >   net.ipv4.tcp_child_ehash_entries = 100
> >
> >   # ip netns add test2
> >   # ip netns exec test2 sysctl net.ipv4.tcp_ehash_entries
> >   net.ipv4.tcp_ehash_entries = 128  # own a per-netns ehash with 2^n buckets
> >
> > When more than two processes in the same netns create per-netns ehash
> > concurrently with different sizes, we need to guarantee the size in
> > one of the following ways:
> >
> >   1) Share the global ehash and create per-netns ehash
> >
> >   First, unshare() with tcp_child_ehash_entries==0.  It creates dedicated
> >   netns sysctl knobs where we can safely change tcp_child_ehash_entries
> >   and clone()/unshare() to create a per-netns ehash.
> >
> >   2) Control write on sysctl by BPF
> >
> >   We can use BPF_PROG_TYPE_CGROUP_SYSCTL to allow/deny read/write on
> >   sysctl knobs.
> >
> > Note the default values of two sysctl knobs depend on the ehash size and
> > should be tuned carefully:
> >
> >   tcp_max_tw_buckets  : tcp_child_ehash_entries / 2
> >   tcp_max_syn_backlog : max(128, tcp_child_ehash_entries / 128)
> >
> > As a bonus, we can dismantle netns faster.  Currently, while destroying
> > netns, we call inet_twsk_purge(), which walks through the global ehash.
> > It can be potentially big because it can have many sockets other than
> > TIME_WAIT in all netns.  Splitting ehash changes that situation, where
> > it's only necessary for inet_twsk_purge() to clean up TIME_WAIT sockets
> > in each netns.
> >
> > With regard to this, we do not free the per-netns ehash in inet_twsk_kill()
> > to avoid UAF while iterating the per-netns ehash in inet_twsk_purge().
> > Instead, we do it in tcp_sk_exit_batch() after calling tcp_twsk_purge() to
> > keep it protocol-family-independent.
> >
> > In the future, we could optimise ehash lookup/iteration further by removing
> > netns comparison for the per-netns ehash.
> >
> > Signed-off-by: Kuniyuki Iwashima <kuniyu@...zon.com>
> 
> ...
> 
> > diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> > index c440de998910..e94e1316fcc3 100644
> > --- a/net/ipv4/inet_hashtables.c
> > +++ b/net/ipv4/inet_hashtables.c
> > @@ -1145,3 +1145,60 @@ int inet_ehash_locks_alloc(struct inet_hashinfo *hashinfo)
> >         return 0;
> >  }
> >  EXPORT_SYMBOL_GPL(inet_ehash_locks_alloc);
> > +
> > +struct inet_hashinfo *inet_pernet_hashinfo_alloc(struct inet_hashinfo *hashinfo,
> > +                                                unsigned int ehash_entries)
> > +{
> > +       struct inet_hashinfo *new_hashinfo;
> > +       int i;
> > +
> > +       new_hashinfo = kmalloc(sizeof(*new_hashinfo), GFP_KERNEL);
> > +       if (!new_hashinfo)
> > +               goto err;
> > +
> > +       new_hashinfo->ehash = kvmalloc_array(ehash_entries,
> > +                                            sizeof(struct inet_ehash_bucket),
> > +                                            GFP_KERNEL_ACCOUNT);
> 
> Note that in current kernel,  init_net ehash table is using hugepages:
> 
> # dmesg | grep "TCP established hash table"
> [   17.512756] TCP established hash table entries: 524288 (order: 10,
> 4194304 bytes, vmalloc hugepage)
> 
> As this is very desirable, I would suggest using the following to
> avoid possible performance regression,
> especially for workload wanting a big ehash, as hinted by your changelog.
> 
> new_hashinfo->ehash = vmalloc_huge(ehash_entries * sizeof(struct
> inet_ehash_bucket), GFP_KERNEL_ACCOUNT);
> 
> (No overflow can happen in the multiply, as ehash_entries < 16M)

Do we need 'get_order(size) >= MAX_ORDER' check or just use it?
Due to the test in alloc_large_system_hash(), on a machine where the
calculted bucket size is not large enough, we don't use hugepages for
init_net.


> Another point is that on NUMA, init_net ehash table is spread over
> available NUMA nodes.
> 
> While net_pernet_hashinfo_alloc() will allocate pages depending on
> current process NUMA policy.
> 
> Maybe worth noting this in the changelog, because it is very possible
> that new nets
> is created with default NUMA policy, and depending on which cpu
> current thread is
> running, hash table will fully reside on a 'random' node, with very
> different performance
> results for highly optimized networking applications.

Sounds great!
But I'm not familiar with mm, so let me confirm a bit more.

It seems vmalloc_huge() always pass NUMA_NO_NODE to __vmalloc_node_range(),
so if we use vmalloc_huge(), the per-net ehash will be spread on each NUMA
nodes unless vmap_allow_huge is disabled in the kernel parameters, right?

Or, even if we use vmalloc_huge(), the ehash could be controlled by the
current process's NUMA policy?  (Sorry I'm not sure where the policy is
applied..)


> > +       if (!new_hashinfo->ehash)
> > +               goto free_hashinfo;
> > +
> > +       new_hashinfo->ehash_mask = ehash_entries - 1;
> > +
> > +       if (inet_ehash_locks_alloc(new_hashinfo))
> > +               goto free_ehash;
> > +
> > +       for (i = 0; i < ehash_entries; i++)
> > +               INIT_HLIST_NULLS_HEAD(&new_hashinfo->ehash[i].chain, i);
> > +
> > +       new_hashinfo->bind_bucket_cachep = hashinfo->bind_bucket_cachep;
> > +       new_hashinfo->bhash = hashinfo->bhash;
> > +       new_hashinfo->bind2_bucket_cachep = hashinfo->bind2_bucket_cachep;
> > +       new_hashinfo->bhash2 = hashinfo->bhash2;
> > +       new_hashinfo->bhash_size = hashinfo->bhash_size;
> > +
> > +       new_hashinfo->lhash2_mask = hashinfo->lhash2_mask;
> > +       new_hashinfo->lhash2 = hashinfo->lhash2;
> > +