Message-ID: <CAL+tcoAEPkyyvzULua_MUNQb=up_8Qqg+w3Oq6B9C1JS9gvdrQ@mail.gmail.com>
Date: Thu, 6 Mar 2025 14:35:00 +0800
From: Jason Xing <kerneljasonxing@...il.com>
To: Kuniyuki Iwashima <kuniyu@...zon.com>
Cc: davem@...emloft.net, edumazet@...gle.com, eric.dumazet@...il.com,
horms@...nel.org, kernelxing@...cent.com, kuba@...nel.org,
ncardwell@...gle.com, netdev@...r.kernel.org, pabeni@...hat.com
Subject: Re: [PATCH net-next] tcp: bring back NUMA dispersion in inet_ehash_locks_alloc()

On Thu, Mar 6, 2025 at 2:22 PM Kuniyuki Iwashima <kuniyu@...zon.com> wrote:
>
> From: Jason Xing <kerneljasonxing@...il.com>
> Date: Thu, 6 Mar 2025 12:59:03 +0800
> > On Thu, Mar 6, 2025 at 12:12 PM Kuniyuki Iwashima <kuniyu@...zon.com> wrote:
> > >
> > > From: Jason Xing <kerneljasonxing@...il.com>
> > > Date: Thu, 6 Mar 2025 11:35:27 +0800
> > > > On Wed, Mar 5, 2025 at 9:06 PM Eric Dumazet <edumazet@...gle.com> wrote:
> > > > >
> > > > > We have platforms with 6 NUMA nodes and 480 cpus.
> > > > >
> > > > > inet_ehash_locks_alloc() currently allocates a single 64KB page
> > > > > to hold all ehash spinlocks. This adds more pressure on a single node.
> > > > >
> > > > > Change inet_ehash_locks_alloc() to use vmalloc() to spread
> > > > > the spinlocks on all online nodes, driven by NUMA policies.
> > > > >
> > > > > At boot time, NUMA policy is interleave=all, meaning that
> > > > > tcp_hashinfo.ehash_locks gets hash dispersion on all nodes.
> > > > >
> > > > > Tested:
> > > > >
> > > > > lack5:~# grep inet_ehash_locks_alloc /proc/vmallocinfo
> > > > > 0x00000000d9aec4d1-0x00000000a828b652 69632 inet_ehash_locks_alloc+0x90/0x100 pages=16 vmalloc N0=2 N1=3 N2=3 N3=3 N4=3 N5=2
> > > > >
> > > > > lack5:~# echo 8192 >/proc/sys/net/ipv4/tcp_child_ehash_entries
> > > > > lack5:~# numactl --interleave=all unshare -n bash -c "grep inet_ehash_locks_alloc /proc/vmallocinfo"
> > > > > 0x000000004e99d30c-0x00000000763f3279 36864 inet_ehash_locks_alloc+0x90/0x100 pages=8 vmalloc N0=1 N1=2 N2=2 N3=1 N4=1 N5=1
> > > > > 0x00000000d9aec4d1-0x00000000a828b652 69632 inet_ehash_locks_alloc+0x90/0x100 pages=16 vmalloc N0=2 N1=3 N2=3 N3=3 N4=3 N5=2
> > > > >
> > > > > lack5:~# numactl --interleave=0,5 unshare -n bash -c "grep inet_ehash_locks_alloc /proc/vmallocinfo"
> > > > > 0x00000000fd73a33e-0x0000000004b9a177 36864 inet_ehash_locks_alloc+0x90/0x100 pages=8 vmalloc N0=4 N5=4
> > > > > 0x00000000d9aec4d1-0x00000000a828b652 69632 inet_ehash_locks_alloc+0x90/0x100 pages=16 vmalloc N0=2 N1=3 N2=3 N3=3 N4=3 N5=2
> > > > >
> > > > > lack5:~# echo 1024 >/proc/sys/net/ipv4/tcp_child_ehash_entries
> > > > > lack5:~# numactl --interleave=all unshare -n bash -c "grep inet_ehash_locks_alloc /proc/vmallocinfo"
> > > > > 0x00000000db07d7a2-0x00000000ad697d29 8192 inet_ehash_locks_alloc+0x90/0x100 pages=1 vmalloc N2=1
> > > > > 0x00000000d9aec4d1-0x00000000a828b652 69632 inet_ehash_locks_alloc+0x90/0x100 pages=16 vmalloc N0=2 N1=3 N2=3 N3=3 N4=3 N5=2
> > > > >
> > > > > Signed-off-by: Eric Dumazet <edumazet@...gle.com>
> > > >
> > > > Tested-by: Jason Xing <kerneljasonxing@...il.com>
> > >
> > > Reviewed-by: Kuniyuki Iwashima <kuniyu@...zon.com>
> > >
> > >
> > > >
> > > > > ---
> > > > > net/ipv4/inet_hashtables.c | 37 ++++++++++++++++++++++++++-----------
> > > > > 1 file changed, 26 insertions(+), 11 deletions(-)
> > > > >
> > > > > diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> > > > > index 9bfcfd016e18275fb50fea8d77adc8a64fb12494..2b4a588247639e0c7b2e70d1fc9b3b9b60256ef7 100644
> > > > > --- a/net/ipv4/inet_hashtables.c
> > > > > +++ b/net/ipv4/inet_hashtables.c
> > > > > @@ -1230,22 +1230,37 @@ int inet_ehash_locks_alloc(struct inet_hashinfo *hashinfo)
> > > > > {
> > > > > unsigned int locksz = sizeof(spinlock_t);
> > > > > unsigned int i, nblocks = 1;
> > > > > + spinlock_t *ptr = NULL;
> > > > >
> > > > > - if (locksz != 0) {
> > > > > - /* allocate 2 cache lines or at least one spinlock per cpu */
> > > > > - nblocks = max(2U * L1_CACHE_BYTES / locksz, 1U);
> > > > > - nblocks = roundup_pow_of_two(nblocks * num_possible_cpus());
> > > > > + if (locksz == 0)
> > > > > + goto set_mask;
> > > > >
> > > > > - /* no more locks than number of hash buckets */
> > > > > - nblocks = min(nblocks, hashinfo->ehash_mask + 1);
> > > > > + /* Allocate 2 cache lines or at least one spinlock per cpu. */
> > > > > + nblocks = max(2U * L1_CACHE_BYTES / locksz, 1U) * num_possible_cpus();
> > > > >
> > > > > - hashinfo->ehash_locks = kvmalloc_array(nblocks, locksz, GFP_KERNEL);
> > > > > - if (!hashinfo->ehash_locks)
> > > > > - return -ENOMEM;
> > > > > + /* At least one page per NUMA node. */
> > > > > + nblocks = max(nblocks, num_online_nodes() * PAGE_SIZE / locksz);
> > > > > +
> > > > > + nblocks = roundup_pow_of_two(nblocks);
> > > > > +
> > > > > + /* No more locks than number of hash buckets. */
> > > > > + nblocks = min(nblocks, hashinfo->ehash_mask + 1);
> > > > >
> > > > > - for (i = 0; i < nblocks; i++)
> > > > > - spin_lock_init(&hashinfo->ehash_locks[i]);
> > > > > + if (num_online_nodes() > 1) {
> > > > > + /* Use vmalloc() to allow NUMA policy to spread pages
> > > > > + * on all available nodes if desired.
> > > > > + */
> > > > > + ptr = vmalloc_array(nblocks, locksz);
> > > >
> > > > I wonder, if a memory shortage occurs at this point, whether it
> > > > is necessary to fall back to kvmalloc() later.
> > >
> > > If ptr is NULL here, kvmalloc_array() is called below.
> >
> > My point is: why not return -ENOMEM directly? Otherwise it looks meaningless.
> >
>
> Ah, I misread. I'm not sure how likely such a case is, but I think
> vmalloc() and kmalloc() failures do not always correlate: the former
> uses node_alloc() and the latter uses the page allocator.
Sure, it is unlikely to happen.

As to memory allocation, we usually try kmalloc() for allocations
smaller than a page and vmalloc() for larger ones. The same logic can
be seen in kvmalloc(): it tries kmalloc() first, then falls back to
vmalloc(). Since we have already failed to allocate non-contiguous
memory here, there is no need to try kvmalloc() (which would call
kmalloc() and vmalloc() one more round).
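
To make this concrete, here is a minimal standalone sketch of the
fallback pattern being discussed; alloc_lock_array() is only an
illustrative helper name, not the function from the patch:

#include <linux/nodemask.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/vmalloc.h>

/* Illustrative only: prefer vmalloc_array() on multi-node systems so
 * the current NUMA mempolicy can spread the backing pages, then fall
 * back to kvmalloc_array(), which itself tries kmalloc() first and
 * vmalloc() second.
 */
static spinlock_t *alloc_lock_array(unsigned int nblocks)
{
	spinlock_t *ptr = NULL;
	unsigned int i;

	if (num_online_nodes() > 1)
		ptr = vmalloc_array(nblocks, sizeof(spinlock_t));

	if (!ptr) {
		ptr = kvmalloc_array(nblocks, sizeof(spinlock_t), GFP_KERNEL);
		if (!ptr)
			return NULL;
	}

	for (i = 0; i < nblocks; i++)
		spin_lock_init(&ptr[i]);

	return ptr;
}

As the sketch shows, a vmalloc_array() failure still gives kmalloc() a
chance via kvmalloc_array(); only the second vmalloc() attempt inside
kvmalloc() would be redundant.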
Thanks,
Jason