Message-ID: <CAL+tcoBh4LrTqp6KA2DqwamP5e863vQfnTA8KrZ15+mBnPS7dQ@mail.gmail.com>
Date: Thu, 6 Mar 2025 16:50:00 +0800
From: Jason Xing <kerneljasonxing@...il.com>
To: Eric Dumazet <edumazet@...gle.com>
Cc: Kuniyuki Iwashima <kuniyu@...zon.com>, davem@...emloft.net, eric.dumazet@...il.com, 
	horms@...nel.org, kernelxing@...cent.com, kuba@...nel.org, 
	ncardwell@...gle.com, netdev@...r.kernel.org, pabeni@...hat.com
Subject: Re: [PATCH net-next] tcp: bring back NUMA dispersion in inet_ehash_locks_alloc()

On Thu, Mar 6, 2025 at 4:18 PM Eric Dumazet <edumazet@...gle.com> wrote:
>
> On Thu, Mar 6, 2025 at 9:04 AM Jason Xing <kerneljasonxing@...il.com> wrote:
> >
> > On Thu, Mar 6, 2025 at 3:26 PM Eric Dumazet <edumazet@...gle.com> wrote:
> > >
> > > On Thu, Mar 6, 2025 at 7:35 AM Jason Xing <kerneljasonxing@...il.com> wrote:
> > > >
> > > > On Thu, Mar 6, 2025 at 2:22 PM Kuniyuki Iwashima <kuniyu@...zon.com> wrote:
> > > > >
> > > > > From: Jason Xing <kerneljasonxing@...il.com>
> > > > > Date: Thu, 6 Mar 2025 12:59:03 +0800
> > > > > > On Thu, Mar 6, 2025 at 12:12 PM Kuniyuki Iwashima <kuniyu@...zon.com> wrote:
> > > > > > >
> > > > > > > From: Jason Xing <kerneljasonxing@...il.com>
> > > > > > > Date: Thu, 6 Mar 2025 11:35:27 +0800
> > > > > > > > On Wed, Mar 5, 2025 at 9:06 PM Eric Dumazet <edumazet@...gle.com> wrote:
> > > > > > > > >
> > > > > > > > > We have platforms with 6 NUMA nodes and 480 cpus.
> > > > > > > > >
> > > > > > > > > inet_ehash_locks_alloc() currently allocates a single 64KB page
> > > > > > > > > to hold all ehash spinlocks. This adds more pressure on a single node.
> > > > > > > > >
> > > > > > > > > Change inet_ehash_locks_alloc() to use vmalloc() to spread
> > > > > > > > > the spinlocks on all online nodes, driven by NUMA policies.
> > > > > > > > >
> > > > > > > > > At boot time, NUMA policy is interleave=all, meaning that
> > > > > > > > > tcp_hashinfo.ehash_locks gets hash dispersion on all nodes.
> > > > > > > > >
> > > > > > > > > Tested:
> > > > > > > > >
> > > > > > > > > lack5:~# grep inet_ehash_locks_alloc /proc/vmallocinfo
> > > > > > > > > 0x00000000d9aec4d1-0x00000000a828b652   69632 inet_ehash_locks_alloc+0x90/0x100 pages=16 vmalloc N0=2 N1=3 N2=3 N3=3 N4=3 N5=2
> > > > > > > > >
> > > > > > > > > lack5:~# echo 8192 >/proc/sys/net/ipv4/tcp_child_ehash_entries
> > > > > > > > > lack5:~# numactl --interleave=all unshare -n bash -c "grep inet_ehash_locks_alloc /proc/vmallocinfo"
> > > > > > > > > 0x000000004e99d30c-0x00000000763f3279   36864 inet_ehash_locks_alloc+0x90/0x100 pages=8 vmalloc N0=1 N1=2 N2=2 N3=1 N4=1 N5=1
> > > > > > > > > 0x00000000d9aec4d1-0x00000000a828b652   69632 inet_ehash_locks_alloc+0x90/0x100 pages=16 vmalloc N0=2 N1=3 N2=3 N3=3 N4=3 N5=2
> > > > > > > > >
> > > > > > > > > lack5:~# numactl --interleave=0,5 unshare -n bash -c "grep inet_ehash_locks_alloc /proc/vmallocinfo"
> > > > > > > > > 0x00000000fd73a33e-0x0000000004b9a177   36864 inet_ehash_locks_alloc+0x90/0x100 pages=8 vmalloc N0=4 N5=4
> > > > > > > > > 0x00000000d9aec4d1-0x00000000a828b652   69632 inet_ehash_locks_alloc+0x90/0x100 pages=16 vmalloc N0=2 N1=3 N2=3 N3=3 N4=3 N5=2
> > > > > > > > >
> > > > > > > > > lack5:~# echo 1024 >/proc/sys/net/ipv4/tcp_child_ehash_entries
> > > > > > > > > lack5:~# numactl --interleave=all unshare -n bash -c "grep inet_ehash_locks_alloc /proc/vmallocinfo"
> > > > > > > > > 0x00000000db07d7a2-0x00000000ad697d29    8192 inet_ehash_locks_alloc+0x90/0x100 pages=1 vmalloc N2=1
> > > > > > > > > 0x00000000d9aec4d1-0x00000000a828b652   69632 inet_ehash_locks_alloc+0x90/0x100 pages=16 vmalloc N0=2 N1=3 N2=3 N3=3 N4=3 N5=2
> > > > > > > > >
> > > > > > > > > Signed-off-by: Eric Dumazet <edumazet@...gle.com>
> > > > > > > >
> > > > > > > > Tested-by: Jason Xing <kerneljasonxing@...il.com>
> > > > > > >
> > > > > > > Reviewed-by: Kuniyuki Iwashima <kuniyu@...zon.com>
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > > ---
> > > > > > > > >  net/ipv4/inet_hashtables.c | 37 ++++++++++++++++++++++++++-----------
> > > > > > > > >  1 file changed, 26 insertions(+), 11 deletions(-)
> > > > > > > > >
> > > > > > > > > diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> > > > > > > > > index 9bfcfd016e18275fb50fea8d77adc8a64fb12494..2b4a588247639e0c7b2e70d1fc9b3b9b60256ef7 100644
> > > > > > > > > --- a/net/ipv4/inet_hashtables.c
> > > > > > > > > +++ b/net/ipv4/inet_hashtables.c
> > > > > > > > > @@ -1230,22 +1230,37 @@ int inet_ehash_locks_alloc(struct inet_hashinfo *hashinfo)
> > > > > > > > >  {
> > > > > > > > >         unsigned int locksz = sizeof(spinlock_t);
> > > > > > > > >         unsigned int i, nblocks = 1;
> > > > > > > > > +       spinlock_t *ptr = NULL;
> > > > > > > > >
> > > > > > > > > -       if (locksz != 0) {
> > > > > > > > > -               /* allocate 2 cache lines or at least one spinlock per cpu */
> > > > > > > > > -               nblocks = max(2U * L1_CACHE_BYTES / locksz, 1U);
> > > > > > > > > -               nblocks = roundup_pow_of_two(nblocks * num_possible_cpus());
> > > > > > > > > +       if (locksz == 0)
> > > > > > > > > +               goto set_mask;
> > > > > > > > >
> > > > > > > > > -               /* no more locks than number of hash buckets */
> > > > > > > > > -               nblocks = min(nblocks, hashinfo->ehash_mask + 1);
> > > > > > > > > +       /* Allocate 2 cache lines or at least one spinlock per cpu. */
> > > > > > > > > +       nblocks = max(2U * L1_CACHE_BYTES / locksz, 1U) * num_possible_cpus();
> > > > > > > > >
> > > > > > > > > -               hashinfo->ehash_locks = kvmalloc_array(nblocks, locksz, GFP_KERNEL);
> > > > > > > > > -               if (!hashinfo->ehash_locks)
> > > > > > > > > -                       return -ENOMEM;
> > > > > > > > > +       /* At least one page per NUMA node. */
> > > > > > > > > +       nblocks = max(nblocks, num_online_nodes() * PAGE_SIZE / locksz);
> > > > > > > > > +
> > > > > > > > > +       nblocks = roundup_pow_of_two(nblocks);
> > > > > > > > > +
> > > > > > > > > +       /* No more locks than number of hash buckets. */
> > > > > > > > > +       nblocks = min(nblocks, hashinfo->ehash_mask + 1);
> > > > > > > > >
> > > > > > > > > -               for (i = 0; i < nblocks; i++)
> > > > > > > > > -                       spin_lock_init(&hashinfo->ehash_locks[i]);
> > > > > > > > > +       if (num_online_nodes() > 1) {
> > > > > > > > > +               /* Use vmalloc() to allow NUMA policy to spread pages
> > > > > > > > > +                * on all available nodes if desired.
> > > > > > > > > +                */
> > > > > > > > > +               ptr = vmalloc_array(nblocks, locksz);
> > > > > > > >
> > > > > > > > I wonder: if a memory shortage occurs at this point, is it
> > > > > > > > necessary to fall back to kvmalloc() later?
> > > > > > >
> > > > > > > If ptr is NULL here, kvmalloc_array() is called below.
> > > > > >
> > > > > > My point is: why not return -ENOMEM directly? Otherwise it looks meaningless.
> > > > > >
> > > > >
> > > > > Ah, I misread.  I'm not sure how likely such a case is, but I
> > > > > think vmalloc() and kmalloc() failures do not always correlate: the
> > > > > former uses node_alloc() and the latter uses the page allocator.
> > > >
> > > > Sure, it is unlikely to happen.
> > > >
> > > > As to memory allocation, we usually try kmalloc() for allocations
> > > > smaller than a page and vmalloc() for larger ones. The same logic
> > > > can be seen in kvmalloc(): try kmalloc() first, then fall back to
> > > > vmalloc(). Since we already failed to allocate non-contiguous
> > > > memory, there is no need to try kvmalloc() (which would call
> > > > kmalloc() and vmalloc() one more time).
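> > > >
> > > > Roughly, the logic inside kvmalloc() looks like this (a simplified
> > > > sketch of the idea, not the exact mm/util.c code):
> > > >
> > > >        void *kvmalloc_sketch(size_t size, gfp_t flags)
> > > >        {
> > > >                void *ret;
> > > >
> > > >                /* Try physically contiguous memory first, without
> > > >                 * warning or retrying too hard on failure.
> > > >                 */
> > > >                ret = kmalloc(size, flags | __GFP_NOWARN | __GFP_NORETRY);
> > > >                if (ret || size <= PAGE_SIZE)
> > > >                        return ret;
> > > >
> > > >                /* Fall back to non-contiguous memory. */
> > > >                return vmalloc(size);
> > > >        }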
> > >
> > > I chose to not add code, because:
> > >
> > >        if (num_online_nodes() > 1) {
> > >                /* Use vmalloc() to allow NUMA policy to spread pages
> > >                 * on all available nodes if desired.
> > >                 */
> > >                ptr = vmalloc_array(nblocks, locksz);
> > >
> > > << adding a test here is pointless, we already have correct code if
> > > ptr == NULL >>
> > >
> > >        }
> > >        if (!ptr) {
> > >                ptr = kvmalloc_array(nblocks, locksz, GFP_KERNEL);
> > >                if (!ptr)
> > >                        return -ENOMEM;
> > >         }
> > >
> > >
> > > Sure, this could be written in a different way, but ultimately it is a
> > > matter of taste.
> >
> > Sorry that I didn't make myself clear enough. I mean that if
> > vmalloc_array() fails, the code falls back to kvmalloc_array(),
> > which will call kmalloc() or even vmalloc() to allocate memory
> > again. My intention is to return an error code as soon as the
> > first allocation fails.
> >
> > Code like this on top of your patch:
> > diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> > index 3edbe2dad8ba..d026918319d2 100644
> > --- a/net/ipv4/inet_hashtables.c
> > +++ b/net/ipv4/inet_hashtables.c
> > @@ -1282,12 +1282,12 @@ int inet_ehash_locks_alloc(struct inet_hashinfo *hashinfo)
> >                  * on all available nodes if desired.
> >                  */
> >                 ptr = vmalloc_array(nblocks, locksz);
> > -       }
> > -       if (!ptr) {
> > +       } else {
> >                 ptr = kvmalloc_array(nblocks, locksz, GFP_KERNEL);
> > -               if (!ptr)
> > -                       return -ENOMEM;
> >         }
> > +       if (!ptr)
> > +               return -ENOMEM;
> > +
>
> I think I understood pretty well, and said it was a matter of taste.
>
> I wish we could move on to more interesting stuff, like the main ehash
> table, which currently uses 2 huge pages; they can both land on one
> physical socket.

Interesting. I will dig into it after next week, since netdev is
taking place then :)

Before this, I tried to accelerate skb transmission on a machine with
four NUMA nodes, e.g. by allocating skbs on the local NUMA node, but
it didn't improve throughput. Glad to know you're running various
challenging tests.

Thanks,
Jason
