linux-kernel - Re: [PATCH v2] net: skbuff: set FLAG_SKB_NO_MERGE for skbuff_fclone

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <018b5652-8006-471d-94d0-d230e4aeef6d@amperemail.onmicrosoft.com>
Date: Wed, 28 Feb 2024 15:05:58 +0800
From: Shijie Huang <shijie@...eremail.onmicrosoft.com>
To: Eric Dumazet <edumazet@...gle.com>,
 Huang Shijie <shijie@...amperecomputing.com>
Cc: kuba@...nel.org, patches@...erecomputing.com, davem@...emloft.net,
 horms@...nel.org, ast@...nel.org, dhowells@...hat.com,
 linyunsheng@...wei.com, aleksander.lobakin@...el.com,
 linux-kernel@...r.kernel.org, netdev@...r.kernel.org,
 cl@...amperecomputing.com
Subject: Re: [PATCH v2] net: skbuff: set FLAG_SKB_NO_MERGE for
 skbuff_fclone_cache


在 2024/2/27 20:55, Eric Dumazet 写道:
> On Tue, Feb 27, 2024 at 7:29 AM Huang Shijie
> <shijie@...amperecomputing.com> wrote:
>> Since we do not set FLAG_SKB_NO_MERGE for skbuff_fclone_cache,
>> the current skbuff_fclone_cache maybe not really allocated, it maybe
>> used an exist old kmem_cache. In NUMA, the fclone allocated by
>> alloc_skb_fclone() maybe in remote node.
> Why is this happening in the first place ? Whab about skb->head ?

I tested the fclone firstly. I did not test others yet.

I did not check the skb->head yet.

But I ever checked the pfrag's page, it is okay.


>
> Jesper patch [1] motivation was not about NUMA., but about
> fragmentation and bulk allocations/freeing.
>
> TCP fclones are not bulk allocated/freed, so I do not understand what
> your patch is doing.
> You need to give more details, and experimental results.

1.) My NUMA machine:

       node 0 (CPU 0 ~ CPU79):

                      CPU 0 ~  CPU 39 are used as memcached's server

                     CPU 40 ~  CPU 79 are used as memcached's client

       node 1 (CPU 80 ~ CPU160):

                      CPU 80 ~  CPU 119 are used as memcached's server

                     CPU 120 ~  CPU 179 are used as memcached's client

    the kernel is linux-next 20240227


  2.) My private patches:

       patch 1 is for slub:

       ---
  mm/slub.c | 1 +
  1 file changed, 1 insertion(+)

diff --git a/mm/slub.c b/mm/slub.c
index 5d838ebfa35e..d2ab1e36fd6b 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -5691,6 +5691,7 @@ __kmem_cache_alias(const char *name, unsigned int 
size, unsigned int align,

         s = find_mergeable(size, align, flags, name, ctor);
         if (s) {
+               printk("[%s] origin:%s, shared :%s\n", __func__, name, 
s->name);
                 if (sysfs_slab_alias(s, name))
                         return NULL;

---------

   This patch is used the check which is the sharing kmem_cache for 
"skbuff_fclone_cache".

   I cannot find the "skbuff_fclone_cache" in /proc/slabinfo.

   From my test, the "pool_workqueue" is the real working kmem_cache.

   The "skbuff_fclone_cache" is just a pointer to "pool_workqueue" 
(pwq_cache).


   The following private patch is used to record the fclone allocation:

  ---
  net/ipv4/tcp.c | 19 +++++++++++++++++++
  1 file changed, 19 insertions(+)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index c82dc42f57c6..6f31ddcfc017 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -864,6 +864,24 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t 
*ppos,
  }
  EXPORT_SYMBOL(tcp_splice_read);

+unsigned long right_num, wrong_num;
+static void check_fclone(struct sk_buff *skb)
+{
+       int n = numa_mem_id(); /* current node */
+       int node2 = page_to_nid(virt_to_page((unsigned long) skb));
+
+       if (n != node2) {
+               wrong_num++;
+               if ((wrong_num % 1000) == 999)
+                       printk(KERN_DEBUG "[%s] current:%d, get from 
:%d, (%ld, %ld, %ld)\n",
+                               __func__, n, node2, wrong_num, 
right_num, wrong_num * 100 / (wrong_num + right_num));
+       } else {
+               right_num++;
+               if ((right_num % 1000000) == 9999)
+                       printk("[%s] we received :%ld, %ld\n", __func__, 
right_num, wrong_num);
+       }
+}
+
  struct sk_buff *tcp_stream_alloc_skb(struct sock *sk, gfp_t gfp,
                                      bool force_schedule)
  {
@@ -884,6 +902,7 @@ struct sk_buff *tcp_stream_alloc_skb(struct sock 
*sk, gfp_t gfp,
                         skb_reserve(skb, MAX_TCP_HEADER);
                         skb->ip_summed = CHECKSUM_PARTIAL;
INIT_LIST_HEAD(&skb->tcp_tsorted_anchor);
+                       check_fclone(skb);
                         return skb;
                 }
                 __kfree_skb(skb);
--

   Without this v2 patch, I can get the result after the memcached test:

[ 1027.317645] [check_fclone] current:0, get from :1, (7112999, 9076711, 43)
[ 1027.317653] [check_fclone] current:0, get from :1, (7112999, 9076707, 43)
[ 1027.804110] [check_fclone] we received :10009999, 7113326

It means nearly 43% fclone is allocated in the remote node.


  With this v2 patch,  I can find the "skbuff_fclone_cache" in 
/proc/slabinfo.

The test result shows below:

[  503.357293] [check_fclone] we received :8009999, 0
[  503.357293] [check_fclone] we received :8009999, 0
[  503.357305] [check_fclone] we received :8009999, 0

After v2 patch, I cannot see the wrong fclone in remote node.


>
> Using SLAB_NO_MERGE does not help, I am still seeing wrong allocations
> on a dual socket
> host with plenty of available memory.
> (either sk_buff or skb->head being allocated on the other node).

Do you mean you still can see the wrong fclone after using SLAB_NO_MERGE?

If so, I guess there is bug in the slub.


> fclones might be allocated from a cpu running on node A, and freed
> from a cpu running on node B.
> Maybe SLUB is not properly handling this case ?

Maybe.



> SLAB_NO_MERGE will avoid merging fclone with kmalloc-512, it does not
> really help.
>
> I think we need help from mm/slub experts, instead of trying to 'fix'
> networking stacks.

@Christopher

Any idea about this?

Thanks

Huang Shijie

> Perhaps we could augment trace_kmem_cache_alloc() to record/print the nodes
> of the allocated chunk (we already have the cpu number giving us the
> local node).
> That would give us more confidence on any fixes.
>
> BTW SLUB is gone, time to remove FLAG_SKB_NO_MERGE and simply use SLAB_NO_MERGE
>
> [1]
> commit 0a0643164da4a1976455aa12f0a96d08ee290752
> Author: Jesper Dangaard Brouer <hawk@...nel.org>
> Date:   Tue Aug 15 17:17:36 2023 +0200
>
>      net: use SLAB_NO_MERGE for kmem_cache skbuff_head_cache
>
>
>
>> So set FLAG_SKB_NO_MERGE for skbuff_fclone_cache to fix it.
>>
>> Signed-off-by: Huang Shijie <shijie@...amperecomputing.com>
>> ---
>> v1 --> v2:
>>         set FLAG_SKB_NO_MERGE for skbuff_fclone_cache in initialization.
>>
>> v1: https://lkml.org/lkml/2024/2/20/121
>> ---
>>   net/core/skbuff.c | 3 ++-
>>   1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>> index 1f918e602bc4..5e3e130fb57a 100644
>> --- a/net/core/skbuff.c
>> +++ b/net/core/skbuff.c
>> @@ -5013,7 +5013,8 @@ void __init skb_init(void)
>>          skbuff_fclone_cache = kmem_cache_create("skbuff_fclone_cache",
>>                                                  sizeof(struct sk_buff_fclones),
>>                                                  0,
>> -                                               SLAB_HWCACHE_ALIGN|SLAB_PANIC,
>> +                                               SLAB_HWCACHE_ALIGN|SLAB_PANIC|
>> +                                               FLAG_SKB_NO_MERGE,
>>                                                  NULL);
>>          /* usercopy should only access first SKB_SMALL_HEAD_HEADROOM bytes.
>>           * struct skb_shared_info is located at the end of skb->head,
>> --
>> 2.40.1
>>