[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <018b5652-8006-471d-94d0-d230e4aeef6d@amperemail.onmicrosoft.com>
Date: Wed, 28 Feb 2024 15:05:58 +0800
From: Shijie Huang <shijie@...eremail.onmicrosoft.com>
To: Eric Dumazet <edumazet@...gle.com>,
Huang Shijie <shijie@...amperecomputing.com>
Cc: kuba@...nel.org, patches@...erecomputing.com, davem@...emloft.net,
horms@...nel.org, ast@...nel.org, dhowells@...hat.com,
linyunsheng@...wei.com, aleksander.lobakin@...el.com,
linux-kernel@...r.kernel.org, netdev@...r.kernel.org,
cl@...amperecomputing.com
Subject: Re: [PATCH v2] net: skbuff: set FLAG_SKB_NO_MERGE for
skbuff_fclone_cache
在 2024/2/27 20:55, Eric Dumazet 写道:
> On Tue, Feb 27, 2024 at 7:29 AM Huang Shijie
> <shijie@...amperecomputing.com> wrote:
>> Since we do not set FLAG_SKB_NO_MERGE for skbuff_fclone_cache,
>> the current skbuff_fclone_cache maybe not really allocated, it maybe
>> used an exist old kmem_cache. In NUMA, the fclone allocated by
>> alloc_skb_fclone() maybe in remote node.
> Why is this happening in the first place ? Whab about skb->head ?
I tested the fclone firstly. I did not test others yet.
I did not check the skb->head yet.
But I ever checked the pfrag's page, it is okay.
>
> Jesper patch [1] motivation was not about NUMA., but about
> fragmentation and bulk allocations/freeing.
>
> TCP fclones are not bulk allocated/freed, so I do not understand what
> your patch is doing.
> You need to give more details, and experimental results.
1.) My NUMA machine:
node 0 (CPU 0 ~ CPU79):
CPU 0 ~ CPU 39 are used as memcached's server
CPU 40 ~ CPU 79 are used as memcached's client
node 1 (CPU 80 ~ CPU160):
CPU 80 ~ CPU 119 are used as memcached's server
CPU 120 ~ CPU 179 are used as memcached's client
the kernel is linux-next 20240227
2.) My private patches:
patch 1 is for slub:
---
mm/slub.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/mm/slub.c b/mm/slub.c
index 5d838ebfa35e..d2ab1e36fd6b 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -5691,6 +5691,7 @@ __kmem_cache_alias(const char *name, unsigned int
size, unsigned int align,
s = find_mergeable(size, align, flags, name, ctor);
if (s) {
+ printk("[%s] origin:%s, shared :%s\n", __func__, name,
s->name);
if (sysfs_slab_alias(s, name))
return NULL;
---------
This patch is used the check which is the sharing kmem_cache for
"skbuff_fclone_cache".
I cannot find the "skbuff_fclone_cache" in /proc/slabinfo.
From my test, the "pool_workqueue" is the real working kmem_cache.
The "skbuff_fclone_cache" is just a pointer to "pool_workqueue"
(pwq_cache).
The following private patch is used to record the fclone allocation:
---
net/ipv4/tcp.c | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index c82dc42f57c6..6f31ddcfc017 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -864,6 +864,24 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t
*ppos,
}
EXPORT_SYMBOL(tcp_splice_read);
+unsigned long right_num, wrong_num;
+static void check_fclone(struct sk_buff *skb)
+{
+ int n = numa_mem_id(); /* current node */
+ int node2 = page_to_nid(virt_to_page((unsigned long) skb));
+
+ if (n != node2) {
+ wrong_num++;
+ if ((wrong_num % 1000) == 999)
+ printk(KERN_DEBUG "[%s] current:%d, get from
:%d, (%ld, %ld, %ld)\n",
+ __func__, n, node2, wrong_num,
right_num, wrong_num * 100 / (wrong_num + right_num));
+ } else {
+ right_num++;
+ if ((right_num % 1000000) == 9999)
+ printk("[%s] we received :%ld, %ld\n", __func__,
right_num, wrong_num);
+ }
+}
+
struct sk_buff *tcp_stream_alloc_skb(struct sock *sk, gfp_t gfp,
bool force_schedule)
{
@@ -884,6 +902,7 @@ struct sk_buff *tcp_stream_alloc_skb(struct sock
*sk, gfp_t gfp,
skb_reserve(skb, MAX_TCP_HEADER);
skb->ip_summed = CHECKSUM_PARTIAL;
INIT_LIST_HEAD(&skb->tcp_tsorted_anchor);
+ check_fclone(skb);
return skb;
}
__kfree_skb(skb);
--
Without this v2 patch, I can get the result after the memcached test:
[ 1027.317645] [check_fclone] current:0, get from :1, (7112999, 9076711, 43)
[ 1027.317653] [check_fclone] current:0, get from :1, (7112999, 9076707, 43)
[ 1027.804110] [check_fclone] we received :10009999, 7113326
It means nearly 43% fclone is allocated in the remote node.
With this v2 patch, I can find the "skbuff_fclone_cache" in
/proc/slabinfo.
The test result shows below:
[ 503.357293] [check_fclone] we received :8009999, 0
[ 503.357293] [check_fclone] we received :8009999, 0
[ 503.357305] [check_fclone] we received :8009999, 0
After v2 patch, I cannot see the wrong fclone in remote node.
>
> Using SLAB_NO_MERGE does not help, I am still seeing wrong allocations
> on a dual socket
> host with plenty of available memory.
> (either sk_buff or skb->head being allocated on the other node).
Do you mean you still can see the wrong fclone after using SLAB_NO_MERGE?
If so, I guess there is bug in the slub.
> fclones might be allocated from a cpu running on node A, and freed
> from a cpu running on node B.
> Maybe SLUB is not properly handling this case ?
Maybe.
> SLAB_NO_MERGE will avoid merging fclone with kmalloc-512, it does not
> really help.
>
> I think we need help from mm/slub experts, instead of trying to 'fix'
> networking stacks.
@Christopher
Any idea about this?
Thanks
Huang Shijie
> Perhaps we could augment trace_kmem_cache_alloc() to record/print the nodes
> of the allocated chunk (we already have the cpu number giving us the
> local node).
> That would give us more confidence on any fixes.
>
> BTW SLUB is gone, time to remove FLAG_SKB_NO_MERGE and simply use SLAB_NO_MERGE
>
> [1]
> commit 0a0643164da4a1976455aa12f0a96d08ee290752
> Author: Jesper Dangaard Brouer <hawk@...nel.org>
> Date: Tue Aug 15 17:17:36 2023 +0200
>
> net: use SLAB_NO_MERGE for kmem_cache skbuff_head_cache
>
>
>
>> So set FLAG_SKB_NO_MERGE for skbuff_fclone_cache to fix it.
>>
>> Signed-off-by: Huang Shijie <shijie@...amperecomputing.com>
>> ---
>> v1 --> v2:
>> set FLAG_SKB_NO_MERGE for skbuff_fclone_cache in initialization.
>>
>> v1: https://lkml.org/lkml/2024/2/20/121
>> ---
>> net/core/skbuff.c | 3 ++-
>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>> index 1f918e602bc4..5e3e130fb57a 100644
>> --- a/net/core/skbuff.c
>> +++ b/net/core/skbuff.c
>> @@ -5013,7 +5013,8 @@ void __init skb_init(void)
>> skbuff_fclone_cache = kmem_cache_create("skbuff_fclone_cache",
>> sizeof(struct sk_buff_fclones),
>> 0,
>> - SLAB_HWCACHE_ALIGN|SLAB_PANIC,
>> + SLAB_HWCACHE_ALIGN|SLAB_PANIC|
>> + FLAG_SKB_NO_MERGE,
>> NULL);
>> /* usercopy should only access first SKB_SMALL_HEAD_HEADROOM bytes.
>> * struct skb_shared_info is located at the end of skb->head,
>> --
>> 2.40.1
>>
Powered by blists - more mailing lists