netdev - Re: [PATCH v2] net: skbuff: set FLAG_SKB_NO_MERGE for skbuff_fclone

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CANn89i+NudPMsKDswRfx3R=opyAKBTw+nxrpkMZTQsgnOGJ_DA@mail.gmail.com>
Date: Wed, 28 Feb 2024 10:38:37 +0100
From: Eric Dumazet <edumazet@...gle.com>
To: Shijie Huang <shijie@...eremail.onmicrosoft.com>
Cc: Huang Shijie <shijie@...amperecomputing.com>, kuba@...nel.org, 
	patches@...erecomputing.com, davem@...emloft.net, horms@...nel.org, 
	ast@...nel.org, dhowells@...hat.com, linyunsheng@...wei.com, 
	aleksander.lobakin@...el.com, linux-kernel@...r.kernel.org, 
	netdev@...r.kernel.org, cl@...amperecomputing.com
Subject: Re: [PATCH v2] net: skbuff: set FLAG_SKB_NO_MERGE for skbuff_fclone_cache

On Wed, Feb 28, 2024 at 8:06 AM Shijie Huang
<shijie@...eremail.onmicrosoft.com> wrote:
>
>
> 在 2024/2/27 20:55, Eric Dumazet 写道:
> > On Tue, Feb 27, 2024 at 7:29 AM Huang Shijie
> > <shijie@...amperecomputing.com> wrote:
> >> Since we do not set FLAG_SKB_NO_MERGE for skbuff_fclone_cache,
> >> the current skbuff_fclone_cache maybe not really allocated, it maybe
> >> used an exist old kmem_cache. In NUMA, the fclone allocated by
> >> alloc_skb_fclone() maybe in remote node.
> > Why is this happening in the first place ? Whab about skb->head ?
>
> I tested the fclone firstly. I did not test others yet.
>
> I did not check the skb->head yet.
>
> But I ever checked the pfrag's page, it is okay.
>
>
> >
> > Jesper patch [1] motivation was not about NUMA., but about
> > fragmentation and bulk allocations/freeing.
> >
> > TCP fclones are not bulk allocated/freed, so I do not understand what
> > your patch is doing.
> > You need to give more details, and experimental results.
>
> 1.) My NUMA machine:
>
>        node 0 (CPU 0 ~ CPU79):
>
>                       CPU 0 ~  CPU 39 are used as memcached's server
>
>                      CPU 40 ~  CPU 79 are used as memcached's client
>
>        node 1 (CPU 80 ~ CPU160):
>
>                       CPU 80 ~  CPU 119 are used as memcached's server
>
>                      CPU 120 ~  CPU 179 are used as memcached's client
>
>     the kernel is linux-next 20240227
>
>
>   2.) My private patches:
>
>        patch 1 is for slub:
>
>        ---
>   mm/slub.c | 1 +
>   1 file changed, 1 insertion(+)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 5d838ebfa35e..d2ab1e36fd6b 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -5691,6 +5691,7 @@ __kmem_cache_alias(const char *name, unsigned int
> size, unsigned int align,
>
>          s = find_mergeable(size, align, flags, name, ctor);
>          if (s) {
> +               printk("[%s] origin:%s, shared :%s\n", __func__, name,
> s->name);
>                  if (sysfs_slab_alias(s, name))
>                          return NULL;
>
> ---------
>
>    This patch is used the check which is the sharing kmem_cache for
> "skbuff_fclone_cache".
>
>    I cannot find the "skbuff_fclone_cache" in /proc/slabinfo.
>
>    From my test, the "pool_workqueue" is the real working kmem_cache.
>
>    The "skbuff_fclone_cache" is just a pointer to "pool_workqueue"
> (pwq_cache).
>
>
>    The following private patch is used to record the fclone allocation:
>
>   ---
>   net/ipv4/tcp.c | 19 +++++++++++++++++++
>   1 file changed, 19 insertions(+)
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index c82dc42f57c6..6f31ddcfc017 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -864,6 +864,24 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t
> *ppos,
>   }
>   EXPORT_SYMBOL(tcp_splice_read);
>
> +unsigned long right_num, wrong_num;
> +static void check_fclone(struct sk_buff *skb)
> +{
> +       int n = numa_mem_id(); /* current node */
> +       int node2 = page_to_nid(virt_to_page((unsigned long) skb));
> +
> +       if (n != node2) {
> +               wrong_num++;
> +               if ((wrong_num % 1000) == 999)
> +                       printk(KERN_DEBUG "[%s] current:%d, get from
> :%d, (%ld, %ld, %ld)\n",
> +                               __func__, n, node2, wrong_num,
> right_num, wrong_num * 100 / (wrong_num + right_num));
> +       } else {
> +               right_num++;
> +               if ((right_num % 1000000) == 9999)
> +                       printk("[%s] we received :%ld, %ld\n", __func__,
> right_num, wrong_num);
> +       }
> +}
> +
>   struct sk_buff *tcp_stream_alloc_skb(struct sock *sk, gfp_t gfp,
>                                       bool force_schedule)
>   {
> @@ -884,6 +902,7 @@ struct sk_buff *tcp_stream_alloc_skb(struct sock
> *sk, gfp_t gfp,
>                          skb_reserve(skb, MAX_TCP_HEADER);
>                          skb->ip_summed = CHECKSUM_PARTIAL;
> INIT_LIST_HEAD(&skb->tcp_tsorted_anchor);
> +                       check_fclone(skb);
>                          return skb;
>                  }
>                  __kfree_skb(skb);
> --
>
>    Without this v2 patch, I can get the result after the memcached test:
>
> [ 1027.317645] [check_fclone] current:0, get from :1, (7112999, 9076711, 43)
> [ 1027.317653] [check_fclone] current:0, get from :1, (7112999, 9076707, 43)
> [ 1027.804110] [check_fclone] we received :10009999, 7113326
>
> It means nearly 43% fclone is allocated in the remote node.
>
>
>   With this v2 patch,  I can find the "skbuff_fclone_cache" in
> /proc/slabinfo.
>
> The test result shows below:
>
> [  503.357293] [check_fclone] we received :8009999, 0
> [  503.357293] [check_fclone] we received :8009999, 0
> [  503.357305] [check_fclone] we received :8009999, 0
>
> After v2 patch, I cannot see the wrong fclone in remote node.
>
>
> >
> > Using SLAB_NO_MERGE does not help, I am still seeing wrong allocations
> > on a dual socket
> > host with plenty of available memory.
> > (either sk_buff or skb->head being allocated on the other node).
>
> Do you mean you still can see the wrong fclone after using SLAB_NO_MERGE?
>
> If so, I guess there is bug in the slub.
>
>
> > fclones might be allocated from a cpu running on node A, and freed
> > from a cpu running on node B.
> > Maybe SLUB is not properly handling this case ?
>
> Maybe.
>
>
>
> > SLAB_NO_MERGE will avoid merging fclone with kmalloc-512, it does not
> > really help.
> >
> > I think we need help from mm/slub experts, instead of trying to 'fix'
> > networking stacks.
>
> @Christopher
>
> Any idea about this?

I had a simpler bpftrace program to get an histogram of [my_node,
node(sk_buff), node_of(skb->head)]
and can tell that going back to linux-v6.7 and CONFIG_SLAB=y solved
all the issues for me.

99.9999% of SLAB allocations were on the right node.

This looks like a SLUB bug to me.