netdev - Re: [PATCH] net: skbuff: allocate the fclone in the current NUMA node

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CANn89i+Cr1Tbdxqy6fB-sOLca+AHFc-3-0xGktVUsQFFMVsC0A@mail.gmail.com>
Date: Sat, 24 Feb 2024 20:07:55 +0100
From: Eric Dumazet <edumazet@...gle.com>
To: Shijie Huang <shijie@...eremail.onmicrosoft.com>, 
	Jesper Dangaard Brouer <hawk@...nel.org>
Cc: Huang Shijie <shijie@...amperecomputing.com>, kuba@...nel.org, 
	patches@...erecomputing.com, davem@...emloft.net, horms@...nel.org, 
	ast@...nel.org, dhowells@...hat.com, linyunsheng@...wei.com, 
	aleksander.lobakin@...el.com, linux-kernel@...r.kernel.org, 
	netdev@...r.kernel.org, cl@...amperecomputing.com
Subject: Re: [PATCH] net: skbuff: allocate the fclone in the current NUMA node

On Tue, Feb 20, 2024 at 9:37 AM Shijie Huang
<shijie@...eremail.onmicrosoft.com> wrote:
>
>
> 在 2024/2/20 16:17, Eric Dumazet 写道:
> > On Tue, Feb 20, 2024 at 7:26 AM Shijie Huang
> > <shijie@...eremail.onmicrosoft.com> wrote:
> >>
> >> 在 2024/2/20 13:32, Eric Dumazet 写道:
> >>> On Tue, Feb 20, 2024 at 3:18 AM Huang Shijie
> >>> <shijie@...amperecomputing.com> wrote:
> >>>> The current code passes NUMA_NO_NODE to __alloc_skb(), we found
> >>>> it may creates fclone SKB in remote NUMA node.
> >>> This is intended (WAI)
> >> Okay. thanks a lot.
> >>
> >> It seems I should fix the issue in other code, not the networking.
> >>
> >>> What about the NUMA policies of the current thread ?
> >> We use "numactl -m 0" for memcached, the NUMA policy should allocate
> >> fclone in
> >>
> >> node 0, but we can see many fclones were allocated in node 1.
> >>
> >> We have enough memory to allocate these fclones in node 0.
> >>
> >>> Has NUMA_NO_NODE behavior changed recently?
> >> I guess not.
> >>> What means : "it may creates" ? Please be more specific.
> >> When we use the memcached for testing in NUMA, there are maybe 20% ~ 30%
> >> fclones were allocated in
> >>
> >> remote NUMA node.
> > Interesting, how was it measured exactly ?
>
> I created a private patch to record the status for each fclone allocation.
>
>
> > Are you using SLUB or SLAB ?
>
> I think I use SLUB. (CONFIG_SLUB=y,
> CONFIG_SLAB_MERGE_DEFAULT=y,CONFIG_SLUB_CPU_PARTIAL=y)
>

A similar issue comes from tx_action() calling __napi_kfree_skb() on
arbitrary skbs
including ones that were allocated on a different NUMA node.

This pollutes per-cpu caches with not optimally placed sk_buff :/

Although this should not impact fclones, __napi_kfree_skb() only ?

commit 15fad714be86eab13e7568fecaf475b2a9730d3e
Author: Jesper Dangaard Brouer <brouer@...hat.com>
Date:   Mon Feb 8 13:15:04 2016 +0100

    net: bulk free SKBs that were delay free'ed due to IRQ context

What about :

diff --git a/net/core/dev.c b/net/core/dev.c
index c588808be77f563c429eb4a2eaee5c8062d99582..63165138c6f690e14520f11e32dc16f2845abad4
100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5162,11 +5162,7 @@ static __latent_entropy void
net_tx_action(struct softirq_action *h)
                                trace_kfree_skb(skb, net_tx_action,
                                                get_kfree_skb_cb(skb)->reason);

-                       if (skb->fclone != SKB_FCLONE_UNAVAILABLE)
-                               __kfree_skb(skb);
-                       else
-                               __napi_kfree_skb(skb,
-                                                get_kfree_skb_cb(skb)->reason);
+                       __kfree_skb(skb);
                }
        }