Message-ID: <06a1cdb330980e3df051c95ae089fd77afee839b.camel@redhat.com>
Date:   Sun, 02 Oct 2022 16:55:13 +0200
From:   Paolo Abeni <pabeni@...hat.com>
To:     Eric Dumazet <edumazet@...gle.com>
Cc:     patchwork-bot+netdevbpf@...nel.org,
        netdev <netdev@...r.kernel.org>,
        David Miller <davem@...emloft.net>,
        Jakub Kicinski <kuba@...nel.org>,
        Alexander Duyck <alexanderduyck@...com>
Subject: Re: [PATCH net-next v4] net: skb: introduce and use a single page
 frag cache

On Fri, 2022-09-30 at 10:45 -0700, Eric Dumazet wrote:
> On Fri, Sep 30, 2022 at 10:30 AM Paolo Abeni <pabeni@...hat.com> wrote:
> > On Fri, 2022-09-30 at 09:43 -0700, Eric Dumazet wrote:
> > > Paolo, this patch adds a regression for TCP RPC workloads (aka TCP_RR)
> > > 
> > > Before the patch, cpus servicing NIC interrupts were allocating
> > > SLAB/SLUB objects for incoming packets,
> > > but they were also freeing skbs from TCP rtx queues when ACK packets
> > > were processed. SLAB/SLUB caches
> > > were efficient (hit ratio close to 100%)
> > 
> > Thank you for the report. Is that reproducible with netperf TCP_RR and
> > CONFIG_DEBUG_SLAB, I guess? Do I need specific request/response sizes?
> 
> No CONFIG_DEBUG_SLAB, simply standard SLAB, and tcp_rr tests on an AMD
> host with 256 cpus...

The most similar host I can easily grab is a 2-NUMA-node AMD with 16
cores/32 threads per node.

I tried tcp_rr with different numbers of flows in the 1-2048 range,
with both slub (tried first, as it is the default allocator in my
builds) and slab, but so far I can't reproduce it: the differences
between the pre-patch and post-patch kernels are within noise, and
numastat never shows misses.

I'm likely missing some ingredient in the recipe. Next I'll try pinning
the netperf/netserver processes on a NUMA node different from the
NIC's, and increasing the number of concurrent flows.

I'm also wondering: after commit 68822bdf76f10 ("net: generalize skb
freeing deferral to per-cpu lists"), the CPUs servicing the NIC
interrupts both allocate and (via the deferral) free the memory for the
incoming packets, so they should not have to access remote caches?! Or
at least, isn't the allocator behavior asymmetric anyway, with rx
alloc, rx free and tx free on the same core and tx alloc possibly on a
different one?
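
For reference, my rough mental model of the per-CPU deferral added by
that commit, as a simplified sketch (the helper and field names below
are illustrative, not the actual kernel code; the real logic lives in
skb_attempt_defer_free() in net/core/skbuff.c):

/* Simplified model of the per-cpu skb free deferral: when a core frees
 * an skb allocated by a different core, queue it back to the allocating
 * core instead of freeing it locally, so slab frees stay on the cache
 * that did the allocation. Names are illustrative only.
 */
struct skb_defer_list {
	spinlock_t		lock;
	struct sk_buff		*head;	/* skbs waiting for the owning cpu */
};

static DEFINE_PER_CPU(struct skb_defer_list, skb_defer_lists);

static void defer_free_to_alloc_cpu(struct sk_buff *skb)
{
	int cpu = skb->alloc_cpu;	/* cpu that allocated this skb */
	struct skb_defer_list *dl;
	unsigned long flags;

	if (cpu == raw_smp_processor_id() || !cpu_online(cpu)) {
		__kfree_skb(skb);	/* same cpu (or gone): free directly */
		return;
	}

	/* park the skb on the allocating cpu's list; that cpu drains it
	 * from its own softirq context, keeping alloc and free local
	 */
	dl = &per_cpu(skb_defer_lists, cpu);
	spin_lock_irqsave(&dl->lock, flags);
	skb->next = dl->head;
	dl->head = skb;
	spin_unlock_irqrestore(&dl->lock, flags);
	/* the real code also kicks the remote cpu (IPI/NAPI schedule) so
	 * the list gets drained promptly; omitted here
	 */
}

If that model holds, an skb freed while processing ACKs should normally
be queued back to the CPU that allocated it, which is what makes the
reported remote-cache traffic surprising to me.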

> 
> We could first try in tcp_stream_alloc_skb()

I'll try to test something like the following - after reproducing the
issue.
---
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index f15d5b62539b..d5e9be98e8bd 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1308,6 +1308,30 @@ static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
 	return __alloc_skb(size, priority, SKB_ALLOC_FCLONE, NUMA_NO_NODE);
 }
 
+#if PAGE_SIZE == SZ_4K
+
+#define NAPI_HAS_SMALL_PAGE_FRAG	1
+
+struct sk_buff *__alloc_skb_fclone_frag(unsigned int size, gfp_t priority);
+
+static inline struct sk_buff *alloc_skb_fclone_frag(unsigned int size, gfp_t priority)
+{
+	if (size <= SKB_WITH_OVERHEAD(1024))
+		return __alloc_skb_fclone_frag(size, priority);
+
+	return alloc_skb_fclone(size, priority);
+}
+
+#else
+#define NAPI_HAS_SMALL_PAGE_FRAG	0
+
+static inline struct sk_buff *alloc_skb_fclone_frag(unsigned int size, gfp_t priority)
+{
+	return alloc_skb_fclone(size, priority);
+}
+
+#endif
+
 struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src);
 void skb_headers_offset_update(struct sk_buff *skb, int off);
 int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 81d63f95e865..0c63653c9951 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -134,9 +134,8 @@ static void skb_under_panic(struct sk_buff *skb, unsigned int sz, void *addr)
 #define NAPI_SKB_CACHE_BULK	16
 #define NAPI_SKB_CACHE_HALF	(NAPI_SKB_CACHE_SIZE / 2)
 
-#if PAGE_SIZE == SZ_4K
+#if NAPI_HAS_SMALL_PAGE_FRAG
 
-#define NAPI_HAS_SMALL_PAGE_FRAG	1
 #define NAPI_SMALL_PAGE_PFMEMALLOC(nc)	((nc).pfmemalloc)
 
 /* specialized page frag allocator using a single order 0 page
@@ -173,12 +172,12 @@ static void *page_frag_alloc_1k(struct page_frag_1k *nc, gfp_t gfp)
 	nc->offset = offset;
 	return nc->va + offset;
 }
+
 #else
 
 /* the small page is actually unused in this build; add dummy helpers
  * to please the compiler and avoid later preprocessor's conditionals
  */
-#define NAPI_HAS_SMALL_PAGE_FRAG	0
 #define NAPI_SMALL_PAGE_PFMEMALLOC(nc)	false
 
 struct page_frag_1k {
@@ -543,6 +542,52 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 }
 EXPORT_SYMBOL(__alloc_skb);
 
+#if NAPI_HAS_SMALL_PAGE_FRAG
+/* optimized skb fast-clone allocation variant, using the small
+ * page frag cache
+ */
+struct sk_buff *__alloc_skb_fclone_frag(unsigned int size, gfp_t gfp_mask)
+{
+	struct sk_buff_fclones *fclones;
+	struct napi_alloc_cache *nc;
+	struct sk_buff *skb;
+	bool pfmemalloc;
+	void *data;
+
+	/* Get the HEAD */
+	skb = kmem_cache_alloc_node(skbuff_fclone_cache, gfp_mask & ~GFP_DMA, NUMA_NO_NODE);
+	if (unlikely(!skb))
+		return NULL;
+	prefetchw(skb);
+
+	/* We can access the napi cache only in BH context */
+	local_bh_disable();
+	nc = this_cpu_ptr(&napi_alloc_cache);
+	data = page_frag_alloc_1k(&nc->page_small, gfp_mask);
+	pfmemalloc = nc->page_small.pfmemalloc;
+	local_bh_enable();
+	if (unlikely(!data))
+		goto nodata;
+
+	prefetchw(data + SKB_WITH_OVERHEAD(SZ_1K));
+
+	memset(skb, 0, offsetof(struct sk_buff, tail));
+	__build_skb_around(skb, data, SZ_1K);
+	skb->pfmemalloc = pfmemalloc;
+	skb->fclone = SKB_FCLONE_ORIG;
+	skb->head_frag = 1;
+
+	fclones = container_of(skb, struct sk_buff_fclones, skb1);
+	refcount_set(&fclones->fclone_ref, 1);
+
+	return skb;
+
+nodata:
+	kmem_cache_free(skbuff_fclone_cache, skb);
+	return NULL;
+}
+#endif
+
 /**
  *	__netdev_alloc_skb - allocate an skbuff for rx on a specific device
  *	@dev: network device to receive on
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 829beee3fa32..3bcc3e1d9b19 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -858,7 +858,7 @@ struct sk_buff *tcp_stream_alloc_skb(struct sock *sk, int size, gfp_t gfp,
 {
 	struct sk_buff *skb;
 
-	skb = alloc_skb_fclone(size + MAX_TCP_HEADER, gfp);
+	skb = alloc_skb_fclone_frag(size + MAX_TCP_HEADER, gfp);
 	if (likely(skb)) {
 		bool mem_scheduled;
 

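For completeness, the fclone variant above leans on the small-page frag
cache from the original patch: a single order-0 page carved into 1KB
chunks, with the page refcount pre-charged so each chunk holds its own
reference. A minimal sketch of that scheme (my reading of the existing
page_frag_alloc_1k(), illustrative rather than the exact code):

/* Sketch of the 1KB-granularity frag allocator backed by one order-0
 * page: chunks are handed out top-down and the page reference count is
 * raised once per remaining chunk up front, so each 1KB head owns one
 * reference and is released independently when the skb head is freed.
 */
static void *page_frag_alloc_1k_sketch(struct page_frag_1k *nc, gfp_t gfp)
{
	struct page *page;
	int offset;

	/* try to carve the next 1KB chunk out of the current page */
	offset = nc->offset - SZ_1K;
	if (likely(offset >= 0))
		goto use_frag;

	/* current page exhausted (or first use): grab a fresh page */
	page = alloc_pages_node(NUMA_NO_NODE, gfp, 0);
	if (!page)
		return NULL;

	nc->va = page_address(page);
	nc->pfmemalloc = page_is_pfmemalloc(page);

	/* pre-charge one reference per chunk beyond the first */
	offset = PAGE_SIZE - SZ_1K;
	page_ref_add(page, offset / SZ_1K);

use_frag:
	nc->offset = offset;
	return nc->va + offset;
}

The idea of wiring this into tcp_stream_alloc_skb() via
alloc_skb_fclone_frag() is that small TCP transmit heads (up to
SKB_WITH_OVERHEAD(1024)) would then come from this per-cpu frag cache
instead of kmalloc, avoiding the slab traffic discussed above -
assuming the regression really comes from there.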