Date: Wed, 24 Apr 2024 22:49:00 +0200
From: Sabrina Dubroca <sd@...asysnail.net>
To: Paolo Abeni <pabeni@...hat.com>
Cc: David J Wilder <dwilder@...ibm.com>, netdev@...r.kernel.org,
	edumazet@...gle.com
Subject: Re: [RFC PATCH] net: skb: Increasing allocation in
 __napi_alloc_skb() to 2k when needed.

2024-04-23, 09:56:33 +0200, Paolo Abeni wrote:
> On Fri, 2024-04-19 at 15:23 -0700, David J Wilder wrote:
> > When testing CONFIG_MAX_SKB_FRAGS=45 on ppc64le and x86_64 I ran into a
> > couple of issues.
> > 
> > __napi_alloc_skb() assumes its smallest fragment allocations will fit in
> > 1K. When CONFIG_MAX_SKB_FRAGS is increased this may no longer be true
> > resulting in __napi_alloc_skb() reverting to using page_frag_alloc().
> > This results in the return of the bug fixed in:
> > Commit 3226b158e67c ("net: avoid 32 x truesize under-estimation for
> > tiny skbs")
> > 
> > That commit ensured that "small skb head fragments are kmalloc backed,
> > so that other objects in the slab page can be reused instead of being held
> > as long as skbs are sitting in socket queues."
> > 
> > On ppc64le the warning from napi_get_frags_check() is displayed when
> > CONFIG_MAX_SKB_FRAGS is set to 45. The purpose of the warning is to detect
> > when an increase of MAX_SKB_FRAGS has reintroduced the aforementioned bug.
> > Unfortunately on x86_64 this warning is not seen, even though it should be.
> > I found the warning was disabled by:
> > commit dbae2b062824 ("net: skb: introduce and use a single page frag
> > cache")
> > 
> > This RFC patch to __napi_alloc_skb() determines if an skbuff allocation
> > with a head fragment of size GRO_MAX_HEAD will fit in a 1k allocation,
> > increasing the allocation to 2k if needed.
> > 
> > I have functionally tested this patch, performance testing is still needed.
> > 
> > TBD: Remove the limitation on 4k page size from the single page frag cache
> > allowing ppc64le (64K page size) to benefit from this change.
> > 
> > TBD: I have not addressed the warning in napi_get_frags_check() on x86_64.
> > Will the warning still be needed once the other changes are completed?
> 
> 
> Thanks for the detailed analysis.
> 
> As mentioned by Eric in commit
> bf9f1baa279f0758dc2297080360c5a616843927, it should now be possible to
> revert dbae2b062824 without incurring performance regressions for the
> relevant use-case. That has been on my todo list for a long time, but
> I was never able to set time aside for it.
> 
> I think such a revert would be preferable. Would you be able to
> evaluate that option?

I don't think reverting dbae2b062824 would fix David's issue.

The problem is that with MAX_SKB_FRAGS=45, skb_shared_info becomes
huge, so 1024 is not enough for those small packets, and we use a
pagefrag instead of kmalloc, which makes napi_get_frags_check unhappy.

Even after reverting dbae2b062824, we would still go through the
pagefrag path and not __alloc_skb.

What about something like this?  (boot-tested on x86 only, but I
disabled NAPI_HAS_SMALL_PAGE_FRAG. no perf testing at all.)

-------- 8< --------
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index f85e6989c36c..88923b7b64fe 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -108,6 +108,8 @@ static struct kmem_cache *skbuff_ext_cache __ro_after_init;
 #define SKB_SMALL_HEAD_HEADROOM						\
 	SKB_WITH_OVERHEAD(SKB_SMALL_HEAD_CACHE_SIZE)
 
+#define SKB_SMALL_HEAD_THRESHOLD (SKB_SMALL_HEAD_HEADROOM + NET_SKB_PAD + NET_IP_ALIGN)
+
 int sysctl_max_skb_frags __read_mostly = MAX_SKB_FRAGS;
 EXPORT_SYMBOL(sysctl_max_skb_frags);
 
@@ -726,7 +728,7 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev, unsigned int len,
 	/* If requested length is either too small or too big,
 	 * we use kmalloc() for skb->head allocation.
 	 */
-	if (len <= SKB_WITH_OVERHEAD(1024) ||
+	if (len <= SKB_SMALL_HEAD_THRESHOLD ||
 	    len > SKB_WITH_OVERHEAD(PAGE_SIZE) ||
 	    (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
 		skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE);
@@ -802,7 +804,7 @@ struct sk_buff *napi_alloc_skb(struct napi_struct *napi, unsigned int len)
 	 * When the small frag allocator is available, prefer it over kmalloc
 	 * for small fragments
 	 */
-	if ((!NAPI_HAS_SMALL_PAGE_FRAG && len <= SKB_WITH_OVERHEAD(1024)) ||
+	if ((!NAPI_HAS_SMALL_PAGE_FRAG && len <= SKB_SMALL_HEAD_THRESHOLD) ||
 	    len > SKB_WITH_OVERHEAD(PAGE_SIZE) ||
 	    (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
 		skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX | SKB_ALLOC_NAPI,
-------- 8< --------

(__)napi_alloc_skb extends the GRO_MAX_HEAD size by NET_SKB_PAD +
NET_IP_ALIGN, so I added them here as well. Mainly this is reusing a
size that we know is big enough to fit a small header plus whatever
size skb_shared_info has in the current build. Maybe this could be
max(SKB_WITH_OVERHEAD(1024), <...>) to preserve the current behavior
with MAX_SKB_FRAGS=17, since in that case
SKB_WITH_OVERHEAD(1024) > SKB_SMALL_HEAD_HEADROOM.

-- 
Sabrina

