[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-Id: <20251013101636.69220-1-21cnbao@gmail.com>
Date: Mon, 13 Oct 2025 18:16:36 +0800
From: Barry Song <21cnbao@...il.com>
To: netdev@...r.kernel.org,
linux-mm@...ck.org,
linux-doc@...r.kernel.org
Cc: linux-kernel@...r.kernel.org,
Barry Song <v-songbaohua@...o.com>,
Jonathan Corbet <corbet@....net>,
Eric Dumazet <edumazet@...gle.com>,
Kuniyuki Iwashima <kuniyu@...gle.com>,
Paolo Abeni <pabeni@...hat.com>,
Willem de Bruijn <willemb@...gle.com>,
"David S. Miller" <davem@...emloft.net>,
Jakub Kicinski <kuba@...nel.org>,
Simon Horman <horms@...nel.org>,
Vlastimil Babka <vbabka@...e.cz>,
Suren Baghdasaryan <surenb@...gle.com>,
Michal Hocko <mhocko@...e.com>,
Brendan Jackman <jackmanb@...gle.com>,
Johannes Weiner <hannes@...xchg.org>,
Zi Yan <ziy@...dia.com>,
Yunsheng Lin <linyunsheng@...wei.com>,
Huacai Zhou <zhouhuacai@...o.com>
Subject: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation
From: Barry Song <v-songbaohua@...o.com>
On phones, we have observed significant phone heating when running apps
with high network bandwidth. This is caused by the network stack frequently
waking kswapd for order-3 allocations. As a result, memory reclamation becomes
constantly active, even though plenty of memory is still available for network
allocations which can fall back to order-0.
Commit ce27ec60648d ("net: add high_order_alloc_disable sysctl/static key")
introduced high_order_alloc_disable for the transmit (TX) path
(skb_page_frag_refill()) to mitigate some memory reclamation issues,
allowing the TX path to fall back to order-0 immediately, while leaving the
receive (RX) path (__page_frag_cache_refill()) unaffected. Users are
generally unaware of the sysctl and cannot easily adjust it for specific use
cases. Enabling high_order_alloc_disable also completely disables the
benefit of order-3 allocations. Additionally, the sysctl does not apply to the
RX path.
An alternative approach is to disable kswapd for these frequent
allocations and provide best-effort order-3 service for both TX and RX paths,
while removing the sysctl entirely.
Cc: Jonathan Corbet <corbet@....net>
Cc: Eric Dumazet <edumazet@...gle.com>
Cc: Kuniyuki Iwashima <kuniyu@...gle.com>
Cc: Paolo Abeni <pabeni@...hat.com>
Cc: Willem de Bruijn <willemb@...gle.com>
Cc: "David S. Miller" <davem@...emloft.net>
Cc: Jakub Kicinski <kuba@...nel.org>
Cc: Simon Horman <horms@...nel.org>
Cc: Vlastimil Babka <vbabka@...e.cz>
Cc: Suren Baghdasaryan <surenb@...gle.com>
Cc: Michal Hocko <mhocko@...e.com>
Cc: Brendan Jackman <jackmanb@...gle.com>
Cc: Johannes Weiner <hannes@...xchg.org>
Cc: Zi Yan <ziy@...dia.com>
Cc: Yunsheng Lin <linyunsheng@...wei.com>
Cc: Huacai Zhou <zhouhuacai@...o.com>
Signed-off-by: Barry Song <v-songbaohua@...o.com>
---
Documentation/admin-guide/sysctl/net.rst | 12 ------------
include/net/sock.h | 1 -
mm/page_frag_cache.c | 2 +-
net/core/sock.c | 8 ++------
net/core/sysctl_net_core.c | 7 -------
5 files changed, 3 insertions(+), 27 deletions(-)
diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst
index 2ef50828aff1..b903bbae239c 100644
--- a/Documentation/admin-guide/sysctl/net.rst
+++ b/Documentation/admin-guide/sysctl/net.rst
@@ -415,18 +415,6 @@ GRO has decided not to coalesce, it is placed on a per-NAPI list. This
list is then passed to the stack when the number of segments reaches the
gro_normal_batch limit.
-high_order_alloc_disable
-------------------------
-
-By default the allocator for page frags tries to use high order pages (order-3
-on x86). While the default behavior gives good results in most cases, some users
-might have hit a contention in page allocations/freeing. This was especially
-true on older kernels (< 5.14) when high-order pages were not stored on per-cpu
-lists. This allows to opt-in for order-0 allocation instead but is now mostly of
-historical importance.
-
-Default: 0
-
2. /proc/sys/net/unix - Parameters for Unix domain sockets
----------------------------------------------------------
diff --git a/include/net/sock.h b/include/net/sock.h
index 60bcb13f045c..62306c1095d5 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -3011,7 +3011,6 @@ extern __u32 sysctl_wmem_default;
extern __u32 sysctl_rmem_default;
#define SKB_FRAG_PAGE_ORDER get_order(32768)
-DECLARE_STATIC_KEY_FALSE(net_high_order_alloc_disable_key);
static inline int sk_get_wmem0(const struct sock *sk, const struct proto *proto)
{
diff --git a/mm/page_frag_cache.c b/mm/page_frag_cache.c
index d2423f30577e..dd36114dd16f 100644
--- a/mm/page_frag_cache.c
+++ b/mm/page_frag_cache.c
@@ -54,7 +54,7 @@ static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
gfp_t gfp = gfp_mask;
#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
- gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) | __GFP_COMP |
+ gfp_mask = (gfp_mask & ~__GFP_RECLAIM) | __GFP_COMP |
__GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
page = __alloc_pages(gfp_mask, PAGE_FRAG_CACHE_MAX_ORDER,
numa_mem_id(), NULL);
diff --git a/net/core/sock.c b/net/core/sock.c
index dc03d4b5909a..1fa1e9177d86 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -3085,8 +3085,6 @@ static void sk_leave_memory_pressure(struct sock *sk)
}
}
-DEFINE_STATIC_KEY_FALSE(net_high_order_alloc_disable_key);
-
/**
* skb_page_frag_refill - check that a page_frag contains enough room
* @sz: minimum size of the fragment we want to get
@@ -3110,10 +3108,8 @@ bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t gfp)
}
pfrag->offset = 0;
- if (SKB_FRAG_PAGE_ORDER &&
- !static_branch_unlikely(&net_high_order_alloc_disable_key)) {
- /* Avoid direct reclaim but allow kswapd to wake */
- pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
+ if (SKB_FRAG_PAGE_ORDER) {
+ pfrag->page = alloc_pages((gfp & ~__GFP_RECLAIM) |
__GFP_COMP | __GFP_NOWARN |
__GFP_NORETRY,
SKB_FRAG_PAGE_ORDER);
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index 8cf04b57ade1..181f6532beb8 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -599,13 +599,6 @@ static struct ctl_table net_core_table[] = {
.extra1 = SYSCTL_ZERO,
.extra2 = SYSCTL_THREE,
},
- {
- .procname = "high_order_alloc_disable",
- .data = &net_high_order_alloc_disable_key.key,
- .maxlen = sizeof(net_high_order_alloc_disable_key),
- .mode = 0644,
- .proc_handler = proc_do_static_key,
- },
{
.procname = "gro_normal_batch",
.data = &net_hotdata.gro_normal_batch,
--
2.39.3 (Apple Git-146)
Powered by blists - more mailing lists