Message-Id: <20251014035846.1519-1-21cnbao@gmail.com>
Date: Tue, 14 Oct 2025 11:58:46 +0800
From: Barry Song <21cnbao@...il.com>
To: edumazet@...gle.com
Cc: 21cnbao@...il.com,
corbet@....net,
davem@...emloft.net,
hannes@...xchg.org,
horms@...nel.org,
jackmanb@...gle.com,
kuba@...nel.org,
kuniyu@...gle.com,
linux-doc@...r.kernel.org,
linux-kernel@...r.kernel.org,
linux-mm@...ck.org,
linyunsheng@...wei.com,
mhocko@...e.com,
netdev@...r.kernel.org,
pabeni@...hat.com,
surenb@...gle.com,
v-songbaohua@...o.com,
vbabka@...e.cz,
willemb@...gle.com,
zhouhuacai@...o.com,
ziy@...dia.com
Subject: Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation
> >
> > diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst
> > index 2ef50828aff1..b903bbae239c 100644
> > --- a/Documentation/admin-guide/sysctl/net.rst
> > +++ b/Documentation/admin-guide/sysctl/net.rst
> > @@ -415,18 +415,6 @@ GRO has decided not to coalesce, it is placed on a per-NAPI list. This
> > list is then passed to the stack when the number of segments reaches the
> > gro_normal_batch limit.
> >
> > -high_order_alloc_disable
> > -------------------------
> > -
> > -By default the allocator for page frags tries to use high order pages (order-3
> > -on x86). While the default behavior gives good results in most cases, some users
> > -might have hit a contention in page allocations/freeing. This was especially
> > -true on older kernels (< 5.14) when high-order pages were not stored on per-cpu
> > -lists. This allows to opt-in for order-0 allocation instead but is now mostly of
> > -historical importance.
> > -
>
> The sysctl is quite useful for testing purposes, say on a freshly
> booted host, with plenty of free memory.
>
> Also, having order-3 pages if possible is quite important for IOMMU use cases.
>
> Perhaps kswapd should have some kind of heuristic to not start if a
> recent run has already happened.
I don’t understand why kswapd shouldn’t start when users keep requesting
order-3 allocations and asking it to prepare order-3 memory. It doesn’t
make sense to skip a run just because earlier requests were already
satisfied.
>
> I am guessing phones do not need to send 1.6 Tbit per second on
> network devices (yet),
> an option could be to disable it in your boot scripts.
A problem with the existing sysctl is that it only covers the TX path;
on the RX path we also see kswapd consuming significant power.
I could add the patch below to extend the sysctl to the RX path, but that
feels like a layering violation: the RX path allocator (page_frag_cache)
lives in mm and is meant to serve generic users rather than networking
alone, even though its current callers are primarily network-related.

diff --git a/mm/page_frag_cache.c b/mm/page_frag_cache.c
index d2423f30577e..8ad18ec49f39 100644
--- a/mm/page_frag_cache.c
+++ b/mm/page_frag_cache.c
@@ -18,6 +18,7 @@
 #include <linux/init.h>
 #include <linux/mm.h>
 #include <linux/page_frag_cache.h>
+#include <net/sock.h>
 #include "internal.h"
 static unsigned long encoded_page_create(struct page *page, unsigned int order,
@@ -54,10 +55,12 @@ static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
 	gfp_t gfp = gfp_mask;
 #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
-	gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) | __GFP_COMP |
-		   __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
-	page = __alloc_pages(gfp_mask, PAGE_FRAG_CACHE_MAX_ORDER,
-			     numa_mem_id(), NULL);
+	if (!static_branch_unlikely(&net_high_order_alloc_disable_key)) {
+		gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) | __GFP_COMP |
+			   __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
+		page = __alloc_pages(gfp_mask, PAGE_FRAG_CACHE_MAX_ORDER,
+				     numa_mem_id(), NULL);
+	}
 #endif
 	if (unlikely(!page)) {

Do you have a better idea on how to make the sysctl also cover the RX path?
Thanks
Barry
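
For readers following the thread, the control flow that the diff above
proposes can be sketched in userspace C. This is only an illustration under
stated assumptions, not kernel code: every demo_* name and every flag value
below is hypothetical, demo_high_order_disabled stands in for the
net_high_order_alloc_disable_key static key, and demo_alloc() is a mock that
merely counts high-order attempts. As in the diff, the high-order attempt
clears only the direct-reclaim bit, so the kswapd bit stays set on that
request.

```c
#include <stdbool.h>

/* Hypothetical flag bits; the real values live in the kernel's gfp headers. */
#define DEMO_GFP_DIRECT_RECLAIM 0x1u
#define DEMO_GFP_KSWAPD_RECLAIM 0x2u
#define DEMO_GFP_NORETRY        0x4u
#define DEMO_HIGH_ORDER         3u	/* order-3, as in the quoted sysctl text */

/* Stand-in for the net_high_order_alloc_disable_key static key. */
static bool demo_high_order_disabled;

/* Counts how many high-order requests reached the mock allocator. */
static unsigned int demo_high_order_attempts;

/* What the refill path ended up with: the order and the mask it used. */
struct demo_result {
	unsigned int order;
	unsigned int gfp;
};

/* Mock allocator: order-0 always succeeds, high orders fail if fragmented. */
static bool demo_alloc(unsigned int order, bool fragmented)
{
	if (order > 0)
		demo_high_order_attempts++;
	return order == 0 || !fragmented;
}

/* Mirrors the gated refill flow: optional high-order try, then order-0. */
static struct demo_result demo_refill(unsigned int gfp_mask, bool fragmented)
{
	struct demo_result res = { 0u, gfp_mask };
	bool have_page = false;

	if (!demo_high_order_disabled) {
		/* Only direct reclaim is cleared; the kswapd bit is kept. */
		unsigned int hi_gfp = (gfp_mask & ~DEMO_GFP_DIRECT_RECLAIM) |
				      DEMO_GFP_NORETRY;

		if (demo_alloc(DEMO_HIGH_ORDER, fragmented)) {
			res.order = DEMO_HIGH_ORDER;
			res.gfp = hi_gfp;
			have_page = true;
		}
	}

	/* Fallback: order-0 with the caller's original mask. */
	if (!have_page)
		(void)demo_alloc(0u, fragmented);

	return res;
}
```

With the flag set, demo_refill() never issues a high-order request, which is
the property the sysctl would need on the RX path; with it clear, each refill
on a fragmented system still makes one high-order attempt before falling back.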