Message-ID: <dhmafwxu2jj4lu6acoqdhqh46k33sbsj5jvepcfzly4c7dn2t7@ln5dgubll4ac>
Date: Mon, 13 Oct 2025 14:35:28 -0700
From: Shakeel Butt <shakeel.butt@...ux.dev>
To: Vlastimil Babka <vbabka@...e.cz>
Cc: Barry Song <21cnbao@...il.com>, netdev@...r.kernel.org, 
	linux-mm@...ck.org, linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org, 
	Barry Song <v-songbaohua@...o.com>, Jonathan Corbet <corbet@....net>, 
	Eric Dumazet <edumazet@...gle.com>, Kuniyuki Iwashima <kuniyu@...gle.com>, 
	Paolo Abeni <pabeni@...hat.com>, Willem de Bruijn <willemb@...gle.com>, 
	"David S. Miller" <davem@...emloft.net>, Jakub Kicinski <kuba@...nel.org>, 
	Simon Horman <horms@...nel.org>, Suren Baghdasaryan <surenb@...gle.com>, 
	Michal Hocko <mhocko@...e.com>, Brendan Jackman <jackmanb@...gle.com>, 
	Johannes Weiner <hannes@...xchg.org>, Zi Yan <ziy@...dia.com>, Yunsheng Lin <linyunsheng@...wei.com>, 
	Huacai Zhou <zhouhuacai@...o.com>, Alexei Starovoitov <alexei.starovoitov@...il.com>, 
	Harry Yoo <harry.yoo@...cle.com>, David Hildenbrand <david@...hat.com>, 
	Matthew Wilcox <willy@...radead.org>, Roman Gushchin <roman.gushchin@...ux.dev>
Subject: Re: [RFC PATCH] mm: net: disable kswapd for high-order network
 buffer allocation

On Mon, Oct 13, 2025 at 08:30:13PM +0200, Vlastimil Babka wrote:
> On 10/13/25 12:16, Barry Song wrote:
> > From: Barry Song <v-songbaohua@...o.com>
> > 
> > On phones, we have observed significant phone heating when running apps
> > with high network bandwidth. This is caused by the network stack frequently
> > waking kswapd for order-3 allocations. As a result, memory reclamation becomes
> > constantly active, even though plenty of memory is still available for network
> > allocations which can fall back to order-0.
> > 
> > Commit ce27ec60648d ("net: add high_order_alloc_disable sysctl/static key")
> > introduced high_order_alloc_disable for the transmit (TX) path
> > (skb_page_frag_refill()) to mitigate some memory reclamation issues,
> > allowing the TX path to fall back to order-0 immediately, while leaving the
> > receive (RX) path (__page_frag_cache_refill()) unaffected. Users are
> > generally unaware of the sysctl and cannot easily adjust it for specific use
> > cases. Enabling high_order_alloc_disable also completely disables the
> > benefit of order-3 allocations. Additionally, the sysctl does not apply to the
> > RX path.
> > 
> > An alternative approach is to disable kswapd for these frequent
> > allocations and provide best-effort order-3 service for both TX and RX paths,
> > while removing the sysctl entirely.
> > 
> > Cc: Jonathan Corbet <corbet@....net>
> > Cc: Eric Dumazet <edumazet@...gle.com>
> > Cc: Kuniyuki Iwashima <kuniyu@...gle.com>
> > Cc: Paolo Abeni <pabeni@...hat.com>
> > Cc: Willem de Bruijn <willemb@...gle.com>
> > Cc: "David S. Miller" <davem@...emloft.net>
> > Cc: Jakub Kicinski <kuba@...nel.org>
> > Cc: Simon Horman <horms@...nel.org>
> > Cc: Vlastimil Babka <vbabka@...e.cz>
> > Cc: Suren Baghdasaryan <surenb@...gle.com>
> > Cc: Michal Hocko <mhocko@...e.com>
> > Cc: Brendan Jackman <jackmanb@...gle.com>
> > Cc: Johannes Weiner <hannes@...xchg.org>
> > Cc: Zi Yan <ziy@...dia.com>
> > Cc: Yunsheng Lin <linyunsheng@...wei.com>
> > Cc: Huacai Zhou <zhouhuacai@...o.com>
> > Signed-off-by: Barry Song <v-songbaohua@...o.com>
> > ---
> >  Documentation/admin-guide/sysctl/net.rst | 12 ------------
> >  include/net/sock.h                       |  1 -
> >  mm/page_frag_cache.c                     |  2 +-
> >  net/core/sock.c                          |  8 ++------
> >  net/core/sysctl_net_core.c               |  7 -------
> >  5 files changed, 3 insertions(+), 27 deletions(-)
> > 
> > diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst
> > index 2ef50828aff1..b903bbae239c 100644
> > --- a/Documentation/admin-guide/sysctl/net.rst
> > +++ b/Documentation/admin-guide/sysctl/net.rst
> > @@ -415,18 +415,6 @@ GRO has decided not to coalesce, it is placed on a per-NAPI list. This
> >  list is then passed to the stack when the number of segments reaches the
> >  gro_normal_batch limit.
> >  
> > -high_order_alloc_disable
> > -------------------------
> > -
> > -By default the allocator for page frags tries to use high order pages (order-3
> > -on x86). While the default behavior gives good results in most cases, some users
> > -might have hit a contention in page allocations/freeing. This was especially
> > -true on older kernels (< 5.14) when high-order pages were not stored on per-cpu
> > -lists. This allows to opt-in for order-0 allocation instead but is now mostly of
> > -historical importance.
> > -
> > -Default: 0
> > -
> >  2. /proc/sys/net/unix - Parameters for Unix domain sockets
> >  ----------------------------------------------------------
> >  
> > diff --git a/include/net/sock.h b/include/net/sock.h
> > index 60bcb13f045c..62306c1095d5 100644
> > --- a/include/net/sock.h
> > +++ b/include/net/sock.h
> > @@ -3011,7 +3011,6 @@ extern __u32 sysctl_wmem_default;
> >  extern __u32 sysctl_rmem_default;
> >  
> >  #define SKB_FRAG_PAGE_ORDER	get_order(32768)
> > -DECLARE_STATIC_KEY_FALSE(net_high_order_alloc_disable_key);
> >  
> >  static inline int sk_get_wmem0(const struct sock *sk, const struct proto *proto)
> >  {
> > diff --git a/mm/page_frag_cache.c b/mm/page_frag_cache.c
> > index d2423f30577e..dd36114dd16f 100644
> > --- a/mm/page_frag_cache.c
> > +++ b/mm/page_frag_cache.c
> > @@ -54,7 +54,7 @@ static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
> >  	gfp_t gfp = gfp_mask;
> >  
> >  #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
> > -	gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) |  __GFP_COMP |
> > +	gfp_mask = (gfp_mask & ~__GFP_RECLAIM) |  __GFP_COMP |
> >  		   __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
> 
> I'm a bit worried about proliferating "~__GFP_RECLAIM" allocations now that
> we introduced alloc_pages_nolock() and kmalloc_nolock() where it's
> interpreted as "cannot spin" - see gfpflags_allow_spinning(). Currently it's
> fine for the page allocator itself where we have a different entry point
> that uses ALLOC_TRYLOCK, but it can affect nested allocations of all kinds
> of debugging and accounting metadata (page_owner, memcg, alloc tags for slab
> objects etc). kmalloc_nolock() relies on gfpflags_allow_spinning() fully.
> 
> I wonder if we should either:
> 
> 1) sacrifice a new __GFP flag specifically for "!allow_spin" case to
> determine it precisely.
> 
> 2) keep __GFP_KSWAPD_RECLAIM for allocations that currently remove it only to
> avoid being disruptive (like proposed here), but that can in fact allow
> spinning. Instead, decide not to wake up kswapd for those when other
> information indicates it's an opportunistic allocation
> (~__GFP_DIRECT_RECLAIM, __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC,
> order > 0...)
> 
> 3) something better?
> 

For the !allow_spin allocations, I think we should just add a new __GFP
flag instead of adding more complexity to other allocators which may or
may not want kswapd wakeup for many different reasons.
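
To make the distinction concrete, here is a rough user-space sketch, not
kernel code. It assumes gfpflags_allow_spinning() boils down to testing
the __GFP_RECLAIM bits, as I understand the current helper. The bit values
are made up, and __GFP_NO_SPIN_SKETCH is a hypothetical name invented for
this example; no such flag exists today.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int gfp_t;

/* Illustrative bit values only; the real definitions are in the kernel headers. */
#define __GFP_DIRECT_RECLAIM	0x1u
#define __GFP_KSWAPD_RECLAIM	0x2u
#define __GFP_RECLAIM		(__GFP_DIRECT_RECLAIM | __GFP_KSWAPD_RECLAIM)
/* Hypothetical flag, name invented for this sketch; nothing like it exists yet. */
#define __GFP_NO_SPIN_SKETCH	0x4u

/* Assumed current inference: a mask with no reclaim bits means "cannot spin". */
static bool allow_spinning_inferred(gfp_t gfp)
{
	return (gfp & __GFP_RECLAIM) != 0;
}

/* With a dedicated flag, only callers that set it are treated as !allow_spin. */
static bool allow_spinning_explicit(gfp_t gfp)
{
	return !(gfp & __GFP_NO_SPIN_SKETCH);
}

/* Reclaim bits as the RFC would leave them (other flags omitted here). */
static gfp_t frag_refill_mask(gfp_t gfp)
{
	return gfp & ~__GFP_RECLAIM;
}

/* Self-check of the model above. */
static void check_model(void)
{
	gfp_t frag = frag_refill_mask(__GFP_RECLAIM);
	gfp_t nolock = __GFP_NO_SPIN_SKETCH;

	assert(!allow_spinning_inferred(frag));   /* misread as "cannot spin" */
	assert(allow_spinning_explicit(frag));    /* dedicated flag: may spin */
	assert(!allow_spinning_explicit(nolock)); /* only the opt-in cannot */
}
```

In this model the page frag refill can drop both reclaim bits without
being mistaken for a nolock-style caller, and the other allocators do not
need to guess the caller's intent from the rest of the mask.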



