Message-ID: <927bcdf7-1283-4ddd-bd5e-d2e399b26f7d@suse.cz>
Date: Mon, 13 Oct 2025 20:30:13 +0200
From: Vlastimil Babka <vbabka@...e.cz>
To: Barry Song <21cnbao@...il.com>, netdev@...r.kernel.org,
 linux-mm@...ck.org, linux-doc@...r.kernel.org
Cc: linux-kernel@...r.kernel.org, Barry Song <v-songbaohua@...o.com>,
 Jonathan Corbet <corbet@....net>, Eric Dumazet <edumazet@...gle.com>,
 Kuniyuki Iwashima <kuniyu@...gle.com>, Paolo Abeni <pabeni@...hat.com>,
 Willem de Bruijn <willemb@...gle.com>, "David S. Miller"
 <davem@...emloft.net>, Jakub Kicinski <kuba@...nel.org>,
 Simon Horman <horms@...nel.org>, Suren Baghdasaryan <surenb@...gle.com>,
 Michal Hocko <mhocko@...e.com>, Brendan Jackman <jackmanb@...gle.com>,
 Johannes Weiner <hannes@...xchg.org>, Zi Yan <ziy@...dia.com>,
 Yunsheng Lin <linyunsheng@...wei.com>, Huacai Zhou <zhouhuacai@...o.com>,
 Alexei Starovoitov <alexei.starovoitov@...il.com>,
 Harry Yoo <harry.yoo@...cle.com>, David Hildenbrand <david@...hat.com>,
 Matthew Wilcox <willy@...radead.org>,
 Roman Gushchin <roman.gushchin@...ux.dev>
Subject: Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer
 allocation

On 10/13/25 12:16, Barry Song wrote:
> From: Barry Song <v-songbaohua@...o.com>
> 
> On phones, we have observed significant phone heating when running apps
> with high network bandwidth. This is caused by the network stack frequently
> waking kswapd for order-3 allocations. As a result, memory reclamation becomes
> constantly active, even though plenty of memory is still available for network
> allocations which can fall back to order-0.
> 
> Commit ce27ec60648d ("net: add high_order_alloc_disable sysctl/static key")
> introduced high_order_alloc_disable for the transmit (TX) path
> (skb_page_frag_refill()) to mitigate some memory reclamation issues,
> allowing the TX path to fall back to order-0 immediately, while leaving the
> receive (RX) path (__page_frag_cache_refill()) unaffected. Users are
> generally unaware of the sysctl and cannot easily adjust it for specific use
> cases. Enabling high_order_alloc_disable also completely disables the
> benefit of order-3 allocations. Additionally, the sysctl does not apply to the
> RX path.
> 
> An alternative approach is to disable kswapd for these frequent
> allocations and provide best-effort order-3 service for both TX and RX paths,
> while removing the sysctl entirely.
> 
> Cc: Jonathan Corbet <corbet@....net>
> Cc: Eric Dumazet <edumazet@...gle.com>
> Cc: Kuniyuki Iwashima <kuniyu@...gle.com>
> Cc: Paolo Abeni <pabeni@...hat.com>
> Cc: Willem de Bruijn <willemb@...gle.com>
> Cc: "David S. Miller" <davem@...emloft.net>
> Cc: Jakub Kicinski <kuba@...nel.org>
> Cc: Simon Horman <horms@...nel.org>
> Cc: Vlastimil Babka <vbabka@...e.cz>
> Cc: Suren Baghdasaryan <surenb@...gle.com>
> Cc: Michal Hocko <mhocko@...e.com>
> Cc: Brendan Jackman <jackmanb@...gle.com>
> Cc: Johannes Weiner <hannes@...xchg.org>
> Cc: Zi Yan <ziy@...dia.com>
> Cc: Yunsheng Lin <linyunsheng@...wei.com>
> Cc: Huacai Zhou <zhouhuacai@...o.com>
> Signed-off-by: Barry Song <v-songbaohua@...o.com>
> ---
>  Documentation/admin-guide/sysctl/net.rst | 12 ------------
>  include/net/sock.h                       |  1 -
>  mm/page_frag_cache.c                     |  2 +-
>  net/core/sock.c                          |  8 ++------
>  net/core/sysctl_net_core.c               |  7 -------
>  5 files changed, 3 insertions(+), 27 deletions(-)
> 
> diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst
> index 2ef50828aff1..b903bbae239c 100644
> --- a/Documentation/admin-guide/sysctl/net.rst
> +++ b/Documentation/admin-guide/sysctl/net.rst
> @@ -415,18 +415,6 @@ GRO has decided not to coalesce, it is placed on a per-NAPI list. This
>  list is then passed to the stack when the number of segments reaches the
>  gro_normal_batch limit.
>  
> -high_order_alloc_disable
> -------------------------
> -
> -By default the allocator for page frags tries to use high order pages (order-3
> -on x86). While the default behavior gives good results in most cases, some users
> -might have hit a contention in page allocations/freeing. This was especially
> -true on older kernels (< 5.14) when high-order pages were not stored on per-cpu
> -lists. This allows to opt-in for order-0 allocation instead but is now mostly of
> -historical importance.
> -
> -Default: 0
> -
>  2. /proc/sys/net/unix - Parameters for Unix domain sockets
>  ----------------------------------------------------------
>  
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 60bcb13f045c..62306c1095d5 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -3011,7 +3011,6 @@ extern __u32 sysctl_wmem_default;
>  extern __u32 sysctl_rmem_default;
>  
>  #define SKB_FRAG_PAGE_ORDER	get_order(32768)
> -DECLARE_STATIC_KEY_FALSE(net_high_order_alloc_disable_key);
>  
>  static inline int sk_get_wmem0(const struct sock *sk, const struct proto *proto)
>  {
> diff --git a/mm/page_frag_cache.c b/mm/page_frag_cache.c
> index d2423f30577e..dd36114dd16f 100644
> --- a/mm/page_frag_cache.c
> +++ b/mm/page_frag_cache.c
> @@ -54,7 +54,7 @@ static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
>  	gfp_t gfp = gfp_mask;
>  
>  #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
> -	gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) |  __GFP_COMP |
> +	gfp_mask = (gfp_mask & ~__GFP_RECLAIM) |  __GFP_COMP |
>  		   __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;

I'm a bit worried about proliferating "~__GFP_RECLAIM" allocations now that
we introduced alloc_pages_nolock() and kmalloc_nolock() where it's
interpreted as "cannot spin" - see gfpflags_allow_spinning(). Currently it's
fine for the page allocator itself where we have a different entry point
that uses ALLOC_TRYLOCK, but it can affect nested allocations of all kinds
of debugging and accounting metadata (page_owner, memcg, alloc tags for slab
objects etc). kmalloc_nolock() relies on gfpflags_allow_spinning() fully.
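
To make the concern concrete, here is a minimal userspace sketch of how the two masks differ under a gfpflags_allow_spinning()-style check. The DEMO_ names, the bit values and the simplified GFP_KERNEL/GFP_ATOMIC stand-ins are assumptions for illustration only, not the kernel's definitions; the only thing taken from the discussion above is that a mask with no reclaim bit set is read as "cannot spin".

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int demo_gfp_t;

/* Illustrative bit values, NOT the kernel's real definitions. */
#define DEMO_GFP_DIRECT_RECLAIM 0x1u
#define DEMO_GFP_KSWAPD_RECLAIM 0x2u
#define DEMO_GFP_HIGH           0x4u
#define DEMO_GFP_RECLAIM (DEMO_GFP_DIRECT_RECLAIM | DEMO_GFP_KSWAPD_RECLAIM)

/* Simplified stand-ins for GFP_KERNEL and GFP_ATOMIC (assumption). */
#define DEMO_GFP_KERNEL DEMO_GFP_RECLAIM
#define DEMO_GFP_ATOMIC (DEMO_GFP_HIGH | DEMO_GFP_KSWAPD_RECLAIM)

/* No reclaim bit at all is interpreted as "cannot spin". */
static inline bool demo_allow_spinning(demo_gfp_t gfp)
{
	return (gfp & DEMO_GFP_RECLAIM) != 0;
}

/* Mask used before the patch: only direct reclaim is stripped. */
static inline demo_gfp_t demo_mask_before_patch(demo_gfp_t gfp)
{
	return gfp & ~DEMO_GFP_DIRECT_RECLAIM;
}

/* Mask proposed by the patch: both reclaim bits are stripped. */
static inline demo_gfp_t demo_mask_after_patch(demo_gfp_t gfp)
{
	return gfp & ~DEMO_GFP_RECLAIM;
}

/* Walks through the GFP_KERNEL-like case with both masks. */
static inline void demo_self_check(void)
{
	/* The kswapd bit survives, so the allocation still reads as "may spin". */
	assert(demo_allow_spinning(demo_mask_before_patch(DEMO_GFP_KERNEL)));
	/* Both bits gone: indistinguishable from a true "cannot spin" caller. */
	assert(!demo_allow_spinning(demo_mask_after_patch(DEMO_GFP_KERNEL)));
}
```

With the proposed mask, an allocation that is perfectly able to spin becomes indistinguishable from a nolock-style caller, which is what can mislead nested metadata allocations.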

I wonder if we should either:

1) sacrifice a new __GFP flag specifically for the "!allow_spin" case to
determine it precisely.

2) keep __GFP_KSWAPD_RECLAIM for allocations that currently remove it only to
avoid causing disturbance (as proposed here), but that can in fact allow
spinning. Instead, decide not to wake up kswapd for those allocations when
other information indicates that the allocation is opportunistic
(~__GFP_DIRECT_RECLAIM, __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC,
order > 0...)

3) something better?
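
As a rough illustration of option 2, the following userspace sketch keeps the kswapd bit in the mask (so a spinning check still passes) and decides separately whether to wake kswapd. Everything here is hypothetical: the DEMO_ names, the bit values, the helper functions, and in particular the exact combination of hints treated as "opportunistic" (order > 0, no direct reclaim, NOWARN and NORETRY both set) are one possible reading of the list above, not existing kernel code.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int demo_gfp_t;

/* Illustrative bit values, NOT the kernel's real definitions. */
#define DEMO_GFP_DIRECT_RECLAIM 0x01u
#define DEMO_GFP_KSWAPD_RECLAIM 0x02u
#define DEMO_GFP_NOWARN         0x04u
#define DEMO_GFP_NORETRY        0x08u
#define DEMO_GFP_RECLAIM (DEMO_GFP_DIRECT_RECLAIM | DEMO_GFP_KSWAPD_RECLAIM)

/* Any reclaim bit set means the caller is allowed to spin. */
static inline bool demo_allow_spinning(demo_gfp_t gfp)
{
	return (gfp & DEMO_GFP_RECLAIM) != 0;
}

/*
 * Hypothetical heuristic: a high-order request that forbids direct
 * reclaim and asks for quiet, no-retry failure is treated as a
 * best-effort attempt with a cheaper fallback available.
 */
static inline bool demo_is_opportunistic(demo_gfp_t gfp, unsigned int order)
{
	const demo_gfp_t hints = DEMO_GFP_NOWARN | DEMO_GFP_NORETRY;

	return order > 0 &&
	       !(gfp & DEMO_GFP_DIRECT_RECLAIM) &&
	       (gfp & hints) == hints;
}

/* Wake kswapd only if the caller permits it and the request is not opportunistic. */
static inline bool demo_should_wake_kswapd(demo_gfp_t gfp, unsigned int order)
{
	return (gfp & DEMO_GFP_KSWAPD_RECLAIM) != 0 &&
	       !demo_is_opportunistic(gfp, order);
}
```

Under these assumptions, an order-3 page-frag style request with `DEMO_GFP_KSWAPD_RECLAIM | DEMO_GFP_NOWARN | DEMO_GFP_NORETRY` still passes the spinning check but does not wake kswapd, while an order-0 request with both reclaim bits wakes it as usual.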

Vlastimil

>  	page = __alloc_pages(gfp_mask, PAGE_FRAG_CACHE_MAX_ORDER,
>  			     numa_mem_id(), NULL);
> diff --git a/net/core/sock.c b/net/core/sock.c
> index dc03d4b5909a..1fa1e9177d86 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -3085,8 +3085,6 @@ static void sk_leave_memory_pressure(struct sock *sk)
>  	}
>  }
>  
> -DEFINE_STATIC_KEY_FALSE(net_high_order_alloc_disable_key);
> -
>  /**
>   * skb_page_frag_refill - check that a page_frag contains enough room
>   * @sz: minimum size of the fragment we want to get
> @@ -3110,10 +3108,8 @@ bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t gfp)
>  	}
>  
>  	pfrag->offset = 0;
> -	if (SKB_FRAG_PAGE_ORDER &&
> -	    !static_branch_unlikely(&net_high_order_alloc_disable_key)) {
> -		/* Avoid direct reclaim but allow kswapd to wake */
> -		pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
> +	if (SKB_FRAG_PAGE_ORDER) {
> +		pfrag->page = alloc_pages((gfp & ~__GFP_RECLAIM) |
>  					  __GFP_COMP | __GFP_NOWARN |
>  					  __GFP_NORETRY,
>  					  SKB_FRAG_PAGE_ORDER);
> diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
> index 8cf04b57ade1..181f6532beb8 100644
> --- a/net/core/sysctl_net_core.c
> +++ b/net/core/sysctl_net_core.c
> @@ -599,13 +599,6 @@ static struct ctl_table net_core_table[] = {
>  		.extra1		= SYSCTL_ZERO,
>  		.extra2		= SYSCTL_THREE,
>  	},
> -	{
> -		.procname	= "high_order_alloc_disable",
> -		.data		= &net_high_order_alloc_disable_key.key,
> -		.maxlen         = sizeof(net_high_order_alloc_disable_key),
> -		.mode		= 0644,
> -		.proc_handler	= proc_do_static_key,
> -	},
>  	{
>  		.procname	= "gro_normal_batch",
>  		.data		= &net_hotdata.gro_normal_batch,
