Message-ID: <CAADnVQKEhfFTSkn6f_PJr6xMcjB4d45E_+TsU6+945f2XD1SmA@mail.gmail.com>
Date: Mon, 13 Oct 2025 14:53:17 -0700
From: Alexei Starovoitov <alexei.starovoitov@...il.com>
To: Shakeel Butt <shakeel.butt@...ux.dev>
Cc: Vlastimil Babka <vbabka@...e.cz>, Barry Song <21cnbao@...il.com>, 
	Network Development <netdev@...r.kernel.org>, linux-mm <linux-mm@...ck.org>, 
	"open list:DOCUMENTATION" <linux-doc@...r.kernel.org>, LKML <linux-kernel@...r.kernel.org>, 
	Barry Song <v-songbaohua@...o.com>, Jonathan Corbet <corbet@....net>, 
	Eric Dumazet <edumazet@...gle.com>, Kuniyuki Iwashima <kuniyu@...gle.com>, Paolo Abeni <pabeni@...hat.com>, 
	Willem de Bruijn <willemb@...gle.com>, "David S. Miller" <davem@...emloft.net>, 
	Jakub Kicinski <kuba@...nel.org>, Simon Horman <horms@...nel.org>, 
	Suren Baghdasaryan <surenb@...gle.com>, Michal Hocko <mhocko@...e.com>, 
	Brendan Jackman <jackmanb@...gle.com>, Johannes Weiner <hannes@...xchg.org>, Zi Yan <ziy@...dia.com>, 
	Yunsheng Lin <linyunsheng@...wei.com>, Huacai Zhou <zhouhuacai@...o.com>, 
	Harry Yoo <harry.yoo@...cle.com>, David Hildenbrand <david@...hat.com>, 
	Matthew Wilcox <willy@...radead.org>, Roman Gushchin <roman.gushchin@...ux.dev>
Subject: Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation

On Mon, Oct 13, 2025 at 2:35 PM Shakeel Butt <shakeel.butt@...ux.dev> wrote:
>
> On Mon, Oct 13, 2025 at 08:30:13PM +0200, Vlastimil Babka wrote:
> > On 10/13/25 12:16, Barry Song wrote:
> > > From: Barry Song <v-songbaohua@...o.com>
> > >
> > > On phones, we have observed significant phone heating when running apps
> > > with high network bandwidth. This is caused by the network stack frequently
> > > waking kswapd for order-3 allocations. As a result, memory reclamation becomes
> > > constantly active, even though plenty of memory is still available for network
> > > allocations which can fall back to order-0.
> > >
> > > Commit ce27ec60648d ("net: add high_order_alloc_disable sysctl/static key")
> > > introduced high_order_alloc_disable for the transmit (TX) path
> > > (skb_page_frag_refill()) to mitigate some memory reclamation issues,
> > > allowing the TX path to fall back to order-0 immediately, while leaving the
> > > receive (RX) path (__page_frag_cache_refill()) unaffected. Users are
> > > generally unaware of the sysctl and cannot easily adjust it for specific use
> > > cases. Enabling high_order_alloc_disable also completely disables the
> > > benefit of order-3 allocations. Additionally, the sysctl does not apply to the
> > > RX path.
> > >
> > > An alternative approach is to disable kswapd for these frequent
> > > allocations and provide best-effort order-3 service for both TX and RX paths,
> > > while removing the sysctl entirely.
> > >
> > > Cc: Jonathan Corbet <corbet@....net>
> > > Cc: Eric Dumazet <edumazet@...gle.com>
> > > Cc: Kuniyuki Iwashima <kuniyu@...gle.com>
> > > Cc: Paolo Abeni <pabeni@...hat.com>
> > > Cc: Willem de Bruijn <willemb@...gle.com>
> > > Cc: "David S. Miller" <davem@...emloft.net>
> > > Cc: Jakub Kicinski <kuba@...nel.org>
> > > Cc: Simon Horman <horms@...nel.org>
> > > Cc: Vlastimil Babka <vbabka@...e.cz>
> > > Cc: Suren Baghdasaryan <surenb@...gle.com>
> > > Cc: Michal Hocko <mhocko@...e.com>
> > > Cc: Brendan Jackman <jackmanb@...gle.com>
> > > Cc: Johannes Weiner <hannes@...xchg.org>
> > > Cc: Zi Yan <ziy@...dia.com>
> > > Cc: Yunsheng Lin <linyunsheng@...wei.com>
> > > Cc: Huacai Zhou <zhouhuacai@...o.com>
> > > Signed-off-by: Barry Song <v-songbaohua@...o.com>
> > > ---
> > >  Documentation/admin-guide/sysctl/net.rst | 12 ------------
> > >  include/net/sock.h                       |  1 -
> > >  mm/page_frag_cache.c                     |  2 +-
> > >  net/core/sock.c                          |  8 ++------
> > >  net/core/sysctl_net_core.c               |  7 -------
> > >  5 files changed, 3 insertions(+), 27 deletions(-)
> > >
> > > diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst
> > > index 2ef50828aff1..b903bbae239c 100644
> > > --- a/Documentation/admin-guide/sysctl/net.rst
> > > +++ b/Documentation/admin-guide/sysctl/net.rst
> > > @@ -415,18 +415,6 @@ GRO has decided not to coalesce, it is placed on a per-NAPI list. This
> > >  list is then passed to the stack when the number of segments reaches the
> > >  gro_normal_batch limit.
> > >
> > > -high_order_alloc_disable
> > > -------------------------
> > > -
> > > -By default the allocator for page frags tries to use high order pages (order-3
> > > -on x86). While the default behavior gives good results in most cases, some users
> > > -might have hit a contention in page allocations/freeing. This was especially
> > > -true on older kernels (< 5.14) when high-order pages were not stored on per-cpu
> > > -lists. This allows to opt-in for order-0 allocation instead but is now mostly of
> > > -historical importance.
> > > -
> > > -Default: 0
> > > -
> > >  2. /proc/sys/net/unix - Parameters for Unix domain sockets
> > >  ----------------------------------------------------------
> > >
> > > diff --git a/include/net/sock.h b/include/net/sock.h
> > > index 60bcb13f045c..62306c1095d5 100644
> > > --- a/include/net/sock.h
> > > +++ b/include/net/sock.h
> > > @@ -3011,7 +3011,6 @@ extern __u32 sysctl_wmem_default;
> > >  extern __u32 sysctl_rmem_default;
> > >
> > >  #define SKB_FRAG_PAGE_ORDER        get_order(32768)
> > > -DECLARE_STATIC_KEY_FALSE(net_high_order_alloc_disable_key);
> > >
> > >  static inline int sk_get_wmem0(const struct sock *sk, const struct proto *proto)
> > >  {
> > > diff --git a/mm/page_frag_cache.c b/mm/page_frag_cache.c
> > > index d2423f30577e..dd36114dd16f 100644
> > > --- a/mm/page_frag_cache.c
> > > +++ b/mm/page_frag_cache.c
> > > @@ -54,7 +54,7 @@ static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
> > >     gfp_t gfp = gfp_mask;
> > >
> > >  #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
> > > -   gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) |  __GFP_COMP |
> > > +   gfp_mask = (gfp_mask & ~__GFP_RECLAIM) |  __GFP_COMP |
> > >                __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
> >
> > I'm a bit worried about proliferating "~__GFP_RECLAIM" allocations now that
> > we introduced alloc_pages_nolock() and kmalloc_nolock() where it's
> > interpreted as "cannot spin" - see gfpflags_allow_spinning(). Currently it's
> > fine for the page allocator itself where we have a different entry point
> > that uses ALLOC_TRYLOCK, but it can affect nested allocations of all kinds
> > of debugging and accounting metadata (page_owner, memcg, alloc tags for slab
> > objects etc). kmalloc_nolock() relies on gfpflags_allow_spinning() fully.
> >
> > I wonder if we should either:
> >
> > 1) sacrifice a new __GFP flag specifically for "!allow_spin" case to
> > determine it precisely.
> >
> > 2) keep __GFP_KSWAPD_RECLAIM for allocations that remove it for purposes of
> > not being disturbing (like proposed here), but that can in fact allow
> > spinning. Instead, decide to not wake up kswapd by those when other
> > information indicates it's an opportunistic allocation
> > (~__GFP_DIRECT_RECLAIM, __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC,
> > order > 0...)
> >
> > 3) something better?
> >
>
> For the !allow_spin allocations, I think we should just add a new __GFP
> flag instead of adding more complexity to other allocators which may or
> may not want kswapd wakeup for many different reasons.

That's what I proposed long ago, but I was convinced then that a new flag
would add more complexity. It looks like we have walked this road far enough
that a new flag will actually make things simpler.
Back then I proposed __GFP_TRYLOCK, which is not a good name.
How about __GFP_NOLOCK or __GFP_NOSPIN?
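
To make the ambiguity concrete, here is a minimal userspace sketch, not
kernel code. The MODEL_* bit values are made up for illustration (the real
flags live in include/linux/gfp_types.h), MODEL_GFP_NOSPIN stands in for
whichever name gets picked, and allow_spinning_today() only mirrors my
understanding that gfpflags_allow_spinning() currently tests __GFP_RECLAIM.
page_frag_mask() models just the "& ~__GFP_RECLAIM" part of the RFC hunk.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int gfp_t;

/* Illustrative bit values only; not the kernel's actual encoding. */
#define MODEL_GFP_DIRECT_RECLAIM 0x1u
#define MODEL_GFP_KSWAPD_RECLAIM 0x2u
#define MODEL_GFP_RECLAIM (MODEL_GFP_DIRECT_RECLAIM | MODEL_GFP_KSWAPD_RECLAIM)
#define MODEL_GFP_NOSPIN 0x4u /* hypothetical dedicated flag */

/* Current heuristic (assumed): no reclaim bits means "cannot spin". */
bool allow_spinning_today(gfp_t gfp)
{
	return (gfp & MODEL_GFP_RECLAIM) != 0;
}

/* With a dedicated flag, reclaim bits no longer carry a second meaning. */
bool allow_spinning_with_flag(gfp_t gfp)
{
	return (gfp & MODEL_GFP_NOSPIN) == 0;
}

/* Models only the reclaim-clearing part of the RFC's page frag change. */
gfp_t page_frag_mask(gfp_t gfp_mask)
{
	return gfp_mask & ~MODEL_GFP_RECLAIM;
}

void model_selfcheck(void)
{
	/* A caller that may spin but does not want to wake kswapd. */
	gfp_t frag = page_frag_mask(MODEL_GFP_KSWAPD_RECLAIM);

	/* Today it is indistinguishable from a nolock-style caller. */
	assert(!allow_spinning_today(frag));

	/* With the flag, only explicit no-spin callers are treated as such. */
	assert(allow_spinning_with_flag(frag));
	assert(!allow_spinning_with_flag(MODEL_GFP_NOSPIN));
}
```

In this model the opportunistic page frag allocation stops looking like a
kmalloc_nolock()-style request once "cannot spin" has its own bit, so nested
metadata allocations would not have to guess from the reclaim bits.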
