Message-ID: <CANn89iKCZyYi+J=5t2sdmvtERnknkwXrGi4QRzM9btYUywkDfw@mail.gmail.com>
Date: Mon, 13 Oct 2025 22:07:55 -0700
From: Eric Dumazet <edumazet@...gle.com>
To: Barry Song <21cnbao@...il.com>
Cc: corbet@....net, davem@...emloft.net, hannes@...xchg.org, horms@...nel.org, 
	jackmanb@...gle.com, kuba@...nel.org, kuniyu@...gle.com, 
	linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org, linux-mm@...ck.org, 
	linyunsheng@...wei.com, mhocko@...e.com, netdev@...r.kernel.org, 
	pabeni@...hat.com, surenb@...gle.com, v-songbaohua@...o.com, vbabka@...e.cz, 
	willemb@...gle.com, zhouhuacai@...o.com, ziy@...dia.com
Subject: Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation

On Mon, Oct 13, 2025 at 8:58 PM Barry Song <21cnbao@...il.com> wrote:
>
> > >
> > > diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst
> > > index 2ef50828aff1..b903bbae239c 100644
> > > --- a/Documentation/admin-guide/sysctl/net.rst
> > > +++ b/Documentation/admin-guide/sysctl/net.rst
> > > @@ -415,18 +415,6 @@ GRO has decided not to coalesce, it is placed on a per-NAPI list. This
> > >  list is then passed to the stack when the number of segments reaches the
> > >  gro_normal_batch limit.
> > >
> > > -high_order_alloc_disable
> > > -------------------------
> > > -
> > > -By default the allocator for page frags tries to use high order pages (order-3
> > > -on x86). While the default behavior gives good results in most cases, some users
> > > -might have hit a contention in page allocations/freeing. This was especially
> > > -true on older kernels (< 5.14) when high-order pages were not stored on per-cpu
> > > -lists. This allows to opt-in for order-0 allocation instead but is now mostly of
> > > -historical importance.
> > > -
> >
> > The sysctl is quite useful for testing purposes, say on a freshly
> > booted host, with plenty of free memory.
> >
> > Also, having order-3 pages if possible is quite important for IOMMU use cases.
> >
> > Perhaps kswapd should have some kind of heuristic to not start if a
> > recent run has already happened.
>
> I don’t understand why it shouldn’t start when users continuously request
> order-3 allocations and ask kswapd to prepare order-3 memory — it doesn’t
> make sense logically to skip it just because earlier requests were already
> satisfied.
>
> >
> > I am guessing phones do not need to send 1.6 Tbit per second on
> > network devices (yet), so an option could be to disable it in your
> > boot scripts.
>
> A problem with the existing sysctl is that it only covers the TX path;
> for the RX path, we also observe that kswapd consumes significant power.
> I could add the patch below to make it support the RX path, but it feels
> like a bit of a layer violation, since the RX path code resides in mm
> and is intended to serve generic users rather than networking, even
> though the current callers are primarily network-related.

You might have a buggy driver.

High performance drivers use order-0 allocations only.



>
> diff --git a/mm/page_frag_cache.c b/mm/page_frag_cache.c
> index d2423f30577e..8ad18ec49f39 100644
> --- a/mm/page_frag_cache.c
> +++ b/mm/page_frag_cache.c
> @@ -18,6 +18,7 @@
>  #include <linux/init.h>
>  #include <linux/mm.h>
>  #include <linux/page_frag_cache.h>
> +#include <net/sock.h>
>  #include "internal.h"
>
>  static unsigned long encoded_page_create(struct page *page, unsigned int order,
> @@ -54,10 +55,12 @@ static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
>         gfp_t gfp = gfp_mask;
>
>  #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
> -       gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) |  __GFP_COMP |
> -                  __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
> -       page = __alloc_pages(gfp_mask, PAGE_FRAG_CACHE_MAX_ORDER,
> -                            numa_mem_id(), NULL);
> +       if (!static_branch_unlikely(&net_high_order_alloc_disable_key)) {
> +               gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) |  __GFP_COMP |
> +                       __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
> +               page = __alloc_pages(gfp_mask, PAGE_FRAG_CACHE_MAX_ORDER,
> +                               numa_mem_id(), NULL);
> +       }
>  #endif
>         if (unlikely(!page)) {
>
>
> Do you have a better idea on how to make the sysctl also cover the RX path?
>
> Thanks
> Barry
>
