Message-ID: <CANn89iKSW-kk-h-B0f1oijwYiCWYOAO0jDrf+Z+fbOfAMJMUbA@mail.gmail.com>
Date: Tue, 14 Oct 2025 02:49:03 -0700
From: Eric Dumazet <edumazet@...gle.com>
To: Barry Song <21cnbao@...il.com>
Cc: Matthew Wilcox <willy@...radead.org>, netdev@...r.kernel.org, linux-mm@...ck.org, 
	linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org, 
	Barry Song <v-songbaohua@...o.com>, Jonathan Corbet <corbet@....net>, 
	Kuniyuki Iwashima <kuniyu@...gle.com>, Paolo Abeni <pabeni@...hat.com>, 
	Willem de Bruijn <willemb@...gle.com>, "David S. Miller" <davem@...emloft.net>, 
	Jakub Kicinski <kuba@...nel.org>, Simon Horman <horms@...nel.org>, Vlastimil Babka <vbabka@...e.cz>, 
	Suren Baghdasaryan <surenb@...gle.com>, Michal Hocko <mhocko@...e.com>, 
	Brendan Jackman <jackmanb@...gle.com>, Johannes Weiner <hannes@...xchg.org>, Zi Yan <ziy@...dia.com>, 
	Yunsheng Lin <linyunsheng@...wei.com>, Huacai Zhou <zhouhuacai@...o.com>
Subject: Re: [RFC PATCH] mm: net: disable kswapd for high-order network buffer allocation

On Tue, Oct 14, 2025 at 1:58 AM Barry Song <21cnbao@...il.com> wrote:
>
> On Tue, Oct 14, 2025 at 1:04 PM Eric Dumazet <edumazet@...gle.com> wrote:
> >
> > On Mon, Oct 13, 2025 at 9:09 PM Barry Song <21cnbao@...il.com> wrote:
> > >
> > > On Tue, Oct 14, 2025 at 5:56 AM Matthew Wilcox <willy@...radead.org> wrote:
> > > >
> > > > On Mon, Oct 13, 2025 at 06:16:36PM +0800, Barry Song wrote:
> > > > > On phones, we have observed significant phone heating when running apps
> > > > > with high network bandwidth. This is caused by the network stack frequently
> > > > > waking kswapd for order-3 allocations. As a result, memory reclamation becomes
> > > > > constantly active, even though plenty of memory is still available for network
> > > > > allocations which can fall back to order-0.
> > > >
> > > > I think we need to understand what's going on here a whole lot more than
> > > > this!
> > > >
> > > > So, we try to do an order-3 allocation.  kswapd runs and ... succeeds in
> > > > creating order-3 pages?  Or fails to?
> > > >
> > >
> > > Our team observed that most of the time we successfully obtain order-3
> > > memory, but the cost is excessive memory reclamation, since we end up
> > > over-reclaiming order-0 pages that could have remained in memory.
> > >
> > > > If it fails, that's something we need to sort out.
> > > >
> > > > If it succeeds, now we have several order-3 pages, great.  But where do
> > > > they all go that we need to run kswapd again?
> > >
> > > The network app keeps running and continues to issue new order-3 allocation
> > > requests, so those few order-3 pages won’t be enough to satisfy the
> > > continuous demand.
> >
> > These pages are freed as order-3 pages, and should replenish the buddy
> > as if nothing happened.
>
> Ideally, that would be the case if the workload were simple. However, the
> system may have many other processes and kernel drivers running
> simultaneously, also consuming memory from the buddy allocator and possibly
> taking the replenished pages. As a result, we can still observe multiple
> kswapd wakeups and instances of over-reclamation caused by the network
> stack’s high-order allocations.
>
> >
> > I think you are missing something to control how much memory can be
> > pushed on each TCP socket?
> >
> > What is tcp_wmem on your phones? What about tcp_mem?
> >
> > Have you looked at /proc/sys/net/ipv4/tcp_notsent_lowat?
>
> # cat /proc/sys/net/ipv4/tcp_wmem
> 524288  1048576 6710886

Ouch. That is an insane tcp_wmem[0].

Please stick to 4096, or risk OOM of various sorts.
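
As a rough illustration of the check being made here, the sketch below parses the three tcp_wmem fields (min, default, max, in bytes) and flags a minimum above the 4096 suggested in this reply. The helper names (TcpWmem, parse_tcp_wmem, min_is_oversized) are hypothetical and exist only for this example; the input string is the value reported earlier in the thread.

```python
from typing import NamedTuple


class TcpWmem(NamedTuple):
    """The three fields of net.ipv4.tcp_wmem, in bytes."""
    min_bytes: int      # tcp_wmem[0]
    default_bytes: int  # tcp_wmem[1]
    max_bytes: int      # tcp_wmem[2]


# Value recommended in this thread for tcp_wmem[0].
SUGGESTED_MIN = 4096


def parse_tcp_wmem(text: str) -> TcpWmem:
    """Parse the whitespace-separated contents of the tcp_wmem sysctl."""
    fields = text.split()
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}: {text!r}")
    return TcpWmem(*(int(f) for f in fields))


def min_is_oversized(wmem: TcpWmem, suggested: int = SUGGESTED_MIN) -> bool:
    """Return True when tcp_wmem[0] exceeds the suggested minimum."""
    return wmem.min_bytes > suggested


# The setting reported from the phones in this thread.
reported = parse_tcp_wmem("524288  1048576 6710886")
print(reported, "oversized min:", min_is_oversized(reported))
```

On a live Linux system the same string could be read from /proc/sys/net/ipv4/tcp_wmem instead of being hard-coded.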

>
> # cat /proc/sys/net/ipv4/tcp_notsent_lowat
> 4294967295
>
> Any thoughts on these settings?

Please look at
https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt

tcp_notsent_lowat - UNSIGNED INTEGER
	A TCP socket can control the amount of unsent bytes in its write queue,
	thanks to TCP_NOTSENT_LOWAT socket option. poll()/select()/epoll()
	reports POLLOUT events if the amount of unsent bytes is below a per
	socket value, and if the write queue is not full. sendmsg() will
	also not add new buffers if the limit is hit.

	This global variable controls the amount of unsent data for
	sockets not using TCP_NOTSENT_LOWAT. For these sockets, a change
	to the global variable has immediate effect.
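
The per-socket option described in the excerpt above can be set from user space. The following is a minimal sketch, not taken from the thread: the helper name set_notsent_lowat and the 128 KiB limit are hypothetical, and the fallback constant 25 is the Linux value of TCP_NOTSENT_LOWAT, used only if the Python build does not export it. The 4294967295 shown for the sysctl earlier is UINT_MAX, which in effect leaves the limit disabled for sockets that do not set the option themselves.

```python
import socket

# Assumption: if this Python build does not export the constant,
# fall back to the Linux value of TCP_NOTSENT_LOWAT (25).
TCP_NOTSENT_LOWAT = getattr(socket, "TCP_NOTSENT_LOWAT", 25)


def set_notsent_lowat(sock: socket.socket, limit_bytes: int) -> int:
    """Cap the unsent bytes in this socket's write queue.

    Returns the value the kernel reports back for the option.
    """
    sock.setsockopt(socket.IPPROTO_TCP, TCP_NOTSENT_LOWAT, limit_bytes)
    return sock.getsockopt(socket.IPPROTO_TCP, TCP_NOTSENT_LOWAT)


if __name__ == "__main__":
    # Hypothetical limit of 128 KiB, chosen only for illustration.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        print("notsent_lowat now:", set_notsent_lowat(s, 128 * 1024))
```

A socket configured this way is governed by its own limit rather than by the global sysctl, which per the excerpt only applies to sockets not using TCP_NOTSENT_LOWAT.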


Setting this sysctl to 2MB can effectively reduce the amount of memory
in TCP write queues by 66%, or allow you to increase tcp_wmem[2] so that
only flows needing a big BDP can get it.
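
One rough way to see where a figure of that size could come from is to compare a 2MB cap with the tcp_wmem[2] value reported earlier in the thread. This back-of-envelope sketch is an assumption about the reasoning, not something stated in the message: it treats 2MB as 2 MiB, assumes a socket would otherwise fill its write queue up to tcp_wmem[2], and ignores data that has been sent but not yet acknowledged, which also occupies the write queue. The function name queue_reduction is hypothetical.

```python
# tcp_wmem[2] as reported from the phones in this thread, in bytes.
TCP_WMEM_MAX = 6710886

# The suggested tcp_notsent_lowat, assuming "2MB" means 2 MiB.
NOTSENT_LOWAT = 2 * 1024 * 1024


def queue_reduction(wmem_max: int, notsent_lowat: int) -> float:
    """Fractional drop in queued bytes if unsent data is capped.

    Simplification: compares the cap directly with wmem_max and
    ignores sent-but-unacknowledged data still held in the queue.
    """
    capped = min(notsent_lowat, wmem_max)
    return 1.0 - capped / wmem_max


print(f"estimated reduction: {queue_reduction(TCP_WMEM_MAX, NOTSENT_LOWAT):.1%}")
```

Under these assumptions the estimate comes out a little under 69%, in the same ballpark as the 66% quoted above. The real saving depends on how much in-flight data each connection holds.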
