Message-ID: <20251120122351.231513e1@xps15mal>
Date: Thu, 20 Nov 2025 12:23:51 +1000
From: Mal Haak <malcolm@...k.id.au>
To: linux-kernel@...r.kernel.org
Subject: Re: Possible memory leak in 6.17.7

On Mon, 10 Nov 2025 18:20:08 +1000
Mal Haak <malcolm@...k.id.au> wrote:

> Hello,
> 
> I have found a memory leak in 6.17.7, but I am unsure how to track it
> down effectively.
> 
> I am running a server (a VM) with a heavy read/write workload against
> a CephFS file system.
> 
> Over time, the non-cache usage of kernel dynamic memory appears to
> increase. The kernel seems to consider these pages reclaimable, but
> nothing appears to trigger the reclaim. As a result, workloads get
> killed by the OOM killer.
> 
> smem -wp output:
> 
> Area                           Used      Cache   Noncache 
> firmware/hardware             0.00%      0.00%      0.00% 
> kernel image                  0.00%      0.00%      0.00% 
> kernel dynamic memory        88.21%     36.25%     51.96% 
> userspace memory              9.49%      0.15%      9.34% 
> free memory                   2.30%      2.30%      0.00% 
> 
> free -h output:
> 
>        total  used   free   shared  buff/cache available 
> Mem:   31Gi   3.6Gi  500Mi  4.0Mi   11Gi      27Gi 
> Swap:  4.0Gi  179Mi  3.8Gi
> 
> Reverting to the previous LTS kernel fixes the issue:
> 
> smem -wp output:
> Area                           Used      Cache   Noncache 
> firmware/hardware             0.00%      0.00%      0.00% 
> kernel image                  0.00%      0.00%      0.00% 
> kernel dynamic memory        80.22%     79.32%      0.90% 
> userspace memory             10.48%      0.20%     10.28% 
> free memory                   9.30%      9.30%      0.00% 
> 
I have more information. The kernel memory leak only starts once there
is a lot of data in buffers/cache, and only after the system has been
in that state for several hours.
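To see which kernel counters hold the growing non-cache memory, one option is to read /proc/meminfo directly: its Cached, Slab, SReclaimable and SUnreclaim fields separate the page cache from slab memory the kernel reports as reclaimable or unreclaimable. The following is a minimal sketch; the function names are hypothetical, and it does not reproduce smem's own arithmetic for "kernel dynamic memory".

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into a {field: value} dict.

    Most values are in kB; a few (e.g. HugePages_Total) are bare counts.
    """
    fields: dict[str, int] = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        name, rest = line.split(":", 1)
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])
    return fields


def slab_summary(fields: dict[str, int]) -> dict[str, int]:
    """Select the counters relevant to 'reclaimable but never reclaimed'.

    Missing fields are reported as 0.
    """
    keys = ("MemTotal", "MemAvailable", "Cached", "Slab", "SReclaimable", "SUnreclaim")
    return {k: fields.get(k, 0) for k in keys}
```

On a live system, `slab_summary(parse_meminfo(open("/proc/meminfo").read()))` returns the current values in kB.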

While searching for a reproducer, I found that downloading and then
seeding multiple torrents of Linux distribution ISOs replicates the
issue. However, it only begins leaking at around the 6-9 hour mark.
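Because the leak only appears after 6-9 hours, it may help to sample a counter (for example SUnreclaim, or the noncache figure from smem) every few minutes and flag the onset automatically rather than by inspection. The sketch below is a simple hypothetical heuristic, not an established tool: it reports the first sample from which the value rises monotonically across `window` samples by at least `min_rise_kb`. Both parameters are arbitrary assumptions and would need tuning.

```python
from typing import Optional


def growth_onset(
    samples: list[tuple[float, int]], window: int, min_rise_kb: int
) -> Optional[float]:
    """Return the timestamp at which sustained growth begins, or None.

    samples: (timestamp, value_in_kB) pairs in chronological order.
    Growth is 'sustained' if the value never decreases over the next
    `window` samples and rises by at least `min_rise_kb` in total.
    """
    for i in range(len(samples) - window):
        t0, v0 = samples[i]
        _, v1 = samples[i + window]
        if v1 - v0 < min_rise_kb:
            continue
        # Require a non-decreasing run so a single spike is not flagged.
        if all(samples[j + 1][1] >= samples[j][1] for j in range(i, i + window)):
            return t0
    return None
```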

It does not appear to depend on CephFS, but I believe CephFS's use of
sockets makes the situation worse.

I cannot replicate it at all with the LTS kernel release, but the
current RC releases do appear to have this issue.

I was considering a kernel build with CONFIG_DEBUG_KMEMLEAK enabled,
and will do one if people think it would find the issue. However, as
the memory usage is still tracked to some extent and is clearly marked
as reclaimable, it feels more like something in the reclaim logic is
breaking.
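If a kmemleak build is tried, its debugfs interface is /sys/kernel/debug/kmemleak: writing `scan` triggers a scan, and reading the file lists suspected leaks, each starting with an `unreferenced object 0x... (size N):` line. kmemleak reports objects that no longer have any references pointing to them, so memory that is still tracked and accounted as reclaimable, as suspected here, may not show up at all. The sketch below uses hypothetical helper names; `read_kmemleak` needs root, a mounted debugfs and a kernel built with kmemleak.

```python
import re

# Header line that starts each kmemleak report entry.
_HEADER = re.compile(r"unreferenced object 0x[0-9a-f]+ \(size (\d+)\):")


def read_kmemleak(path: str = "/sys/kernel/debug/kmemleak") -> str:
    """Request a kmemleak scan, then return the current report text."""
    with open(path, "w") as f:
        f.write("scan")
    with open(path) as f:
        return f.read()


def summarize_kmemleak(report: str) -> tuple[int, int]:
    """Return (number of suspected leaks, total bytes) from a report."""
    sizes = [int(m.group(1)) for m in _HEADER.finditer(report)]
    return len(sizes), sum(sizes)
```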

I do wonder whether the issue is related to memory fragmentation, given
that it only happens after RAM is mostly consumed by cache, and even
then only after it has been that way for hours.
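One way to test the fragmentation theory is to watch /proc/buddyinfo, which lists, per node and zone, the number of free blocks at each order (an order-n block is 2^n contiguous pages). If the share of free memory held in higher-order blocks collapses around the time the leak starts, that would be consistent with fragmentation. The sketch below parses that file; the `high_order_fraction` metric and the choice of `min_order` are illustrative assumptions.

```python
def parse_buddyinfo(text: str) -> dict[str, list[int]]:
    """Map 'node<N>/<zone>' to the free-block count at each order."""
    zones: dict[str, list[int]] = {}
    for line in text.splitlines():
        # Expected form: "Node 0, zone   Normal   c0  c1  c2 ..."
        tokens = line.split()
        if len(tokens) < 5 or tokens[0] != "Node":
            continue
        node = tokens[1].rstrip(",")
        zones[f"node{node}/{tokens[3]}"] = [int(t) for t in tokens[4:]]
    return zones


def high_order_fraction(counts: list[int], min_order: int) -> float:
    """Fraction of free pages sitting in blocks of order >= min_order."""
    pages = [n * (1 << order) for order, n in enumerate(counts)]
    total = sum(pages)
    return sum(pages[min_order:]) / total if total else 0.0
```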

Regardless, I would appreciate some advice on how to narrow this down
faster than with git bisect: needing 9 hours just to confirm that the
issue reproduces makes bisecting painfully slow.
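For scale: git bisect roughly halves the candidate range with each test, so it needs about ceil(log2(N)) tests for N candidate commits. The sketch below multiplies that by the roughly 9 hours each reproduction attempt takes; the commit count in the example is hypothetical, not the actual size of the range between the LTS release and 6.17.

```python
import math


def bisect_cost(commits: int, hours_per_test: float) -> tuple[int, float]:
    """Approximate bisect steps for `commits` candidates and total hours."""
    steps = math.ceil(math.log2(commits)) if commits > 1 else 0
    return steps, steps * hours_per_test
```

For example, `bisect_cost(10000, 9)` gives `(14, 126.0)`: 14 steps and about 126 hours. Because the step count grows only logarithmically, shortening the time to reproduce would likely save more than narrowing the commit range.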

Thanks in advance

Mal Haak

