Message-ID: <20251206082336.6e04a1ac@xps15mal>
Date: Sat, 6 Dec 2025 08:23:36 +1000
From: Mal Haak <malcolm@...k.id.au>
To: linux-kernel@...r.kernel.org
Subject: Re: Possible memory leak in 6.17.7

I have a reproducer. It's slow but it works.

I kept rsync running for 2 days by moving 5TB of files.

smem -wp

Area                           Used      Cache   Noncache 
firmware/hardware             0.00%      0.00%      0.00% 
kernel image                  0.00%      0.00%      0.00% 
kernel dynamic memory        98.81%      1.69%     97.13% 
userspace memory              0.08%      0.05%      0.03% 
free memory                   1.11%      1.11%      0.00% 
[root@...neltest ~]# uname -a
Linux kerneltest 6.18.0-1-mainline #1 SMP PREEMPT_DYNAMIC Tue, 11
Nov 2025 00:02:22 +0000 x86_64 GNU/Linux

The issue is still present in 6.18.
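
For reference, the reproducer is nothing exotic; roughly the following
(paths, options and the sampling interval here are placeholders, not the
exact commands I ran):

  # shuffle ~5TB of files back and forth on the cephfs mount
  while true; do
      rsync -a --remove-source-files /mnt/cephfs/src/ /mnt/cephfs/dst/
      rsync -a --remove-source-files /mnt/cephfs/dst/ /mnt/cephfs/src/
  done &

  # sample kernel memory usage once an hour
  while true; do date; smem -wp; echo; sleep 3600; done > smem.log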

On Thu, 20 Nov 2025 12:23:51 +1000
Mal Haak <malcolm@...k.id.au> wrote:

> On Mon, 10 Nov 2025 18:20:08 +1000
> Mal Haak <malcolm@...k.id.au> wrote:
> 
> > Hello,
> > 
> > I have found a memory leak in 6.17.7 but I am unsure how to track it
> > down effectively.
> > 
> > I am running a server that has a heavy read/write workload to a
> > cephfs file system. It is a VM. 
> > 
> > Over time it appears that the non-cache usage of kernel dynamic
> > memory increases. The kernel seems to think the pages are
> > reclaimable, however nothing appears to trigger the reclaim. This
> > leads to workloads getting killed by the OOM killer.
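> > 
> > (One way to sanity-check whether those pages really are reclaimable
> > is to drop clean caches by hand and re-run smem; this is just the
> > standard proc interface, shown for illustration:)
> > 
> >   sync
> >   echo 3 > /proc/sys/vm/drop_caches
> >   smem -wp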
> > 
> > smem -wp output:
> > 
> > Area                           Used      Cache   Noncache 
> > firmware/hardware             0.00%      0.00%      0.00% 
> > kernel image                  0.00%      0.00%      0.00% 
> > kernel dynamic memory        88.21%     36.25%     51.96% 
> > userspace memory              9.49%      0.15%      9.34% 
> > free memory                   2.30%      2.30%      0.00% 
> > 
> > free -h output:
> > 
> >        total  used   free   shared  buff/cache available 
> > Mem:   31Gi   3.6Gi  500Mi  4.0Mi   11Gi      27Gi 
> > Swap:  4.0Gi  179Mi  3.8Gi
> > 
> > Reverting to the previous LTS kernel fixes the issue.
> > 
> > smem -wp output:
> > Area                           Used      Cache   Noncache 
> > firmware/hardware             0.00%      0.00%      0.00% 
> > kernel image                  0.00%      0.00%      0.00% 
> > kernel dynamic memory        80.22%     79.32%      0.90% 
> > userspace memory             10.48%      0.20%     10.28% 
> > free memory                   9.30%      9.30%      0.00% 
> >   
> I have more information. The leaking of kernel memory only starts once
> there is a lot of data in buffers/cache, and only once it has been in
> that state for several hours.
> 
> Currently, in my search for a reproducer, I have found that downloading
> and then seeding multiple torrents of Linux distribution ISOs will
> replicate the issue, but it only begins leaking at around the 6-9 hour
> mark.
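> 
> For example, something like this (client choice, flags and torrent
> names are just placeholders; any BitTorrent client and a handful of
> distro ISOs will do):
> 
>   aria2c --seed-time=720 --dir=/mnt/cephfs/torrents \
>          ubuntu-desktop-amd64.iso.torrent debian-dvd-amd64.iso.torrent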
> 
> It does not appear to be dependent on cephfs, but due to its use of
> sockets I believe cephfs is making the situation worse.
> 
> I cannot replicate it at all with the LTS kernel release, but it does
> look like the current RC releases have this issue as well.
> 
> I was looking at doing a kernel build with CONFIG_DEBUG_KMEMLEAK
> enabled, and will do so if it is thought this would find the issue.
> However, as the memory usage is still being tracked and is apparently
> marked as reclaimable, it feels more like something in the reclaim
> logic is getting broken.
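> 
> For completeness, my understanding of the kmemleak workflow, should it
> come to that, is roughly the following (standard debugfs interface,
> with CONFIG_DEBUG_KMEMLEAK=y):
> 
>   mount -t debugfs nodev /sys/kernel/debug   # if not already mounted
>   echo scan > /sys/kernel/debug/kmemleak     # trigger an immediate scan
>   cat /sys/kernel/debug/kmemleak             # list suspected leaks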
> 
> Given that it only happens after RAM is mostly consumed by cache, and
> even then only once it has been that way for hours, I do wonder whether
> the issue is memory fragmentation related.
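> 
> If it is fragmentation, the usual proc files should show it drifting
> over time; e.g. (standard interfaces, just for illustration):
> 
>   cat /proc/buddyinfo                    # free pages per order, per zone
>   cat /proc/pagetypeinfo                 # free pages by migratetype
>   echo 1 > /proc/sys/vm/compact_memory   # force compaction, then re-check smem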
> 
> Regardless, some advice on how to narrow this down faster than a git
> bisect would be appreciated, as 9 hours just to confirm replication of
> the issue makes bisecting painfully slow.
> 
> Thanks in advance
> 
> Mal Haak
> 

