Message-ID: <20251216112647.39ac2295@xps15mal>
Date: Tue, 16 Dec 2025 11:26:47 +1000
From: Mal Haak <malcolm@...k.id.au>
To: Viacheslav Dubeyko <Slava.Dubeyko@....com>
Cc: "00107082@....com" <00107082@....com>, "ceph-devel@...r.kernel.org"
 <ceph-devel@...r.kernel.org>, Xiubo Li <xiubli@...hat.com>,
 "idryomov@...il.com" <idryomov@...il.com>, "linux-kernel@...r.kernel.org"
 <linux-kernel@...r.kernel.org>, "surenb@...gle.com" <surenb@...gle.com>
Subject: Re: Possible memory leak in 6.17.7

On Mon, 15 Dec 2025 19:42:56 +0000
Viacheslav Dubeyko <Slava.Dubeyko@....com> wrote:

> Hi Mal,
> 
<SNIP> 
> 
> Thanks a lot for reporting the issue. Finally, I can see the
> discussion in email list. :) Are you working on the patch with the
> fix? Should we wait for the fix or I need to start the issue
> reproduction and investigation? I am simply trying to avoid patches
> collision and, also, I have multiple other issues for the fix in
> CephFS kernel client. :)
> 
> Thanks,
> Slava.

Hello,

Unfortunately, creating a patch is just outside my comfort zone; I've
lived too long in Lustre land.

I have been trying to narrow down a consistent reproducer that's as
fast as my production workload (it crashes a 32GB VM in about two
hours), and I haven't got one quite that fast yet. I think the dd
workload is too well behaved.

I can confirm the issue appeared with the major patch set that went
into the 6.15 kernel, i.e. during the more complete pages-to-folios
conversion, and that the bug's behaviour has not changed since then. I
did look at all the diffs to addr.c from 6.14 to 6.18 and didn't see
any changes after 6.15 that looked like they would affect the bug's
behaviour.

Again, I'm not super familiar with the CephFS code, but to hazard a
guess: the fact that the web-download workload triggers the problem
faster suggests that unaligned writes might make things worse. I'm not
100% sure, though. I can't find a reproducer as fast as downloading a
dataset; rsync of lots and lots of tiny files is a tad faster than the
dd case.
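
For what it's worth, the kind of write pattern I have in mind looks
roughly like the sketch below, assuming a CephFS mount at
/mnt/cephfs/leaktest (the path, worker count, file count and sizes are
placeholders, not my actual workload):

#!/usr/bin/env python3
# Sketch of an unaligned, bursty write load against a CephFS mount.
# This only approximates the download workload; it is not a confirmed
# reproducer.
import os
import random
from concurrent.futures import ThreadPoolExecutor

MOUNT = "/mnt/cephfs/leaktest"   # placeholder CephFS path
WORKERS = 100                    # mimic ~100 parallel writers
FILES_PER_WORKER = 200
PER_FILE_BYTES = 1 << 20         # ~1 MiB per file, written in odd-sized pieces

def writer(worker_id):
    for i in range(FILES_PER_WORKER):
        path = os.path.join(MOUNT, "w%d-f%d.bin" % (worker_id, i))
        with open(path, "wb") as f:
            remaining = PER_FILE_BYTES
            while remaining > 0:
                # Deliberately unaligned write sizes (not page multiples).
                n = min(remaining, random.randint(1, 9000))
                f.write(os.urandom(n))
                remaining -= n

os.makedirs(MOUNT, exist_ok=True)
with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    futures = [pool.submit(writer, w) for w in range(WORKERS)]
    for fut in futures:
        fut.result()   # surface any I/O errors

The odd write sizes are just there to keep the writes off page
boundaries.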

I did see some changes in ceph_check_page_before_write: the previous
code unlocked pages and then continued, whereas the changed folio code
just returns ENODATA and doesn't unlock anything, with most of the rest
of the logic unchanged. This might be perfectly fine, but in my,
admittedly limited, reading of the code I couldn't figure out where
anything that was locked prior to this being called would get unlocked,
as it did before the change. Again, I could be miles off here, and one
of the bulk reclaim/unlock passes that was added might be cleaning this
up correctly, or some other functional change might take care of it,
but it looks to be potentially in the code path I'm exercising and it
has had some unlock logic changed.
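
To be clear about what I mean, here is a purely userspace analogy of
the two patterns (threading.Lock standing in for a locked folio; this
is not the CephFS code itself):

#!/usr/bin/env python3
# Userspace analogy only: a stand-in for "locked folio" state.
import errno
import threading

lock = threading.Lock()  # stands in for a locked folio/page

def old_style_check(ok):
    """Pre-folio pattern (as I read it): unlock before skipping the item."""
    lock.acquire()
    if not ok:
        lock.release()          # caller never sees a still-locked item
        return -errno.ENODATA
    lock.release()
    return 0

def new_style_check(ok):
    """Pattern I think I'm seeing: return ENODATA without unlocking."""
    lock.acquire()
    if not ok:
        return -errno.ENODATA   # lock (folio) is still held here
    lock.release()
    return 0

new_style_check(False)
print("lock still held after early return:", lock.locked())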

I've spent most of my time trying to find a solid, quick reproducer.
Not that it takes long to start leaking folios, but I wanted something
that triggered it aggressively, so that a small VM would OOM quickly
and, combined with panic_on_oom, could potentially be used for
regression testing by way of "did the VM crash?".
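
While chasing that, I've just been sampling /proc/meminfo to watch the
trend instead of waiting for the OOM; something along these lines (the
field names and interval are only my choice, nothing CephFS-specific):

#!/usr/bin/env python3
# Periodically log a few /proc/meminfo fields so the leak shows up as a
# steady downward trend long before the OOM killer (or panic_on_oom) fires.
import time

FIELDS = ("MemAvailable", "Dirty", "Writeback")
INTERVAL = 30  # seconds between samples

def sample():
    out = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            if key in FIELDS:
                out[key] = rest.strip()
    return out

while True:
    print(time.strftime("%H:%M:%S"), sample(), flush=True)
    time.sleep(INTERVAL)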

I'm not sure how much it will help, but I'll provide what details I can
about the actual workload that really sets it off. It's a Python-based
tool for downloading datasets. Datasets are split into N chunks and the
tool downloads them in parallel, 100 at a time, until all N chunks are
down. The compressed dataset is then unpacked and reassembled for use
with workloads.
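
In shape it's roughly the following (the URL, chunk count and paths are
placeholders, not the real tool):

#!/usr/bin/env python3
# Rough shape of the download workload: N chunks fetched 100 at a time
# into a CephFS-backed home directory, then concatenated back into one file.
import os
import urllib.request
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "https://example.com/dataset/chunk-{:05d}.bin"  # placeholder URL
DEST_DIR = os.path.expanduser("~/dataset-chunks")          # on CephFS
N_CHUNKS = 1000
PARALLEL = 100

def fetch(i):
    path = os.path.join(DEST_DIR, "chunk-{:05d}.bin".format(i))
    urllib.request.urlretrieve(BASE_URL.format(i), path)
    return path

os.makedirs(DEST_DIR, exist_ok=True)
with ThreadPoolExecutor(max_workers=PARALLEL) as pool:
    paths = list(pool.map(fetch, range(N_CHUNKS)))

# Reassemble (the real tool also decompresses; omitted here).
with open(os.path.join(DEST_DIR, "dataset.bin"), "wb") as out:
    for p in paths:
        with open(p, "rb") as f:
            out.write(f.read())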

This replicates a common home-folder use case in HPC. CephFS is very
attractive for home folders due to its "NFS-like" utility and
performance, and many tools use a similar method for fetching large
datasets. Tools are frequently written in Python or Go.

None of my customers have hit this yet, nor have any enterprise
customers, as none have moved to a new enough kernel yet due to slow
upgrade cycles. Even Proxmox have only just started testing on a kernel
version > 6.14.

I'm more than happy to help however I can with testing. I can run
instrumented kernels, test patches, or whatever you need. I am sorry I
haven't been able to produce a super clean, fast reproducer (my test
cluster at home is all spinners and only 500TB usable), but I figured I
needed to get the word out ASAP, as distros and soon customers are
going to be moving past 6.12-6.14 kernels as the 5-7 year update cycle
marches on, especially those wanting to take full advantage of fscache
and encryption functionality.

Again, thanks for looking at this, and do reach out if I can help in
any way. I am on the Ceph Slack if it's faster to reach out that way.

Regards

Mal Haak
