Message-ID: <63fa6bc2.6afc.19b25f618ad.Coremail.00107082@163.com>
Date: Tue, 16 Dec 2025 15:00:43 +0800 (CST)
From: "David Wang" <00107082@....com>
To: "Mal Haak" <malcolm@...k.id.au>
Cc: "Viacheslav Dubeyko" <Slava.Dubeyko@....com>,
"ceph-devel@...r.kernel.org" <ceph-devel@...r.kernel.org>,
"Xiubo Li" <xiubli@...hat.com>,
"idryomov@...il.com" <idryomov@...il.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"surenb@...gle.com" <surenb@...gle.com>
Subject: Re: Possible memory leak in 6.17.7
At 2025-12-16 09:26:47, "Mal Haak" <malcolm@...k.id.au> wrote:
>On Mon, 15 Dec 2025 19:42:56 +0000
>Viacheslav Dubeyko <Slava.Dubeyko@....com> wrote:
>
>> Hi Mal,
>>
><SNIP>
>>
>> Thanks a lot for reporting the issue. Finally, I can see the
>> discussion in email list. :) Are you working on the patch with the
>> fix? Should we wait for the fix or I need to start the issue
>> reproduction and investigation? I am simply trying to avoid patches
>> collision and, also, I have multiple other issues for the fix in
>> CephFS kernel client. :)
>>
>> Thanks,
>> Slava.
>
>Hello,
>
>Unfortunately creating a patch is just outside my comfort zone, I've
>lived too long in Lustre land.
Hi, just out of curiosity: have you narrowed down which caller of
__filemap_get_folio is causing the memory problem? Or did you have trouble
applying the debug patch for memory allocation profiling?
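In case it helps with reading the profiling output, here is a minimal sketch
for ranking allocation sites. It assumes a kernel built with memory allocation
profiling and enabled at runtime, and it assumes each data line of
/proc/allocinfo is "<size> <calls> <tag info>". The helper names
(parse_allocinfo, top_allocators) are made up for illustration.

```python
import os

def parse_allocinfo(text):
    """Parse allocinfo-style text into (size_bytes, calls, tag) tuples.

    Assumes each data line is "<size> <calls> <tag info>"; header lines
    and anything that does not start with two integers are skipped.
    """
    entries = []
    for line in text.splitlines():
        parts = line.split(None, 2)
        if len(parts) < 3 or not parts[0].isdigit() or not parts[1].isdigit():
            continue
        entries.append((int(parts[0]), int(parts[1]), parts[2].strip()))
    return entries

def top_allocators(text, limit=10, needle=""):
    """Return the largest allocation sites whose tag contains `needle`."""
    entries = [e for e in parse_allocinfo(text) if needle in e[2]]
    return sorted(entries, key=lambda e: e[0], reverse=True)[:limit]

if __name__ == "__main__":
    path = "/proc/allocinfo"  # usually readable by root only
    if os.path.exists(path):
        with open(path) as f:
            for size, calls, tag in top_allocators(f.read()):
                print(f"{size:>14} {calls:>10} {tag}")
```

Taking two snapshots some minutes apart while the workload runs and comparing
the top entries should show which site keeps growing.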
David
>
>I've been trying to narrow down a consistent reproducer that's as
>fast as my production workload (it crashes a 32GB VM in 2hrs), but I
>haven't got it quite as fast. I think the dd workload is too well
>behaved.
>
>I can confirm the issue appeared in the major patch set that was
>applied as part of the 6.15 kernel, i.e. during the more complete
>pages-to-folios switch, and that nothing has changed in the bug
>behaviour since then. I did have a look at all the diffs from 6.14 to
>6.18 on addr.c and didn't see any changes post 6.15 that looked like
>they would impact the bug behaviour.
>
>Again, I'm not super familiar with the CephFS code, but to hazard a
>guess, the fact that the web download workload triggers things faster
>suggests that unaligned writes might make things worse. But again, I'm
>not 100% sure. I can't find a reproducer as fast as downloading a
>dataset. Rsync of lots and lots of tiny files is a tad faster than the
>dd case.
>
>I did see some changes in ceph_check_page_before_write where the
>previous code unlocked pages and then continued, whereas the changed
>folio code just returns ENODATA and doesn't unlock anything, with most
>of the rest of the logic unchanged. This might be perfectly fine, but
>in my admittedly limited reading of the code I couldn't figure out
>where anything that was locked prior to this being called would get
>unlocked like it did prior to the change. Again, I could be miles off
>here and one of the bulk reclaim/unlock passes that was added might be
>cleaning this up correctly or some other functional change might take
>care of this, but it looks to be potentially in the code path I'm
>exercising and it has had some unlock logic changed.
>
>I've spent most of my time trying to find a solid quick reproducer. Not
>that it takes long to start leaking folios, but I wanted something that
>aggressively triggered it so a small VM would OOM quickly and, when
>combined with panic_on_oom, it could potentially be used for regression
>testing by way of "did vm crash?".
>
>I'm not sure if it will super help, but I'll provide what details I can
>about the actual workload that really sets it off. It's a Python-based
>tool for downloading datasets. Datasets are split into N chunks and the
>tool downloads them in parallel 100 at a time until all N chunks are
>down. The compressed dataset is then unpacked and reassembled for
>use with workloads.
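A rough stand-in for that access pattern, in case it is useful as a starting
point for a reproducer. This is only a sketch, not the actual tool: the
function names, the 4093-byte write size and the file naming are invented, the
"download" is simulated with random data, and it has not been verified to
trigger the leak.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def write_chunk(directory, index, chunk_size, write_size):
    """Write one chunk file using small, deliberately unaligned writes."""
    path = os.path.join(directory, f"chunk_{index:06d}")
    remaining = chunk_size
    # buffering=0 so each write() below reaches the kernel as its own syscall
    with open(path, "wb", buffering=0) as f:
        while remaining > 0:
            n = min(write_size, remaining)
            f.write(os.urandom(n))
            remaining -= n
    return path

def run_workload(directory, num_chunks, chunk_size, write_size=4093, workers=100):
    """Write num_chunks files with up to `workers` in flight, then reassemble."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        paths = list(pool.map(
            lambda i: write_chunk(directory, i, chunk_size, write_size),
            range(num_chunks)))
    assembled = os.path.join(directory, "dataset.bin")
    with open(assembled, "wb") as out:
        for path in paths:  # pool.map keeps chunk order
            with open(path, "rb") as f:
                out.write(f.read())
    return assembled

if __name__ == "__main__":
    # Replace the temporary directory with a path on the CephFS mount.
    with tempfile.TemporaryDirectory() as d:
        result = run_workload(d, num_chunks=8, chunk_size=10_000)
        print(os.path.getsize(result))
```

The odd write size is there only to test the guess about unaligned writes;
chunk count and size would need scaling up a lot to resemble a real dataset.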
>
>This is replicating a common home folder use case in HPC. CephFS is very
>attractive for home folders due to its "NFS-like" utility and
>performance. And many tools use a similar method for fetching large
>datasets. Tools are frequently written in Python or Go.
>
>None of my customers have hit this yet, nor have any enterprise
>customers, as none have moved to a new enough kernel yet due to slow
>upgrade cycles. Even Proxmox have only just started testing on a kernel
>version > 6.14.
>
>I'm more than happy to help however I can with testing. I can run
>instrumented kernels or test patches or whatever you need. I am sorry I
>haven't been able to produce a super clean, fast reproducer (my test
>cluster at home is all spinners and only 500TB usable). But I figured I
>needed to get the word out asap as distros and soon customers are going
>to be moving past 6.12-6.14 kernels as the 5-7 year update cycle
>marches on. Especially those wanting to take full advantage of CacheFS
>and encryption functionality.
>
>Again, thanks for looking at this and do reach out if I can help in
>any way. I am in the Ceph Slack if it's faster to reach out that way.
>
>Regards
>
>Mal Haak