Message-ID: <a41b57f6-08d6-4af0-8383-7ba3b90c1acb@amd.com>
Date: Fri, 7 Nov 2025 18:16:12 +0530
From: "Garg, Shivank" <shivankg@....com>
To: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
"David Hildenbrand (Red Hat)" <davidhildenbrandkernel@...il.com>,
Lance Yang <lance.yang@...ux.dev>
Cc: "Liam R. Howlett" <Liam.Howlett@...cle.com>,
Ryan Roberts <ryan.roberts@....com>,
Andrew Morton <akpm@...ux-foundation.org>, Zi Yan <ziy@...dia.com>,
Baolin Wang <baolin.wang@...ux.alibaba.com>, Nico Pache <npache@...hat.com>,
Dev Jain <dev.jain@....com>, Barry Song <baohua@...nel.org>,
Lance Yang <lance.yang@...ux.dev>, Vlastimil Babka <vbabka@...e.cz>,
Jann Horn <jannh@...gle.com>, zokeefe@...gle.com, linux-mm@...ck.org,
linux-kernel@...r.kernel.org
Subject: Re: madvise(MADV_COLLAPSE) fails with EINVAL on dirty file-backed
text pages
On 11/7/2025 3:40 PM, Lorenzo Stoakes wrote:
> On Fri, Nov 07, 2025 at 10:12:02AM +0100, David Hildenbrand (Red Hat) wrote:
>>
>>>
>>> 5. Yes, I'm calling madvise(MADV_COLLAPSE) on the text portion of the executable, using the address
>>> range obtained from /proc/self/maps. IIUC, this should benefit applications by reducing ITLB pressure.
>>>
>>> I agree with the suggestions to either return EAGAIN instead of EINVAL or, at minimum, document the
>>> EINVAL return for dirty pages. I'm happy to work on a patch.
>>
>> Of course, we could detect that we are in MADV_COLLAPSE and simply write back ourselves. After all,
>> user space asked for a collapse, and it's not khugepaged that will simply revisit it later.
>>
>> I did something similar in
>>
>> commit ab73b29efd36f8916c6cc9954e912c4723c9a1b0
>> Author: David Hildenbrand <david@...hat.com>
>> Date: Fri May 16 14:39:46 2025 +0200
>>
>> s390/uv: Improve splitting of large folios that cannot be split while dirty
>>
>> Currently, starting a PV VM on an iomap-based filesystem with large
>> folio support, such as XFS, will not work. We'll be stuck in
>> unpack_one()->gmap_make_secure(), because we can't seem to make progress
>> splitting the large folio.
>>
>> Where I effectively use filemap_write_and_wait_range().
>>
>> It could be used early to writeback the whole range to collapse once, possibly.
>
> I agree, let's just do a sync flush unconditionally and fix this that way.
>
> This is simpler than I thought, the key bit of information is that we have
> freshly written the executable so it sits in the page cache but dirty.
>
> Thanks, Lorenzo
Thanks, David, for sharing the commit. This worked for me, and the fix is simple:

+	if (!is_shmem && !cc->is_khugepaged && mapping_can_writeback(mapping)) {
+		loff_t range_start = start << PAGE_SHIFT;
+		loff_t range_end = (end << PAGE_SHIFT) - 1;
+		int ret;
+
+		ret = filemap_write_and_wait_range(mapping, range_start, range_end);
+		if (ret) {
+			result = SCAN_FAIL;
+			goto out;
+		}
+	}
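
For anyone checking the offset arithmetic in the hunk: filemap_write_and_wait_range() takes an inclusive end byte, hence the "- 1". Below is a minimal user-space sketch of the same conversion. It assumes 4 KiB pages and that 'end' is the exclusive page index one past the range; the sketch_* names and SKETCH_PAGE_SHIFT are hypothetical and not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB pages. The kernel's PAGE_SHIFT is architecture-dependent. */
#define SKETCH_PAGE_SHIFT 12

/* Byte offset of the first byte of page index 'start'. */
static int64_t sketch_range_start(uint64_t start)
{
	/* The index is already 64 bits wide here, so the shift cannot
	 * truncate the way a 32-bit page index could. */
	return (int64_t)(start << SKETCH_PAGE_SHIFT);
}

/* Byte offset of the last byte before page index 'end'.
 * 'end' is exclusive; the returned offset is inclusive. */
static int64_t sketch_range_end(uint64_t end)
{
	assert(end > 0);
	return (int64_t)(end << SKETCH_PAGE_SHIFT) - 1;
}
```

With these assumptions, page indices 512 to 1024 map to the byte range 2097152 to 4194303 inclusive, which is exactly one 2 MiB region.
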
I'll do some more testing, rebase on mm-next, and post a cleaned-up version with proper comments.

Thanks,
Shivank