linux-kernel - Re: madvise(MADV_COLLAPSE) fails with EINVAL on dirty file-backed text pages


Message-ID: <a41b57f6-08d6-4af0-8383-7ba3b90c1acb@amd.com>
Date: Fri, 7 Nov 2025 18:16:12 +0530
From: "Garg, Shivank" <shivankg@....com>
To: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
 "David Hildenbrand (Red Hat)" <davidhildenbrandkernel@...il.com>,
 Lance Yang <lance.yang@...ux.dev>
Cc: "Liam R. Howlett" <Liam.Howlett@...cle.com>,
 Ryan Roberts <ryan.roberts@....com>,
 Andrew Morton <akpm@...ux-foundation.org>, Zi Yan <ziy@...dia.com>,
 Baolin Wang <baolin.wang@...ux.alibaba.com>, Nico Pache <npache@...hat.com>,
 Dev Jain <dev.jain@....com>, Barry Song <baohua@...nel.org>,
 Lance Yang <lance.yang@...ux.dev>, Vlastimil Babka <vbabka@...e.cz>,
 Jann Horn <jannh@...gle.com>, zokeefe@...gle.com, linux-mm@...ck.org,
 linux-kernel@...r.kernel.org
Subject: Re: madvise(MADV_COLLAPSE) fails with EINVAL on dirty file-backed
 text pages

On 11/7/2025 3:40 PM, Lorenzo Stoakes wrote:
> On Fri, Nov 07, 2025 at 10:12:02AM +0100, David Hildenbrand (Red Hat) wrote:
>>
>>>
>>> 5. Yes, I'm calling madvise(MADV_COLLAPSE) on the text portion of the executable, using the address
>>>     range obtained from /proc/self/maps. IIUC, this should benefit applications by reducing ITLB pressure.
>>>
>>> I agree with the suggestions to either return EAGAIN instead of EINVAL or, at minimum, document the
>>> EINVAL return for dirty pages. I'm happy to work on a patch.
>>
>> Of course, we could detect that we are in MADV_COLLAPSE and simply writeback ourselves. After all,
>> user space asked for a collapse, and it's not khugepaged that will simply revisit it later.
>>
>> I did something similar in
>>
>> commit ab73b29efd36f8916c6cc9954e912c4723c9a1b0
>> Author: David Hildenbrand <david@...hat.com>
>> Date:   Fri May 16 14:39:46 2025 +0200
>>
>>     s390/uv: Improve splitting of large folios that cannot be split while dirty
>>     Currently, starting a PV VM on an iomap-based filesystem with large
>>     folio support, such as XFS, will not work. We'll be stuck in
>>     unpack_one()->gmap_make_secure(), because we can't seem to make progress
>>     splitting the large folio.
>>
>> Where I effectively use filemap_write_and_wait_range().
>>
>> It could be used early to writeback the whole range to collapse once, possibly.
> 
> I agree, let's just do a sync flush unconditionally and fix this that way.
> 
> This is simpler than I thought, the key bit of information is that we have
> freshly written the executable so it sits in the page cache but dirty.
> 
> Thanks, Lorenzo

Thanks, David, for sharing the commit. This worked for me, and the fix is simple.

+        if (!is_shmem && !cc->is_khugepaged && mapping_can_writeback(mapping)) {
+                /* Cast before shifting so the byte offset is computed in loff_t. */
+                loff_t range_start = (loff_t)start << PAGE_SHIFT;
+                loff_t range_end = ((loff_t)end << PAGE_SHIFT) - 1;
+                int ret;
+
+                ret = filemap_write_and_wait_range(mapping, range_start, range_end);
+                if (ret) {
+                        result = SCAN_FAIL;
+                        goto out;
+                }
+        }
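
For anyone who wants to try the scenario from user space, here is a rough sketch of the caller side: look up the executable mapping in /proc/self/maps, flush the backing file, then call madvise(MADV_COLLAPSE) on it. This is not the actual test program. The helper names (align_down, align_up, find_text_range, collapse_text) are made up for illustration, the 2 MiB huge page size is an assumption that holds on x86-64 with 4 KiB base pages, and the fsync() on /proc/self/exe is an assumed user-space stand-in for the in-kernel writeback above that may or may not be enough on a given filesystem.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25        /* uapi value; needs Linux 6.1 or later */
#endif

/* Assumption: PMD-sized huge pages are 2 MiB (x86-64, 4 KiB base pages). */
#define ASSUMED_PMD_SIZE ((uintptr_t)2 << 20)

/* Round x down to a multiple of a, where a is a power of two. */
uintptr_t align_down(uintptr_t x, uintptr_t a)
{
        assert(a != 0 && (a & (a - 1)) == 0);
        return x & ~(a - 1);
}

/* Round x up to a multiple of a, where a is a power of two. */
uintptr_t align_up(uintptr_t x, uintptr_t a)
{
        return align_down(x + a - 1, a);
}

/*
 * Find the executable mapping in /proc/self/maps that contains addr.
 * Returns 0 and fills [*start, *end) on success, -1 otherwise.
 */
int find_text_range(uintptr_t addr, uintptr_t *start, uintptr_t *end)
{
        FILE *f = fopen("/proc/self/maps", "r");
        char line[512];
        int ret = -1;

        if (!f)
                return -1;
        while (fgets(line, sizeof(line), f)) {
                uintptr_t lo, hi;
                char perms[5];

                if (sscanf(line, "%" SCNxPTR "-%" SCNxPTR " %4s",
                           &lo, &hi, perms) != 3)
                        continue;
                if (perms[2] == 'x' && addr >= lo && addr < hi) {
                        *start = lo;
                        *end = hi;
                        ret = 0;
                        break;
                }
        }
        fclose(f);
        return ret;
}

/*
 * Flush the executable's dirty page cache, then request a collapse of the
 * huge-page-aligned interior of the text mapping containing addr.
 * Returns 0 on success, -1 with errno set on failure.
 */
int collapse_text(uintptr_t addr)
{
        uintptr_t lo, hi;
        int fd;

        if (find_text_range(addr, &lo, &hi)) {
                errno = ENOENT;
                return -1;
        }
        lo = align_up(lo, ASSUMED_PMD_SIZE);
        hi = align_down(hi, ASSUMED_PMD_SIZE);
        if (lo >= hi) {         /* text is smaller than one huge page */
                errno = ERANGE;
                return -1;
        }

        /* Assumed workaround: write back dirty pages before collapsing. */
        fd = open("/proc/self/exe", O_RDONLY);
        if (fd >= 0) {
                fsync(fd);
                close(fd);
        }
        return madvise((void *)lo, hi - lo, MADV_COLLAPSE);
}
```

A caller would pass the address of any function in the binary, e.g. collapse_text((uintptr_t)&main), and check errno on failure. Whether the collapse succeeds depends on the kernel version and configuration, so the return value should be treated as best-effort.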

I'll do some more testing, rebase on mm-next, and post a cleaned-up version with proper comments.
Thanks,
Shivank