Message-ID: <e21ea030-b05f-42e6-b479-b3e0789b9d97@linux.dev>
Date: Fri, 7 Nov 2025 18:09:29 +0800
From: Lance Yang <lance.yang@...ux.dev>
To: "David Hildenbrand (Red Hat)" <davidhildenbrandkernel@...il.com>,
"Garg, Shivank" <shivankg@....com>,
Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
"Liam R. Howlett" <Liam.Howlett@...cle.com>,
Ryan Roberts <ryan.roberts@....com>,
Andrew Morton <akpm@...ux-foundation.org>, Zi Yan <ziy@...dia.com>,
Baolin Wang <baolin.wang@...ux.alibaba.com>, Nico Pache <npache@...hat.com>,
Dev Jain <dev.jain@....com>, Barry Song <baohua@...nel.org>,
Vlastimil Babka <vbabka@...e.cz>, Jann Horn <jannh@...gle.com>,
zokeefe@...gle.com, linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: madvise(MADV_COLLAPSE) fails with EINVAL on dirty file-backed
text pages
On 2025/11/7 17:12, David Hildenbrand (Red Hat) wrote:
>
>>
>> 5. Yes, I'm calling madvise(MADV_COLLAPSE) on the text portion of the
>> executable, using the address
>> range obtained from /proc/self/maps. IIUC, this should benefit
>> applications by reducing ITLB pressure.
>>
>> I agree with the suggestions to either return EAGAIN instead of EINVAL
>> or, at minimum, document the EINVAL return for dirty pages. I'm happy
>> to work on a patch.
>
> Of course, we could detect that we are in MADV_COLLAPSE and simply
> write back ourselves. After all, user space asked for a collapse, and
> it's not khugepaged that will simply revisit it later.
>
> I did something similar in
>
> commit ab73b29efd36f8916c6cc9954e912c4723c9a1b0
> Author: David Hildenbrand <david@...hat.com>
> Date: Fri May 16 14:39:46 2025 +0200
>
> s390/uv: Improve splitting of large folios that cannot be split while dirty
>
> Currently, starting a PV VM on an iomap-based filesystem with large
> folio support, such as XFS, will not work. We'll be stuck in
> unpack_one()->gmap_make_secure(), because we can't seem to make
> progress splitting the large folio.
>
> Where I effectively use filemap_write_and_wait_range().
>
> It could possibly be used early to write back the whole range to be
> collapsed, once.
Exactly!

Since MADV_COLLAPSE is a best-effort thing, having the kernel use
something like filemap_write_and_wait_range() to write back the pages
before collapsing is likely what users would expect.

Anyway, they just want to get a THP, whether the pages are dirty or
clean :)
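
Until the kernel does the writeback itself, user space could
approximate the same behaviour. Below is a minimal sketch, not a tested
fix: it assumes the EINVAL comes from dirty page-cache pages (as
discussed in this thread) and that an fsync() on the backing file is
enough to clean them. Both helper names are hypothetical;
parse_exec_mapping() only illustrates picking the text range out of a
/proc/self/maps line.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25 /* from the Linux uapi headers; needs Linux 6.1+ */
#endif

/*
 * Hypothetical helper: parse one /proc/<pid>/maps line. Returns 1 and
 * fills start/end if the mapping is executable (perms such as "r-xp"),
 * 0 otherwise.
 */
static int parse_exec_mapping(const char *line, unsigned long *start,
                              unsigned long *end)
{
    char perms[5];

    if (sscanf(line, "%lx-%lx %4s", start, end, perms) != 3)
        return 0;
    return strlen(perms) == 4 && perms[2] == 'x';
}

/*
 * Hypothetical helper: try to collapse [addr, addr + len). If the kernel
 * reports EINVAL, flush the backing file (fd) and retry once.
 * Returns 0 on success or a negative errno value.
 */
static int collapse_with_writeback(int fd, void *addr, size_t len)
{
    if (madvise(addr, len, MADV_COLLAPSE) == 0)
        return 0;
    if (errno != EINVAL)
        return -errno;

    /* Assumption: EINVAL here may mean dirty pages; write them back. */
    if (fsync(fd) != 0)
        return -errno;

    if (madvise(addr, len, MADV_COLLAPSE) == 0)
        return 0;
    return -errno;
}
```

EINVAL can also have other causes (an unsupported kernel or an
unsuitable mapping, for example), so the retry is only a heuristic and
a second failure should simply be reported to the caller.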