[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <868d1103-df00-43dd-b04b-21568309445e@lucifer.local>
Date: Fri, 7 Nov 2025 10:09:41 +0000
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: "David Hildenbrand (Red Hat)" <davidhildenbrandkernel@...il.com>
Cc: "Liam R. Howlett" <Liam.Howlett@...cle.com>,
Ryan Roberts <ryan.roberts@....com>,
"Garg, Shivank" <shivankg@....com>,
Andrew Morton <akpm@...ux-foundation.org>, Zi Yan <ziy@...dia.com>,
Baolin Wang <baolin.wang@...ux.alibaba.com>,
Nico Pache <npache@...hat.com>, Dev Jain <dev.jain@....com>,
Barry Song <baohua@...nel.org>, Lance Yang <lance.yang@...ux.dev>,
Vlastimil Babka <vbabka@...e.cz>, Jann Horn <jannh@...gle.com>,
zokeefe@...gle.com, linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: madvise(MADV_COLLAPSE) fails with EINVAL on dirty file-backed
text pages
On Thu, Nov 06, 2025 at 10:05:41PM +0100, David Hildenbrand (Red Hat) wrote:
> On 06.11.25 18:17, Lorenzo Stoakes wrote:
> > On Thu, Nov 06, 2025 at 11:55:05AM -0500, Liam R. Howlett wrote:
> > > * Ryan Roberts <ryan.roberts@....com> [251106 11:33]:
> > > > On 06/11/2025 12:16, Garg, Shivank wrote:
> > > > > Hi All,
> > > > >
> > > > > I've been investigating an issue with madvise(MADV_COLLAPSE) for TEXT pages
> > > > > when CONFIG_READ_ONLY_THP_FOR_FS=y is enabled, and would like to discuss the
> > > > > current behavior and improvements.
> > > > >
> > > > > Problem:
> > > > > When attempting to collapse read-only file-backed TEXT sections into THPs
> > > > > using madvise(MADV_COLLAPSE), the operation fails with EINVAL if the pages
> > > > > are marked dirty.
> > > > > madvise(aligned_start, aligned_size, MADV_COLLAPSE) -> returns -1 and errno = -22
> > > > >
> > > > > Subsequent calls to madvise(MADV_COLLAPSE) succeed because the first madvise
> > > > > attempt triggers filemap_flush() which initiates async writeback of the dirty folios.
> > > > >
> > > > > Root Cause:
> > > > > The failure occurs in mm/khugepaged.c:collapse_file():
> > > > > } else if (folio_test_dirty(folio)) {
> > > > > /*
> > > > > * khugepaged only works on read-only fd,
> > > > > * so this page is dirty because it hasn't
> > > > > * been flushed since first write. There
> > > > > * won't be new dirty pages.
> > > > > *
> > > > > * Trigger async flush here and hope the
> > > > > * writeback is done when khugepaged
> > > > > * revisits this page.
> > > > > */
> > > > > xas_unlock_irq(&xas);
> > > > > filemap_flush(mapping);
> > > > > result = SCAN_FAIL;
> > > > > goto xa_unlocked;
> > > > > }
> > > > >
> > > > > Why the text pages are dirty?
> > > >
> > > > This is the real question to to answer, I think...
> > >
> > > Agree with Ryan here, let's stop things from being marked dirty if they
> > > are not.
> >
> > Hmm I wonder if we have some broken assumptions in khugepaged for MAP_PRIVATE
> > mappings.
> >
> > collapse_single_pmd()
> > -> collapse_scan_file() if not vma_is_anonymous() (it won't be)
> > -> collapse_file()
> > -> the snippet above.
Sorry I was looking at Nico's series these functions aren't correct as of
mm-new atm.
This should be:
madvise_collapse()
-> hpage_collapse_scan_file()
-> collapse_file()
> >
> > But that could be running on an anon folio...
> >
> > Yup given it's CONFIG_READY_ONLY_THP_FOR_FS that is strange. We are confounding
> > expectations here surely?
> >
> > Presumably it's because these are MAP_PRIVATE mappings, so this is an anon folio
> > but then collapse_file() goes into the snippet above and gets very confused.
> >
> > Do we need to add a folio_test_anon() here?
> >
> > Unless I'm missing something... (very possible, am only glancing over the code
> > here)
>
> collapse_file() operates exclusively on the pagecache.
Right you're correct:
XA_STATE_ORDER(xas, &mapping->i_pages, start, HPAGE_PMD_ORDER);
...
folio = xas_load(&xas);
etc. etc.
And with the code you mention below, markers + MAP_PRIVATE are handled
correctly.
This THP code :) such fun.
So yeah this is as simple as the folio is literally just dirty.
And:
} else if (folio_test_dirty(folio)) {
/*
* khugepaged only works on read-only fd,
* so this page is dirty because it hasn't
* been flushed since first write. There
* won't be new dirty pages.
*
* Trigger async flush here and hope the
* writeback is done when khugepaged
* revisits this page.
*
* This is a one-off situation. We are not
* forcing writeback in loop.
*/
xas_unlock_irq(&xas);
filemap_flush(mapping);
Since we do an async flush here ----^
This is why a retry (assuming writeback completed) works.
result = SCAN_FAIL;
goto xa_unlocked;
} else if (folio_test_writeback(folio)) {
>
> I think we only start working on the actual page tables when calling
> retract_page_tables().
Yup.
>
> In there, we have this code, when iterating over page tables belonging
> to the mapping:
>
> /*
> * The lock of new_folio is still held, we will be blocked in
> * the page fault path, which prevents the pte entries from
> * being set again. So even though the old empty PTE page may be
> * concurrently freed and a new PTE page is filled into the pmd
> * entry, it is still empty and can be removed.
> *
> * So here we only need to recheck if the state of pmd entry
> * still meets our requirements, rather than checking pmd_same()
> * like elsewhere.
> */
> if (check_pmd_state(pmd) != SCAN_SUCCEED)
> goto drop_pml;
> ptl = pte_lockptr(mm, pmd);
> if (ptl != pml)
> spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
>
> /*
> * Huge page lock is still held, so normally the page table
> * must remain empty; and we have already skipped anon_vma
> * and userfaultfd_wp() vmas. But since the mmap_lock is not
> * held, it is still possible for a racing userfaultfd_ioctl()
> * to have inserted ptes or markers. Now that we hold ptlock,
> * repeating the anon_vma check protects from one category,
> * and repeating the userfaultfd_wp() check from another.
> */
> if (likely(!vma->anon_vma && !userfaultfd_wp(vma))) {
> pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);
> pmdp_get_lockless_sync();
> success = true;
> }
>
> Given !vma->anon_vma, we cannot have anon folios in there.
>
> Given !userfaultfd_wp(vma), we cannot have uffd-wp markers in there.
Right.
>
> Given that all folios in the range we are collapsing where unmapped, we cannot have
> them mapped there.
>
> So the conclusion is that the page table must be empty and can be removed.
>
>
> Could guard markers be in there?
Right now guard markers only exist if vma->anon_vma is set, including the
file-backed case.
But for file-backed guard regions after my VMA sticky series this won't be the
case any more :)
So I had better go change that...
I hate that we have open-coded stuff all over the place that makes assumptions
like this.
This also ignores any other marker types. How I hate the uffd wp implementation.
Powered by blists - more mailing lists