lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <868d1103-df00-43dd-b04b-21568309445e@lucifer.local>
Date: Fri, 7 Nov 2025 10:09:41 +0000
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: "David Hildenbrand (Red Hat)" <davidhildenbrandkernel@...il.com>
Cc: "Liam R. Howlett" <Liam.Howlett@...cle.com>,
        Ryan Roberts <ryan.roberts@....com>,
        "Garg, Shivank" <shivankg@....com>,
        Andrew Morton <akpm@...ux-foundation.org>, Zi Yan <ziy@...dia.com>,
        Baolin Wang <baolin.wang@...ux.alibaba.com>,
        Nico Pache <npache@...hat.com>, Dev Jain <dev.jain@....com>,
        Barry Song <baohua@...nel.org>, Lance Yang <lance.yang@...ux.dev>,
        Vlastimil Babka <vbabka@...e.cz>, Jann Horn <jannh@...gle.com>,
        zokeefe@...gle.com, linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: madvise(MADV_COLLAPSE) fails with EINVAL on dirty file-backed
 text pages

On Thu, Nov 06, 2025 at 10:05:41PM +0100, David Hildenbrand (Red Hat) wrote:
> On 06.11.25 18:17, Lorenzo Stoakes wrote:
> > On Thu, Nov 06, 2025 at 11:55:05AM -0500, Liam R. Howlett wrote:
> > > * Ryan Roberts <ryan.roberts@....com> [251106 11:33]:
> > > > On 06/11/2025 12:16, Garg, Shivank wrote:
> > > > > Hi All,
> > > > >
> > > > > I've been investigating an issue with madvise(MADV_COLLAPSE) for TEXT pages
> > > > > when CONFIG_READ_ONLY_THP_FOR_FS=y is enabled, and would like to discuss the
> > > > > current behavior and improvements.
> > > > >
> > > > > Problem:
> > > > > When attempting to collapse read-only file-backed TEXT sections into THPs
> > > > > using madvise(MADV_COLLAPSE), the operation fails with EINVAL if the pages
> > > > > are marked dirty.
> > > > > madvise(aligned_start, aligned_size, MADV_COLLAPSE) -> returns -1 and errno = -22
> > > > >
> > > > > Subsequent calls to madvise(MADV_COLLAPSE) succeed because the first madvise
> > > > > attempt triggers filemap_flush() which initiates async writeback of the dirty folios.
> > > > >
> > > > > Root Cause:
> > > > > The failure occurs in mm/khugepaged.c:collapse_file():
> > > > > } else if (folio_test_dirty(folio)) {
> > > > >      /*
> > > > >       * khugepaged only works on read-only fd,
> > > > >       * so this page is dirty because it hasn't
> > > > >       * been flushed since first write. There
> > > > >       * won't be new dirty pages.
> > > > >       *
> > > > >       * Trigger async flush here and hope the
> > > > >       * writeback is done when khugepaged
> > > > >       * revisits this page.
> > > > >       */
> > > > >      xas_unlock_irq(&xas);
> > > > >      filemap_flush(mapping);
> > > > >      result = SCAN_FAIL;
> > > > >      goto xa_unlocked;
> > > > > }
> > > > >
> > > > > Why the text pages are dirty?
> > > >
> > > > This is the real question to to answer, I think...
> > >
> > > Agree with Ryan here, let's stop things from being marked dirty if they
> > > are not.
> >
> > Hmm I wonder if we have some broken assumptions in khugepaged for MAP_PRIVATE
> > mappings.
> >
> > collapse_single_pmd()
> > -> collapse_scan_file() if not vma_is_anonymous() (it won't be)
> > -> collapse_file()
> > -> the snippet above.

Sorry I was looking at Nico's series these functions aren't correct as of
mm-new atm.

This should be:

madvise_collapse()
-> hpage_collapse_scan_file()
-> collapse_file()


> >
> > But that could be running on an anon folio...
> >
> > Yup given it's CONFIG_READY_ONLY_THP_FOR_FS that is strange. We are confounding
> > expectations here surely?
> >
> > Presumably it's because these are MAP_PRIVATE mappings, so this is an anon folio
> > but then collapse_file() goes into the snippet above and gets very confused.
> >
> > Do we need to add a folio_test_anon() here?
> >
> > Unless I'm missing something... (very possible, am only glancing over the code
> > here)
>
> collapse_file() operates exclusively on the pagecache.

Right you're correct:

	XA_STATE_ORDER(xas, &mapping->i_pages, start, HPAGE_PMD_ORDER);

	...

		folio = xas_load(&xas);

etc. etc.

And with the code you mention below, markers + MAP_PRIVATE are handled
correctly.

This THP code :) such fun.

So yeah this is as simple as the folio is literally just dirty.

And:

			} else if (folio_test_dirty(folio)) {
				/*
				 * khugepaged only works on read-only fd,
				 * so this page is dirty because it hasn't
				 * been flushed since first write. There
				 * won't be new dirty pages.
				 *
				 * Trigger async flush here and hope the
				 * writeback is done when khugepaged
				 * revisits this page.
				 *
				 * This is a one-off situation. We are not
				 * forcing writeback in loop.
				 */
				xas_unlock_irq(&xas);
				filemap_flush(mapping);
Since we do an async flush here ----^

This is why a retry (assuming writeback completed) works.

				result = SCAN_FAIL;
				goto xa_unlocked;
			} else if (folio_test_writeback(folio)) {

>
> I think we only start working on the actual page tables when calling
> retract_page_tables().

Yup.

>
> In there, we have this code, when iterating over page tables belonging
> to the mapping:
>
> 		/*
> 		 * The lock of new_folio is still held, we will be blocked in
> 		 * the page fault path, which prevents the pte entries from
> 		 * being set again. So even though the old empty PTE page may be
> 		 * concurrently freed and a new PTE page is filled into the pmd
> 		 * entry, it is still empty and can be removed.
> 		 *
> 		 * So here we only need to recheck if the state of pmd entry
> 		 * still meets our requirements, rather than checking pmd_same()
> 		 * like elsewhere.
> 		 */
> 		if (check_pmd_state(pmd) != SCAN_SUCCEED)
> 			goto drop_pml;
> 		ptl = pte_lockptr(mm, pmd);
> 		if (ptl != pml)
> 			spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
>
> 		/*
> 		 * Huge page lock is still held, so normally the page table
> 		 * must remain empty; and we have already skipped anon_vma
> 		 * and userfaultfd_wp() vmas.  But since the mmap_lock is not
> 		 * held, it is still possible for a racing userfaultfd_ioctl()
> 		 * to have inserted ptes or markers.  Now that we hold ptlock,
> 		 * repeating the anon_vma check protects from one category,
> 		 * and repeating the userfaultfd_wp() check from another.
> 		 */
> 		if (likely(!vma->anon_vma && !userfaultfd_wp(vma))) {
> 			pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);
> 			pmdp_get_lockless_sync();
> 			success = true;
> 		}
>
> Given !vma->anon_vma, we cannot have anon folios in there.
>
> Given !userfaultfd_wp(vma), we cannot have uffd-wp markers in there.

Right.

>
> Given that all folios in the range we are collapsing where unmapped, we cannot have
> them mapped there.
>
> So the conclusion is that the page table must be empty and can be removed.
>
>
> Could guard markers be in there?

Right now guard markers only exist if vma->anon_vma is set, including the
file-backed case.

But for file-backed guard regions after my VMA sticky series this won't be the
case any more :)

So I had better go change that...

I hate that we have open-coded stuff all over the place that makes assumptions
like this.

This also ignores any other marker types. How I hate the uffd wp implementation.

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ