lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <77a54f63-f5da-42a2-b24d-5c8a0f41d1e6@gmail.com>
Date: Thu, 6 Nov 2025 22:05:41 +0100
From: "David Hildenbrand (Red Hat)" <davidhildenbrandkernel@...il.com>
To: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
 "Liam R. Howlett" <Liam.Howlett@...cle.com>,
 Ryan Roberts <ryan.roberts@....com>, "Garg, Shivank" <shivankg@....com>,
 Andrew Morton <akpm@...ux-foundation.org>, Zi Yan <ziy@...dia.com>,
 Baolin Wang <baolin.wang@...ux.alibaba.com>, Nico Pache <npache@...hat.com>,
 Dev Jain <dev.jain@....com>, Barry Song <baohua@...nel.org>,
 Lance Yang <lance.yang@...ux.dev>, Vlastimil Babka <vbabka@...e.cz>,
 Jann Horn <jannh@...gle.com>, zokeefe@...gle.com, linux-mm@...ck.org,
 linux-kernel@...r.kernel.org
Subject: Re: madvise(MADV_COLLAPSE) fails with EINVAL on dirty file-backed
 text pages

On 06.11.25 18:17, Lorenzo Stoakes wrote:
> On Thu, Nov 06, 2025 at 11:55:05AM -0500, Liam R. Howlett wrote:
>> * Ryan Roberts <ryan.roberts@....com> [251106 11:33]:
>>> On 06/11/2025 12:16, Garg, Shivank wrote:
>>>> Hi All,
>>>>
>>>> I've been investigating an issue with madvise(MADV_COLLAPSE) for TEXT pages
>>>> when CONFIG_READ_ONLY_THP_FOR_FS=y is enabled, and would like to discuss the
>>>> current behavior and improvements.
>>>>
>>>> Problem:
>>>> When attempting to collapse read-only file-backed TEXT sections into THPs
>>>> using madvise(MADV_COLLAPSE), the operation fails with EINVAL if the pages
>>>> are marked dirty.
>>>> madvise(aligned_start, aligned_size, MADV_COLLAPSE) -> returns -1 and errno = -22
>>>>
>>>> Subsequent calls to madvise(MADV_COLLAPSE) succeed because the first madvise
>>>> attempt triggers filemap_flush() which initiates async writeback of the dirty folios.
>>>>
>>>> Root Cause:
>>>> The failure occurs in mm/khugepaged.c:collapse_file():
>>>> } else if (folio_test_dirty(folio)) {
>>>>      /*
>>>>       * khugepaged only works on read-only fd,
>>>>       * so this page is dirty because it hasn't
>>>>       * been flushed since first write. There
>>>>       * won't be new dirty pages.
>>>>       *
>>>>       * Trigger async flush here and hope the
>>>>       * writeback is done when khugepaged
>>>>       * revisits this page.
>>>>       */
>>>>      xas_unlock_irq(&xas);
>>>>      filemap_flush(mapping);
>>>>      result = SCAN_FAIL;
>>>>      goto xa_unlocked;
>>>> }
>>>>
>>>> Why the text pages are dirty?
>>>
>>> This is the real question to to answer, I think...
>>
>> Agree with Ryan here, let's stop things from being marked dirty if they
>> are not.
> 
> Hmm I wonder if we have some broken assumptions in khugepaged for MAP_PRIVATE
> mappings.
> 
> collapse_single_pmd()
> -> collapse_scan_file() if not vma_is_anonymous() (it won't be)
> -> collapse_file()
> -> the snippet above.
> 
> But that could be running on an anon folio...
> 
> Yup given it's CONFIG_READY_ONLY_THP_FOR_FS that is strange. We are confounding
> expectations here surely?
> 
> Presumably it's because these are MAP_PRIVATE mappings, so this is an anon folio
> but then collapse_file() goes into the snippet above and gets very confused.
> 
> Do we need to add a folio_test_anon() here?
> 
> Unless I'm missing something... (very possible, am only glancing over the code
> here)

collapse_file() operates exclusively on the pagecache.

I think we only start working on the actual page tables when calling
retract_page_tables().

In there, we have this code, when iterating over page tables belonging
to the mapping:

		/*
		 * The lock of new_folio is still held, we will be blocked in
		 * the page fault path, which prevents the pte entries from
		 * being set again. So even though the old empty PTE page may be
		 * concurrently freed and a new PTE page is filled into the pmd
		 * entry, it is still empty and can be removed.
		 *
		 * So here we only need to recheck if the state of pmd entry
		 * still meets our requirements, rather than checking pmd_same()
		 * like elsewhere.
		 */
		if (check_pmd_state(pmd) != SCAN_SUCCEED)
			goto drop_pml;
		ptl = pte_lockptr(mm, pmd);
		if (ptl != pml)
			spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);

		/*
		 * Huge page lock is still held, so normally the page table
		 * must remain empty; and we have already skipped anon_vma
		 * and userfaultfd_wp() vmas.  But since the mmap_lock is not
		 * held, it is still possible for a racing userfaultfd_ioctl()
		 * to have inserted ptes or markers.  Now that we hold ptlock,
		 * repeating the anon_vma check protects from one category,
		 * and repeating the userfaultfd_wp() check from another.
		 */
		if (likely(!vma->anon_vma && !userfaultfd_wp(vma))) {
			pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);
			pmdp_get_lockless_sync();
			success = true;
		}

Given !vma->anon_vma, we cannot have anon folios in there.

Given !userfaultfd_wp(vma), we cannot have uffd-wp markers in there.

Given that all folios in the range we are collapsing where unmapped, we cannot have
them mapped there.

So the conclusion is that the page table must be empty and can be removed.


Could guard markers be in there?

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ