[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAHbLzkqvXsFfziYU6A_LXfF2UQHkmNHqyT05P+dTav3mi4b0hA@mail.gmail.com>
Date: Thu, 6 Nov 2025 12:32:58 -0800
From: Yang Shi <shy828301@...il.com>
To: "Garg, Shivank" <shivankg@....com>
Cc: Andrew Morton <akpm@...ux-foundation.org>, David Hildenbrand <david@...hat.com>,
Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, Zi Yan <ziy@...dia.com>,
Baolin Wang <baolin.wang@...ux.alibaba.com>, "Liam R. Howlett" <Liam.Howlett@...cle.com>,
Nico Pache <npache@...hat.com>, Ryan Roberts <ryan.roberts@....com>, Dev Jain <dev.jain@....com>,
Barry Song <baohua@...nel.org>, Lance Yang <lance.yang@...ux.dev>,
Vlastimil Babka <vbabka@...e.cz>, Jann Horn <jannh@...gle.com>, zokeefe@...gle.com,
linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: madvise(MADV_COLLAPSE) fails with EINVAL on dirty file-backed
text pages
On Thu, Nov 6, 2025 at 7:16 AM Garg, Shivank <shivankg@....com> wrote:
>
> Hi All,
>
> I've been investigating an issue with madvise(MADV_COLLAPSE) for TEXT pages
> when CONFIG_READ_ONLY_THP_FOR_FS=y is enabled, and would like to discuss the
> current behavior and improvements.
>
> Problem:
> When attempting to collapse read-only file-backed TEXT sections into THPs
> using madvise(MADV_COLLAPSE), the operation fails with EINVAL if the pages
> are marked dirty.
> madvise(aligned_start, aligned_size, MADV_COLLAPSE) -> returns -1 and errno = -22
>
> Subsequent calls to madvise(MADV_COLLAPSE) succeed because the first madvise
> attempt triggers filemap_flush() which initiates async writeback of the dirty folios.
>
> Root Cause:
> The failure occurs in mm/khugepaged.c:collapse_file():
> } else if (folio_test_dirty(folio)) {
> /*
> * khugepaged only works on read-only fd,
> * so this page is dirty because it hasn't
> * been flushed since first write. There
> * won't be new dirty pages.
> *
> * Trigger async flush here and hope the
> * writeback is done when khugepaged
> * revisits this page.
> */
> xas_unlock_irq(&xas);
> filemap_flush(mapping);
> result = SCAN_FAIL;
> goto xa_unlocked;
> }
>
> Why the text pages are dirty?
I'm not sure how you did the test, but if you ran the program right
after it was built, it may be possible the background writeback has
not kicked in yet, then MAD_COLLAPSE saw some dirty folios. This is
how your reproducer works at least. This is why filemap_flush() was
added in the first place. Please see commit
75f360696ce9d8ec8b253452b23b3e24c0689b4b.
> It initially seemed unusual for a read-only text section to be marked as dirty, but
> this was actually confirmed by /proc/pid/smaps.
>
> 55bc90200000-55bc91200000 r-xp 00400000 07:00 133 /mnt/xfs-mnt/large_binary_thp
> Size: 16384 kB
> KernelPageSize: 4 kB
> MMUPageSize: 4 kB
> Rss: 256 kB
> Pss: 256 kB
> Pss_Dirty: 256 kB
> Shared_Clean: 0 kB
> Shared_Dirty: 0 kB
> Private_Clean: 0 kB
> Private_Dirty: 256 kB
>
> /proc/pid/smaps (before calling MADV_COLLAPSE) showing Private_Dirty pages in r-xp mappings.
smaps shows private dirty if either the PTE is dirty or the folio is
dirty. For this case, I don't expect the PTE is dirty.
> This may be due to dynamic linker and relocations that occurred during program loading.
>
> Reproduction using XFS/EXT4:
>
> 1. Compile a test binary with madvise(MADV_COLLAPSE), ensuring the load TEXT segment is
> 2MB-aligned and sized to a multiple of 2MB.
> Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
> LOAD 0x400000 0x0000000000400000 0x0000000000400000 0x1000000 0x1000000 R E 0x200000
>
> 2. Create and mount the XFS/EXT4 fs:
> dd if=/dev/zero of=/tmp/xfs-test.img bs=1M count=1024
> losetup -f --show /tmp/xfs-test.img # output: /dev/loop0
> mkfs.xfs -f /dev/loop0
> mkdir -p /mnt/xfs-mnt
> mount /dev/loop0 /mnt/xfs-mnt
> 3. Copy the binaries to /mnt/xfs-mnt and execute.
> 4. Returns -EINVAL on first run, then run successfully on subsequent run. (100% reproducible)
> 5. To reproduce again; reboot/kexec and repeat from step 2.
>
> Workaround:
> 1. Manually flush dirty pages before calling madvise(MADV_COLLAPSE):
> int fd = open("/proc/self/exe", O_RDONLY);
> if (fd >= 0) {
> fsync(fd);
> close(fd);
> }
> // Now madvise(MADV_COLLAPSE) succeeds
> 2. Alternatively, retrying madvise_collapse on EINVAL failure also work.
>
> Problems with Current Behavior:
> 1. Confusing Error Code: The syscall returns EINVAL which typically indicates invalid arguments
> rather than a transient condition that could succeed on retry.
Yeah, I agree the return value is confusing. -EAGAIN may be better as
suggested by others.
>
> 2. Non-Transparent Handling: Users are unaware they need to flush dirty pages manually. Current
> madvise_collapse assumes the caller is khugepaged (as per code snippet comment) which will revisit
> the page. However, when called via madvise(MADV_COLLAPSE), the userspace program typically don't
> retry, making the async flush ineffective. Should we differentiate between madvise and khugepaged
> behavior for MADV_COLLAPSE?
Maybe MADV_COLLAPSE can have some retry logic?
Thanks,
Yang
>
> Would appreciate thoughts on the best approach to address this issue.
>
> Thanks,
> Shivank
>
Powered by blists - more mailing lists