[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <8bc796e2-f652-4c12-a347-7b778ae7f899@arm.com>
Date: Thu, 6 Nov 2025 16:32:58 +0000
From: Ryan Roberts <ryan.roberts@....com>
To: "Garg, Shivank" <shivankg@....com>,
Andrew Morton <akpm@...ux-foundation.org>,
David Hildenbrand <david@...hat.com>,
Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, Zi Yan <ziy@...dia.com>,
Baolin Wang <baolin.wang@...ux.alibaba.com>,
"Liam R. Howlett" <Liam.Howlett@...cle.com>, Nico Pache <npache@...hat.com>,
Dev Jain <dev.jain@....com>, Barry Song <baohua@...nel.org>,
Lance Yang <lance.yang@...ux.dev>, Vlastimil Babka <vbabka@...e.cz>,
Jann Horn <jannh@...gle.com>, zokeefe@...gle.com
Cc: linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: madvise(MADV_COLLAPSE) fails with EINVAL on dirty file-backed
text pages
On 06/11/2025 12:16, Garg, Shivank wrote:
> Hi All,
>
> I've been investigating an issue with madvise(MADV_COLLAPSE) for TEXT pages
> when CONFIG_READ_ONLY_THP_FOR_FS=y is enabled, and would like to discuss the
> current behavior and improvements.
>
> Problem:
> When attempting to collapse read-only file-backed TEXT sections into THPs
> using madvise(MADV_COLLAPSE), the operation fails with EINVAL if the pages
> are marked dirty.
> madvise(aligned_start, aligned_size, MADV_COLLAPSE) -> returns -1 and errno = -22
>
> Subsequent calls to madvise(MADV_COLLAPSE) succeed because the first madvise
> attempt triggers filemap_flush() which initiates async writeback of the dirty folios.
>
> Root Cause:
> The failure occurs in mm/khugepaged.c:collapse_file():
> } else if (folio_test_dirty(folio)) {
> /*
> * khugepaged only works on read-only fd,
> * so this page is dirty because it hasn't
> * been flushed since first write. There
> * won't be new dirty pages.
> *
> * Trigger async flush here and hope the
> * writeback is done when khugepaged
> * revisits this page.
> */
> xas_unlock_irq(&xas);
> filemap_flush(mapping);
> result = SCAN_FAIL;
> goto xa_unlocked;
> }
>
> Why the text pages are dirty?
This is the real question to to answer, I think...
What architecture are you running on?
> It initially seemed unusual for a read-only text section to be marked as dirty, but
> this was actually confirmed by /proc/pid/smaps.
>
> 55bc90200000-55bc91200000 r-xp 00400000 07:00 133 /mnt/xfs-mnt/large_binary_thp
> Size: 16384 kB
> KernelPageSize: 4 kB
> MMUPageSize: 4 kB
> Rss: 256 kB
> Pss: 256 kB
> Pss_Dirty: 256 kB
> Shared_Clean: 0 kB
> Shared_Dirty: 0 kB
> Private_Clean: 0 kB
> Private_Dirty: 256 kB
>
> /proc/pid/smaps (before calling MADV_COLLAPSE) showing Private_Dirty pages in r-xp mappings.
> This may be due to dynamic linker and relocations that occurred during program loading.
On arm64 at least, I wouldn't expect the text to be modified. Relocations should
be handled in data. But given you have private dirty pages here, they must have
been cow'ed and are therefore anonymous? In which case, where is writeback
actually going?
>
> Reproduction using XFS/EXT4:
>
> 1. Compile a test binary with madvise(MADV_COLLAPSE), ensuring the load TEXT segment is
> 2MB-aligned and sized to a multiple of 2MB.
> Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
> LOAD 0x400000 0x0000000000400000 0x0000000000400000 0x1000000 0x1000000 R E 0x200000
>
> 2. Create and mount the XFS/EXT4 fs:
> dd if=/dev/zero of=/tmp/xfs-test.img bs=1M count=1024
> losetup -f --show /tmp/xfs-test.img # output: /dev/loop0
> mkfs.xfs -f /dev/loop0
> mkdir -p /mnt/xfs-mnt
> mount /dev/loop0 /mnt/xfs-mnt
> 3. Copy the binaries to /mnt/xfs-mnt and execute.
> 4. Returns -EINVAL on first run, then run successfully on subsequent run. (100% reproducible)
> 5. To reproduce again; reboot/kexec and repeat from step 2.
>
> Workaround:
> 1. Manually flush dirty pages before calling madvise(MADV_COLLAPSE):
> int fd = open("/proc/self/exe", O_RDONLY);
> if (fd >= 0) {
> fsync(fd);
> close(fd);
> }
> // Now madvise(MADV_COLLAPSE) succeeds
> 2. Alternatively, retrying madvise_collapse on EINVAL failure also work.
>
> Problems with Current Behavior:
> 1. Confusing Error Code: The syscall returns EINVAL which typically indicates invalid arguments
> rather than a transient condition that could succeed on retry.
>
> 2. Non-Transparent Handling: Users are unaware they need to flush dirty pages manually. Current
> madvise_collapse assumes the caller is khugepaged (as per code snippet comment) which will revisit
> the page. However, when called via madvise(MADV_COLLAPSE), the userspace program typically don't
> retry, making the async flush ineffective. Should we differentiate between madvise and khugepaged
> behavior for MADV_COLLAPSE?
>
> Would appreciate thoughts on the best approach to address this issue.
>
> Thanks,
> Shivank
Powered by blists - more mailing lists