[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ozkb6mcxuymlz7tm4vcnqf266gd4ruiik2zal2koo5ffprgxfk@35godtyix2cf>
Date: Thu, 6 Nov 2025 11:55:05 -0500
From: "Liam R. Howlett" <Liam.Howlett@...cle.com>
To: Ryan Roberts <ryan.roberts@....com>
Cc: "Garg, Shivank" <shivankg@....com>,
Andrew Morton <akpm@...ux-foundation.org>,
David Hildenbrand <david@...hat.com>,
Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, Zi Yan <ziy@...dia.com>,
Baolin Wang <baolin.wang@...ux.alibaba.com>,
Nico Pache <npache@...hat.com>, Dev Jain <dev.jain@....com>,
Barry Song <baohua@...nel.org>, Lance Yang <lance.yang@...ux.dev>,
Vlastimil Babka <vbabka@...e.cz>, Jann Horn <jannh@...gle.com>,
zokeefe@...gle.com, linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: madvise(MADV_COLLAPSE) fails with EINVAL on dirty file-backed
text pages
* Ryan Roberts <ryan.roberts@....com> [251106 11:33]:
> On 06/11/2025 12:16, Garg, Shivank wrote:
> > Hi All,
> >
> > I've been investigating an issue with madvise(MADV_COLLAPSE) for TEXT pages
> > when CONFIG_READ_ONLY_THP_FOR_FS=y is enabled, and would like to discuss the
> > current behavior and improvements.
> >
> > Problem:
> > When attempting to collapse read-only file-backed TEXT sections into THPs
> > using madvise(MADV_COLLAPSE), the operation fails with EINVAL if the pages
> > are marked dirty.
> > madvise(aligned_start, aligned_size, MADV_COLLAPSE) -> returns -1 and errno = -22
> >
> > Subsequent calls to madvise(MADV_COLLAPSE) succeed because the first madvise
> > attempt triggers filemap_flush() which initiates async writeback of the dirty folios.
> >
> > Root Cause:
> > The failure occurs in mm/khugepaged.c:collapse_file():
> > } else if (folio_test_dirty(folio)) {
> > /*
> > * khugepaged only works on read-only fd,
> > * so this page is dirty because it hasn't
> > * been flushed since first write. There
> > * won't be new dirty pages.
> > *
> > * Trigger async flush here and hope the
> > * writeback is done when khugepaged
> > * revisits this page.
> > */
> > xas_unlock_irq(&xas);
> > filemap_flush(mapping);
> > result = SCAN_FAIL;
> > goto xa_unlocked;
> > }
> >
> > Why the text pages are dirty?
>
> This is the real question to to answer, I think...
Agree with Ryan here, let's stop things from being marked dirty if they
are not.
>
> What architecture are you running on?
>
>
> > It initially seemed unusual for a read-only text section to be marked as dirty, but
> > this was actually confirmed by /proc/pid/smaps.
> >
> > 55bc90200000-55bc91200000 r-xp 00400000 07:00 133 /mnt/xfs-mnt/large_binary_thp
> > Size: 16384 kB
> > KernelPageSize: 4 kB
> > MMUPageSize: 4 kB
> > Rss: 256 kB
> > Pss: 256 kB
> > Pss_Dirty: 256 kB
> > Shared_Clean: 0 kB
> > Shared_Dirty: 0 kB
> > Private_Clean: 0 kB
> > Private_Dirty: 256 kB
> >
> > /proc/pid/smaps (before calling MADV_COLLAPSE) showing Private_Dirty pages in r-xp mappings.
> > This may be due to dynamic linker and relocations that occurred during program loading.
>
> On arm64 at least, I wouldn't expect the text to be modified. Relocations should
> be handled in data. But given you have private dirty pages here, they must have
> been cow'ed and are therefore anonymous? In which case, where is writeback
> actually going?
>
> >
> > Reproduction using XFS/EXT4:
> >
> > 1. Compile a test binary with madvise(MADV_COLLAPSE), ensuring the load TEXT segment is
> > 2MB-aligned and sized to a multiple of 2MB.
> > Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
> > LOAD 0x400000 0x0000000000400000 0x0000000000400000 0x1000000 0x1000000 R E 0x200000
> >
> > 2. Create and mount the XFS/EXT4 fs:
> > dd if=/dev/zero of=/tmp/xfs-test.img bs=1M count=1024
> > losetup -f --show /tmp/xfs-test.img # output: /dev/loop0
> > mkfs.xfs -f /dev/loop0
> > mkdir -p /mnt/xfs-mnt
> > mount /dev/loop0 /mnt/xfs-mnt
> > 3. Copy the binaries to /mnt/xfs-mnt and execute.
> > 4. Returns -EINVAL on first run, then run successfully on subsequent run. (100% reproducible)
> > 5. To reproduce again; reboot/kexec and repeat from step 2.
> >
> > Workaround:
> > 1. Manually flush dirty pages before calling madvise(MADV_COLLAPSE):
> > int fd = open("/proc/self/exe", O_RDONLY);
> > if (fd >= 0) {
> > fsync(fd);
> > close(fd);
> > }
> > // Now madvise(MADV_COLLAPSE) succeeds
> > 2. Alternatively, retrying madvise_collapse on EINVAL failure also work.
> >
> > Problems with Current Behavior:
> > 1. Confusing Error Code: The syscall returns EINVAL which typically indicates invalid arguments
> > rather than a transient condition that could succeed on retry.
This is also an issue though. Reading the documentation on my system,
EINVAL with collapse has two meanings:
EINVAL addr is not page-aligned or length is negative.
EINVAL advice is not a valid.
Neither are right here.
EAGAIN seems to make sense, but the documentation would need to be
changed too:
EAGAIN A kernel resource was temporarily unavailable.
> >
> > 2. Non-Transparent Handling: Users are unaware they need to flush dirty pages manually. Current
> > madvise_collapse assumes the caller is khugepaged (as per code snippet comment) which will revisit
> > the page. However, when called via madvise(MADV_COLLAPSE), the userspace program typically don't
> > retry, making the async flush ineffective. Should we differentiate between madvise and khugepaged
> > behavior for MADV_COLLAPSE?
The collapse documentation states that it works on the existing state of
the system memory, so it is doing what it says but the EINVAL return on
dirty pages is not documented, afaict?
> >
> > Would appreciate thoughts on the best approach to address this issue.
> >
> > Thanks,
> > Shivank
>
Powered by blists - more mailing lists