Message-ID: <0b84865c-5b23-4be6-9902-af9d5e63c182@amd.com>
Date: Fri, 7 Nov 2025 14:21:32 +0530
From: "Garg, Shivank" <shivankg@....com>
To: "David Hildenbrand (Red Hat)" <davidhildenbrandkernel@...il.com>,
Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
"Liam R. Howlett" <Liam.Howlett@...cle.com>,
Ryan Roberts <ryan.roberts@....com>,
Andrew Morton <akpm@...ux-foundation.org>, Zi Yan <ziy@...dia.com>,
Baolin Wang <baolin.wang@...ux.alibaba.com>, Nico Pache <npache@...hat.com>,
Dev Jain <dev.jain@....com>, Barry Song <baohua@...nel.org>,
Lance Yang <lance.yang@...ux.dev>, Vlastimil Babka <vbabka@...e.cz>,
Jann Horn <jannh@...gle.com>, zokeefe@...gle.com, linux-mm@...ck.org,
linux-kernel@...r.kernel.org
Subject: Re: madvise(MADV_COLLAPSE) fails with EINVAL on dirty file-backed
text pages
On 11/7/2025 2:35 AM, David Hildenbrand (Red Hat) wrote:
> On 06.11.25 18:17, Lorenzo Stoakes wrote:
>> On Thu, Nov 06, 2025 at 11:55:05AM -0500, Liam R. Howlett wrote:
>>> * Ryan Roberts <ryan.roberts@....com> [251106 11:33]:
>>>> On 06/11/2025 12:16, Garg, Shivank wrote:
>>>>> Hi All,
Hi all,
Thank you for the quick responses and suggestions!
Information requested in this thread:
1. Architecture: X86_64
2. I want to emphasize that the error occurs specifically on a fresh mount after copying the binary.
The binary can be either freshly compiled or previously compiled; the key factor is the fresh
mount and copy operation.
3. Workaround:
I'm calling fsync(fd) from inside the executable before madvise().
Alternatively, I just confirmed that running sync from the shell after copying the binary
also works, as it clears the Private_Dirty pages shown in smaps.
4. readelf --wide --segments large_binary_thp_s_withoutfsync
Elf file type is DYN (Position-Independent Executable file)
Entry point 0x4012e0
There are 13 program headers, starting at offset 64
Program Headers:
Type           Offset    VirtAddr           PhysAddr           FileSiz   MemSiz    Flg Align
PHDR           0x000040  0x0000000000000040 0x0000000000000040 0x0002d8  0x0002d8  R   0x8
INTERP         0x000318  0x0000000000000318 0x0000000000000318 0x00001c  0x00001c  R   0x1
      [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
LOAD           0x000000  0x0000000000000000 0x0000000000000000 0x24aa38  0x24aa38  R   0x1000
LOAD           0x400000  0x0000000000400000 0x0000000000400000 0x1000000 0x1000000 R E 0x200000
LOAD           0x1400000 0x0000000001400000 0x0000000001400000 0x53c750  0x53c750  R   0x1000
LOAD           0x193cd10 0x000000000193dd10 0x000000000193dd10 0x0c3810  0x0c3820  RW  0x1000
DYNAMIC        0x193cd28 0x000000000193dd28 0x000000000193dd28 0x0001f0  0x0001f0  RW  0x8
NOTE           0x000338  0x0000000000000338 0x0000000000000338 0x000030  0x000030  R   0x8
NOTE           0x000368  0x0000000000000368 0x0000000000000368 0x000044  0x000044  R   0x4
GNU_PROPERTY   0x000338  0x0000000000000338 0x0000000000000338 0x000030  0x000030  R   0x8
GNU_EH_FRAME   0x156bc5c 0x000000000156bc5c 0x000000000156bc5c 0x0c356c  0x0c356c  R   0x4
GNU_STACK      0x000000  0x0000000000000000 0x0000000000000000 0x000000  0x000000  RW  0x10
GNU_RELRO      0x193cd10 0x000000000193dd10 0x000000000193dd10 0x0002f0  0x0002f0  R   0x1
Section to Segment mapping:
Segment Sections...
00
01 .interp
02 .interp .note.gnu.property .note.gnu.build-id .note.ABI-tag .gnu.hash .dynsym .dynstr .gnu.version .gnu.version_r .rela.dyn .rela.plt
03 .align_load_begin .init .plt .plt.got .plt.sec .text .fini .align_load_end
04 .rodata .eh_frame_hdr .eh_frame
05 .init_array .fini_array .dynamic .got .data .bss
06 .dynamic
07 .note.gnu.property
08 .note.gnu.build-id .note.ABI-tag
09 .note.gnu.property
10 .eh_frame_hdr
11
12 .init_array .fini_array .dynamic .got
4. Logs from --- Before Collapse ---
smaps:
55d436a00000-55d437a00000 r-xp 00400000 07:00 135 /mnt/xfs-mnt/large_binary_thp_s_withoutfsync
Size: 16384 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Rss: 256 kB
Pss: 256 kB
Pss_Dirty: 256 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 256 kB
Referenced: 256 kB
Anonymous: 0 kB
KSM: 0 kB
LazyFree: 0 kB
AnonHugePages: 0 kB
ShmemPmdMapped: 0 kB
FilePmdMapped: 0 kB
Shared_Hugetlb: 0 kB
Private_Hugetlb: 0 kB
Swap: 0 kB
SwapPss: 0 kB
Locked: 0 kB
THPeligible: 0
ProtectionKey: 0
VmFlags: rd ex mr mw me sd
numa_maps:
55d436a00000 default file=/mnt/xfs-mnt/large_binary_thp_s_withoutfsync dirty=64 active=0 N1=64 kernelpagesize_kB=4
Additional logs inside the kernel:
[ 129.257258] collapse_file: ENTER addr=55d436a00000 start=1024 end=1536 is_shmem=0
[ 129.257266] collapse_file: allocated new_folio successfully
[ 129.257267] collapse_file: XArray slots created, starting page scan
[ 129.257268] collapse_file: scanning index=1024 folio=00000000be1a13db
[ 129.257270] collapse_file: folio_test_dirty index=1024
[ 129.257271] folio=00000000be1a13db, flags=0x57ffffc8000078
[ 129.257272] mapping=000000004df7b047, inode=000000003395e5a1
[ 129.257273] folio_test_large=1
[ 129.257273] inode mode=0100755, i_writecount=-1 inode_is_open_for_write(inode)=0
[ 129.257279] VMA #2: 000055d436a00000-000055d437a00000 flags=0x8000075 PID=5268 comm=large_binary_th <-- CONTAINS DIRTY FOLIO
Perms: r-xp MAYWRITE MAYEXEC
[ 129.257281] File offset range: 0x400000 - 0x1400000
[ 129.257282] Page index range: 1024 - 5120
[ 129.257289] Total VMAs: 5, Writable VMAs: 0
[ 129.257290] Page details:
[ 129.257290] PG_dirty=1
[ 129.257290] PG_writeback=0
[ 129.257291] PG_uptodate=1
[ 129.257291] PG_locked=0
[ 129.257292] refcount=64
[ 129.257292] mapcount=32
[ 129.260652] collapse_file: folio_test_dirty FAILED index=1024
[ 129.260655] collapse_file: FAILED result=0, going to rollback
[ 129.260656] collapse_file: ROLLBACK result=0
[ 129.260661] collapse_file: EXIT result=0
[ 129.260661] collapse_file 0
[ 129.260662] default 0
[ 129.260663] madvise_collapse_errno: -22 last_fail: 0
[ 129.260665] thps 0 ((hend - hstart) >> HPAGE_PMD_SHIFT) 8
Note: result=0 is SCAN_FAIL
After the failure on the first attempt, running the executable again succeeds:
-- success run --
Region is 0x56185f800000 to 0x561860800000 - length 16777216
56185f800000-561860800000 r-xp 00400000 07:00 135 /mnt/xfs-mnt/large_binary_thp_s_withoutfsync
Size: 16384 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Rss: 256 kB
Pss: 256 kB
Pss_Dirty: 0 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 256 kB
Private_Dirty: 0 kB
Referenced: 256 kB
Anonymous: 0 kB
KSM: 0 kB
LazyFree: 0 kB
AnonHugePages: 0 kB
ShmemPmdMapped: 0 kB
FilePmdMapped: 0 kB
Shared_Hugetlb: 0 kB
Private_Hugetlb: 0 kB
Swap: 0 kB
SwapPss: 0 kB
Locked: 0 kB
THPeligible: 0
ProtectionKey: 0
VmFlags: rd ex mr mw me sd
56185f800000 default file=/mnt/xfs-mnt/large_binary_thp_s_withoutfsync mapped=64 active=0 N1=64 kernelpagesize_kB=4
Start: 0x56185f800000
End: 0x561860800000
Size: 16777216 bytes (16.00 MB)
Hugepages: 8 x 2MB
Calling madvise(MADV_COLLAPSE)...
Successfully collapsed text section into hugepages!
5. Yes, I'm calling madvise(MADV_COLLAPSE) on the text portion of the executable, using the address
range obtained from /proc/self/maps. IIUC, this should benefit applications by reducing ITLB pressure.
I agree with the suggestions to either return EAGAIN instead of EINVAL or, at minimum, document the
EINVAL return for dirty pages. I'm happy to work on a patch.
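For reference, here is a minimal sketch of the flow described in items 3 and 5. It is not the exact test program. It finds the r-xp mapping of the binary in /proc/self/maps, calls fsync() on the file so no dirty page-cache folios remain, and then calls madvise(MADV_COLLAPSE). The helper names (pmd_units, find_text_range, collapse_text), the read-only open() of the binary path, and the 2 MiB PMD size are assumptions made for illustration.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25 /* value from the Linux uapi headers; older libc headers may lack it */
#endif

#define HPAGE_PMD_SIZE (2UL << 20) /* assumption: 2 MiB PMD-sized hugepages (x86_64) */

/* Count the whole PMD-sized, PMD-aligned units inside [start, end). */
unsigned long pmd_units(uintptr_t start, uintptr_t end)
{
    uintptr_t s = (start + HPAGE_PMD_SIZE - 1) & ~(HPAGE_PMD_SIZE - 1);
    uintptr_t e = end & ~(HPAGE_PMD_SIZE - 1);

    return e > s ? (e - s) / HPAGE_PMD_SIZE : 0;
}

/* Find the first r-xp mapping whose /proc/self/maps line contains `path`.
 * Returns 0 and fills start/end on success, -1 otherwise. */
int find_text_range(const char *path, uintptr_t *start, uintptr_t *end)
{
    FILE *f = fopen("/proc/self/maps", "r");
    char line[4096];
    int found = -1;

    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        unsigned long s, e;
        char perms[5];

        if (sscanf(line, "%lx-%lx %4s", &s, &e, perms) != 3)
            continue;
        if (strcmp(perms, "r-xp") == 0 && strstr(line, path)) {
            *start = s;
            *end = e;
            found = 0;
            break;
        }
    }
    fclose(f);
    return found;
}

/* Flush the file, then ask the kernel to collapse its text mapping.
 * Returns 0 on success or a negative errno value. */
int collapse_text(const char *path)
{
    uintptr_t start, end;
    int fd, err = 0;

    if (find_text_range(path, &start, &end) != 0)
        return -ENOENT;

    /* Assumption: a read-only fd is enough for fsync() on Linux. */
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -errno;
    if (fsync(fd) != 0)
        err = -errno;
    close(fd);
    if (err)
        return err;

    if (madvise((void *)start, end - start, MADV_COLLAPSE) != 0)
        return -errno; /* -EINVAL was the result seen with dirty folios */
    return 0;
}
```

With the range from the first run, pmd_units(0x55d436a00000, 0x55d437a00000) gives 8, matching the "Hugepages: 8 x 2MB" line in the log. A caller could pass the path resolved from /proc/self/exe to collapse_text() and treat a negative return as "collapse not done, keep running on 4 kB pages".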
Please let me know if any other information is needed for debugging.
Thanks,
Shivank