Message-ID: <0b84865c-5b23-4be6-9902-af9d5e63c182@amd.com>
Date: Fri, 7 Nov 2025 14:21:32 +0530
From: "Garg, Shivank" <shivankg@....com>
To: "David Hildenbrand (Red Hat)" <davidhildenbrandkernel@...il.com>,
 Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
 "Liam R. Howlett" <Liam.Howlett@...cle.com>,
 Ryan Roberts <ryan.roberts@....com>,
 Andrew Morton <akpm@...ux-foundation.org>, Zi Yan <ziy@...dia.com>,
 Baolin Wang <baolin.wang@...ux.alibaba.com>, Nico Pache <npache@...hat.com>,
 Dev Jain <dev.jain@....com>, Barry Song <baohua@...nel.org>,
 Lance Yang <lance.yang@...ux.dev>, Vlastimil Babka <vbabka@...e.cz>,
 Jann Horn <jannh@...gle.com>, zokeefe@...gle.com, linux-mm@...ck.org,
 linux-kernel@...r.kernel.org
Subject: Re: madvise(MADV_COLLAPSE) fails with EINVAL on dirty file-backed
 text pages



On 11/7/2025 2:35 AM, David Hildenbrand (Red Hat) wrote:
> On 06.11.25 18:17, Lorenzo Stoakes wrote:
>> On Thu, Nov 06, 2025 at 11:55:05AM -0500, Liam R. Howlett wrote:
>>> * Ryan Roberts <ryan.roberts@....com> [251106 11:33]:
>>>> On 06/11/2025 12:16, Garg, Shivank wrote:
>>>>> Hi All,


Hi all,

Thank you for the quick responses and suggestions!
Here is the information requested in this thread:
1. Architecture: X86_64

2. I want to emphasize that the error occurs specifically on a fresh mount after copying the binary.
   The binary can be either freshly compiled or previously compiled. The key factor is the fresh
   mount followed by the copy operation.

3. Workaround:
   I'm calling fsync(fd) from inside the executable before madvise().
   Alternatively, I just found that running sync from the shell after copying the binary
   also works, as it clears the Private_Dirty pages shown in smaps.

4. readelf --wide --segments large_binary_thp_s_withoutfsync
Elf file type is DYN (Position-Independent Executable file)
Entry point 0x4012e0
There are 13 program headers, starting at offset 64

Program Headers:
  Type           Offset   VirtAddr           PhysAddr           FileSiz  MemSiz   Flg Align
  PHDR           0x000040 0x0000000000000040 0x0000000000000040 0x0002d8 0x0002d8 R   0x8
  INTERP         0x000318 0x0000000000000318 0x0000000000000318 0x00001c 0x00001c R   0x1
      [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
  LOAD           0x000000 0x0000000000000000 0x0000000000000000 0x24aa38 0x24aa38 R   0x1000
  LOAD           0x400000 0x0000000000400000 0x0000000000400000 0x1000000 0x1000000 R E 0x200000
  LOAD           0x1400000 0x0000000001400000 0x0000000001400000 0x53c750 0x53c750 R   0x1000
  LOAD           0x193cd10 0x000000000193dd10 0x000000000193dd10 0x0c3810 0x0c3820 RW  0x1000
  DYNAMIC        0x193cd28 0x000000000193dd28 0x000000000193dd28 0x0001f0 0x0001f0 RW  0x8
  NOTE           0x000338 0x0000000000000338 0x0000000000000338 0x000030 0x000030 R   0x8
  NOTE           0x000368 0x0000000000000368 0x0000000000000368 0x000044 0x000044 R   0x4
  GNU_PROPERTY   0x000338 0x0000000000000338 0x0000000000000338 0x000030 0x000030 R   0x8
  GNU_EH_FRAME   0x156bc5c 0x000000000156bc5c 0x000000000156bc5c 0x0c356c 0x0c356c R   0x4
  GNU_STACK      0x000000 0x0000000000000000 0x0000000000000000 0x000000 0x000000 RW  0x10
  GNU_RELRO      0x193cd10 0x000000000193dd10 0x000000000193dd10 0x0002f0 0x0002f0 R   0x1

 Section to Segment mapping:
  Segment Sections...
   00
   01     .interp
   02     .interp .note.gnu.property .note.gnu.build-id .note.ABI-tag .gnu.hash .dynsym .dynstr .gnu.version .gnu.version_r .rela.dyn .rela.plt
   03     .align_load_begin .init .plt .plt.got .plt.sec .text .fini .align_load_end
   04     .rodata .eh_frame_hdr .eh_frame
   05     .init_array .fini_array .dynamic .got .data .bss
   06     .dynamic
   07     .note.gnu.property
   08     .note.gnu.build-id .note.ABI-tag
   09     .note.gnu.property
   10     .eh_frame_hdr
   11
   12     .init_array .fini_array .dynamic .got

4. Logs from --- Before Collapse --- 

smaps:
55d436a00000-55d437a00000 r-xp 00400000 07:00 135                        /mnt/xfs-mnt/large_binary_thp_s_withoutfsync
Size:              16384 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:                 256 kB
Pss:                 256 kB
Pss_Dirty:           256 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:       256 kB
Referenced:          256 kB
Anonymous:             0 kB
KSM:                   0 kB
LazyFree:              0 kB
AnonHugePages:         0 kB
ShmemPmdMapped:        0 kB
FilePmdMapped:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB
THPeligible:           0
ProtectionKey:         0
VmFlags: rd ex mr mw me sd

numa_maps:
55d436a00000 default file=/mnt/xfs-mnt/large_binary_thp_s_withoutfsync dirty=64 active=0 N1=64 kernelpagesize_kB=4

Additional logs inside the kernel:
[  129.257258] collapse_file: ENTER addr=55d436a00000 start=1024 end=1536 is_shmem=0
[  129.257266] collapse_file: allocated new_folio successfully
[  129.257267] collapse_file: XArray slots created, starting page scan
[  129.257268] collapse_file: scanning index=1024 folio=00000000be1a13db
[  129.257270] collapse_file: folio_test_dirty index=1024
[  129.257271]   folio=00000000be1a13db, flags=0x57ffffc8000078
[  129.257272]   mapping=000000004df7b047, inode=000000003395e5a1
[  129.257273]   folio_test_large=1
[  129.257273]   inode mode=0100755, i_writecount=-1 inode_is_open_for_write(inode)=0

[  129.257279]   VMA #2: 000055d436a00000-000055d437a00000 flags=0x8000075 PID=5268 comm=large_binary_th <-- CONTAINS DIRTY FOLIO
                Perms: r-xp  MAYWRITE MAYEXEC
[  129.257281]     File offset range: 0x400000 - 0x1400000
[  129.257282]     Page index range: 1024 - 5120

[  129.257289]   Total VMAs: 5, Writable VMAs: 0
[  129.257290]   Page details:
[  129.257290]     PG_dirty=1
[  129.257290]     PG_writeback=0
[  129.257291]     PG_uptodate=1
[  129.257291]     PG_locked=0
[  129.257292]     refcount=64
[  129.257292]     mapcount=32
[  129.260652] collapse_file: folio_test_dirty FAILED index=1024
[  129.260655] collapse_file: FAILED result=0, going to rollback
[  129.260656] collapse_file: ROLLBACK result=0
[  129.260661] collapse_file: EXIT result=0 
[  129.260661] collapse_file 0
[  129.260662] default 0
[  129.260663] madvise_collapse_errno: -22 last_fail: 0
[  129.260665] thps 0 ((hend - hstart) >> HPAGE_PMD_SHIFT) 8

Note: result=0 is SCAN_FAIL

After the failure on the first attempt, running the executable again succeeds:

-- success run --
Region is 0x56185f800000 to 0x561860800000 - length 16777216
56185f800000-561860800000 r-xp 00400000 07:00 135                        /mnt/xfs-mnt/large_binary_thp_s_withoutfsync
Size:              16384 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:                 256 kB
Pss:                 256 kB
Pss_Dirty:             0 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:       256 kB
Private_Dirty:         0 kB
Referenced:          256 kB
Anonymous:             0 kB
KSM:                   0 kB
LazyFree:              0 kB
AnonHugePages:         0 kB
ShmemPmdMapped:        0 kB
FilePmdMapped:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB
THPeligible:           0
ProtectionKey:         0
VmFlags: rd ex mr mw me sd

56185f800000 default file=/mnt/xfs-mnt/large_binary_thp_s_withoutfsync mapped=64 active=0 N1=64 kernelpagesize_kB=4

  Start: 0x56185f800000
  End:   0x561860800000
  Size:  16777216 bytes (16.00 MB)
  Hugepages: 8 x 2MB

Calling madvise(MADV_COLLAPSE)...
Successfully collapsed text section into hugepages!


5. Yes, I'm calling madvise(MADV_COLLAPSE) on the text portion of the executable, using the address
   range obtained from /proc/self/maps. IIUC, this should benefit applications by reducing ITLB pressure.

I agree with the suggestions to either return EAGAIN instead of EINVAL or, at minimum, document the
EINVAL return for dirty pages. I'm happy to work on a patch.

Please let me know if any other information is needed for debugging.

Thanks,
Shivank
