Message-ID: <bc07aa10-d8d1-435f-9393-6c4ab63cc179@lucifer.local>
Date: Mon, 10 Nov 2025 13:22:16 +0000
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: "Garg, Shivank" <shivankg@....com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
        David Hildenbrand <david@...hat.com>, Zi Yan <ziy@...dia.com>,
        Baolin Wang <baolin.wang@...ux.alibaba.com>,
        "Liam R . Howlett" <Liam.Howlett@...cle.com>,
        Nico Pache <npache@...hat.com>, Ryan Roberts <ryan.roberts@....com>,
        Dev Jain <dev.jain@....com>, Barry Song <baohua@...nel.org>,
        Lance Yang <lance.yang@...ux.dev>,
        Steven Rostedt <rostedt@...dmis.org>,
        Masami Hiramatsu <mhiramat@...nel.org>,
        Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
        Zach O'Keefe <zokeefe@...gle.com>, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org, linux-trace-kernel@...r.kernel.org,
        Branden Moore <Branden.Moore@....com>
Subject: Re: [PATCH 1/2] mm/khugepaged: do synchronous writeback for
 MADV_COLLAPSE

On Mon, Nov 10, 2025 at 06:37:58PM +0530, Garg, Shivank wrote:
>
>
> On 11/10/2025 5:31 PM, Lorenzo Stoakes wrote:
> > On Mon, Nov 10, 2025 at 11:32:53AM +0000, Shivank Garg wrote:
> >> When MADV_COLLAPSE is called on file-backed mappings (e.g., executable
>
> >> ---
> >> Applies cleanly on:
> >> 6.18-rc5
> >> mm-stable:e9a6fb0bc
> >
> > Please base on mm-unstable. mm-stable is usually out of date until very close to
> > merge window.
>
> I'm observing issues when testing with kselftest on mm-unstable and mm-new branches that prevent
> proper testing for my patches:
>
> On mm-unstable (without my patches):
>
> # # running ./transhuge-stress -d 20
> # # --------------------------------
> # # TAP version 13
> # # 1..1
> # # transhuge-stress: allocate 220271 transhuge pages, using 440543 MiB virtual memory and 1720 MiB of ram
>
>
> [  367.225667] RIP: 0010:swap_cache_get_folio+0x2d/0xc0
> [  367.230635] Code: 00 00 48 89 f9 49 89 f9 48 89 fe 48 c1 e1 06 49 c1 e9 3a 48 c1 e9 0f 48 c1 e1 05 4a 8b 04 cd c0 2e 5b 99 48 8b 78 60 48 01 cf <48> 8b 47 08 48 85 c0 74 20 48 89 f2 81 e2 ff 01 00 00 48 8d 04 d0
> [  367.249378] RSP: 0000:ffffcde32943fba8 EFLAGS: 00010282
> [  367.254605] RAX: ffff8bd1668fdc00 RBX: 00007ffc15df5000 RCX: 00003fffffffffe0
> [  367.261736] RDX: ffffffff995cb530 RSI: 0003ffffffffffff RDI: ffffcbd1560dffe0
> [  367.268862] RBP: 0003ffffffffffff R08: ffffcde32943fc47 R09: 0000000000000000
> [  367.275994] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> [  367.283129] R13: 0000000000000000 R14: ffff8bd1668fdc00 R15: 0000000000100cca
> [  367.290260] FS:  00007ff600af5b80(0000) GS:ffff8c4e9ec7e000(0000) knlGS:0000000000000000
> [  367.298344] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  367.304083] CR2: ffffcbd1560dffe8 CR3: 00000001280e9001 CR4: 0000000000770ef0
> [  367.311216] PKRU: 55555554
> [  367.313929] Call Trace:
> [  367.316375]  <TASK>
> [  367.318479]  __read_swap_cache_async+0x8e/0x1b0
> [  367.323014]  swap_vma_readahead+0x23d/0x430
> [  367.327198]  swapin_readahead+0xb0/0xc0
> [  367.331039]  do_swap_page+0x5bc/0x1260
> [  367.334789]  ? rseq_ip_fixup+0x6f/0x190
> [  367.338631]  ? __pfx_default_wake_function+0x10/0x10
> [  367.343596]  __handle_mm_fault+0x49a/0x760
> [  367.347696]  handle_mm_fault+0x188/0x300
> [  367.351620]  do_user_addr_fault+0x15b/0x6c0
> [  367.355807]  exc_page_fault+0x60/0x100
> [  367.359562]  asm_exc_page_fault+0x22/0x30
> [  367.363574] RIP: 0033:0x7ff60091ba99
> [  367.367153] Code: f7 d8 64 89 02 b8 ff ff ff ff eb bd e8 40 c4 01 00 f3 0f 1e fa 80 3d b5 f5 0e 00 00 74 13 31 c0 0f 05 48 3d 00 f0 ff ff 77 4f <c3> 66 0f 1f 44 00 00 55 48 89 e5 48 83 ec 20 48 89 55 e8 48 89 75
> [  367.385897] RSP: 002b:00007ffc15df1118 EFLAGS: 00010203
> [  367.391124] RAX: 0000000000000001 RBX: 000055941fb672a0 RCX: 00007ff60091ba91
> [  367.398256] RDX: 0000000000000001 RSI: 000055941fb813e0 RDI: 0000000000000000
> [  367.405387] RBP: 00007ffc15df21e0 R08: 0000000000000000 R09: 0000000000000007
> [  367.412513] R10: 000055941fb97cb0 R11: 0000000000000246 R12: 000055941fb813e0
> [  367.419646] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> [  367.426781]  </TASK>
> [  367.428970] Modules linked in: xfrm_user xfrm_algo xt_addrtype xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 nft_compat nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables overlay bridge stp llc cfg80211 rfkill binfmt_misc ipmi_ssif amd_atl intel_rapl_msr intel_rapl_common wmi_bmof amd64_edac edac_mce_amd mgag200 rapl drm_client_lib i2c_algo_bit drm_shmem_helper drm_kms_helper acpi_cpufreq i2c_piix4 ptdma k10temp i2c_smbus wmi acpi_power_meter ipmi_si acpi_ipmi ipmi_devintf ipmi_msghandler sg dm_multipath drm fuse dm_mod nfnetlink ext4 crc16 mbcache jbd2 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq raid1 raid0 kvm_amd sd_mod ahci nvme libahci kvm libata nvme_core tg3 ccp megaraid_sas irqbypass
> [  367.497528] CR2: ffffcbd1560dffe8
> [  367.500846] ---[ end trace 0000000000000000 ]---

Yikes, oopsies!

I'll try running the tests locally on my Threadripper. I ran tests against your
patches previously and they seemed fine, which is strange. It may have been fixed
since, but let me try. Or maybe it's because swap is not enabled locally for me?

Likely this actually...

>
>
>
> -----------------
> On mm-new (without my patches):
>
> [  394.144770] get_swap_device: Bad swap offset entry 3ffffffffffff
>
> dmesg | grep "get_swap_device: Bad swap offset entry" | wc -l
> 359
>
>
> Additionally, kexec triggers an oops and crash during swapoff:
>
>
>          Deactivating swap swap.img.swap - /swap.img...
> [  704.854238] BUG: unable to handle page fault for address: ffffcbe2de8dffe8
> [  704.861524] #PF: supervisor read access in kernel mode
> [  704.866666] #PF: error_code(0x0000) - not-present page
> [  704.873253] PGD 0 P4D 0
> [  704.875790] Oops: Oops: 0000 [#1] SMP NOPTI
> [  704.881354] CPU: 122 UID: 0 PID: 107680 Comm: swapoff Kdump: loaded Not tainted 6.18.0-rc5+ #11 NONE
> [  704.891283] Hardware name: Dell Inc. PowerEdge R6525/024PW1, BIOS 2.16.2 07/09/2024
> [  704.898930] RIP: 0010:swap_cache_get_folio+0x2d/0xc0
> [  704.903907] Code: 00 00 48 89 f9 49 89 f9 48 89 fe 48 c1 e1 06 49 c1 e9 3a 48 c1 e9 0f 48 c1 e1 05 4a 8b 04 cd c0 2e 7b 95 48 8b 78 60 48 01 cf <48> 8b 47 08 48 85 c0 74 20 48 89 f2 81 e2 ff 01 00 00 48 8d 04 d0
> [  704.922720] RSP: 0018:ffffcf1227b1fc08 EFLAGS: 00010282
> [  704.928035] RAX: ffff8be2cefb3c00 RBX: 0000555c65a5c000 RCX: 00003fffffffffe0
> [  704.928036] RDX: 0003ffffffffffff RSI: 0003ffffffffffff RDI: ffffcbe2de8dffe0
> [  704.928037] RBP: 0000000000000000 R08: ffff8be2de8e0520 R09: 0000000000000000
>          Unmounting boot.mount - /boot...
> [  704.928038] R10: 000000000000ffff R11: ffffcf12236f4000 R12: ffff8be2d5b8d968
> [  704.928039] R13: 0003ffffffffffff R14: fffff3eec85eb000 R15: 0000555c65a51000
> [  704.928039] FS:  00007f41fcab3800(0000) GS:ffff8c602b6fe000(0000) knlGS:0000000000000000
> [  704.928040] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  704.928041] CR2: ffffcbe2de8dffe8 CR3: 00000074981af004 CR4: 0000000000770ef0
> [  704.928041] PKRU: 55555554
> [  704.928042] Call Trace:
> [  704.928043]  <TASK>
> [  704.928044]  unuse_pte_range+0x10b/0x290
> [  704.928047]  unuse_pud_range.isra.0+0x149/0x190
> [  704.928048]  unuse_vma+0x1a6/0x220
> [  704.928050]  unuse_mm+0x9b/0x110
> [  704.928052]  try_to_unuse+0xc5/0x260
> [  704.928053]  __do_sys_swapoff+0x244/0x670
> [  705.016662]  do_syscall_64+0x67/0xc50
> [  705.016667]  ? do_user_addr_fault+0x15b/0x6c0
> [  705.026100]  ? exc_page_fault+0x60/0x100
> [  705.031498]  ? irqentry_exit_to_user_mode+0x20/0xe0
> [  705.036377]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [  705.042200] RIP: 0033:0x7f41fc9271bb
> [  705.045780] Code: 0f 1e fa 48 83 fe 01 48 8b 15 59 bc 0d 00 19 c0 83 e0 f0 83 c0 26 64 89 02 b8 ff ff ff ff c3 f3 0f 1e fa b8 a8 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 2d bc 0d 00 f7 d8 64 89 01 48
> [  705.064807] RSP: 002b:00007ffd14b5b6e8 EFLAGS: 00000202 ORIG_RAX: 00000000000000a8
> [  705.064809] RAX: ffffffffffffffda RBX: 00007ffd14b5cf30 RCX: 00007f41fc9271bb
> [  705.064810] RDX: 0000000000000001 RSI: 0000000000000c00 RDI: 000055d48f533a40
> [  705.064810] RBP: 00007ffd14b5b750 R08: 00007f41fca03b20 R09: 0000000000000000
> [  705.064811] R10: 0000000000000001 R11: 0000000000000202 R12: 0000000000000000
> [  705.064811] R13: 0000000000000000 R14: 000055d4584f1479 R15: 000055d4584f2b20
> [  705.064813]  </TASK>
> [  705.064814] Modules linked in: xfrm_user xfrm_algo xt_addrtype xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 nft_compat nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables overlay bridge stp llc cfg80211 rfkill binfmt_misc ipmi_ssif amd_atl intel_rapl_msr intel_rapl_common wmi_bmof amd64_edac edac_mce_amd rapl mgag200 drm_client_lib i2c_algo_bit drm_shmem_helper drm_kms_helper acpi_cpufreq i2c_piix4 ptdma ipmi_si k10temp i2c_smbus acpi_power_meter wmi acpi_ipmi ipmi_msghandler sg dm_multipath fuse drm dm_mod nfnetlink ext4 crc16 mbcache jbd2 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq raid1 raid0 sd_mod kvm_amd ahci libahci kvm nvme tg3 libata ccp irqbypass nvme_core megaraid_sas [last unloaded: ipmi_devintf]
> [  705.180420] CR2: ffffcbe2de8dffe8
> [  705.183852] ---[ end trace 0000000000000000 ]---
>
>
> I haven't had cycles to dig into this yet and have been swamped with other things.

Fully understand, I'm _very_ familiar with this situation :)

I need more cores... ;)

>
>
> >> +	if (!is_shmem && cc && !cc->is_khugepaged && mapping_can_writeback(mapping)) {
> >> +		range_start = (loff_t)start << PAGE_SHIFT;
> >> +		range_end = ((loff_t)end << PAGE_SHIFT) - 1;
> >> +		if (filemap_write_and_wait_range(mapping, range_start, range_end)) {
> >> +			result = SCAN_FAIL;
> >> +			goto out;
> >> +		}
> >> +	}
> >
> > I feel this is the wrong level of abstraction.
> >
> > We explicitly invoke this both from khugepaged and madvise(..., MADV_COLLAPSE):
> >
> >
> > khugepaged_scan_mm_slot() / madvise_collapse()
> > -> hpage_collapse_scan_file()
> > -> collapse_file()
> >
> > ofc you are addressing this with the !cc->is_khugepaged, but feels like we'd be
> > better off just doing it in madvise_collapse().
> >
> > I wonder also if doing I/O without getting the mmap lock again and revalidating
> > is wise, as the state of things might have changed significantly.
> >
> > So maybe need a hugepage_vma_revalidate() as well?
>
> Thanks for the feedback. I'll incorporate these comments for v2.

Thanks!
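
For anyone following along, the range arithmetic in the quoted hunk can be
checked in userspace. This is only a minimal sketch, assuming 4 KiB base pages
(a page shift of 12) and that 'end' is an exclusive page index, with the
resulting byte range being inclusive as filemap_write_and_wait_range() expects.
The names SKETCH_PAGE_SHIFT, struct byte_range and collapse_byte_range() are
made up for illustration and are not kernel code:

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB base pages, as on x86-64; the kernel gets this from the arch. */
#define SKETCH_PAGE_SHIFT 12

/* Inclusive byte range, the form filemap_write_and_wait_range() takes. */
struct byte_range {
	int64_t first;	/* offset of the first byte in the range */
	int64_t last;	/* offset of the last byte in the range (inclusive) */
};

/*
 * Hypothetical helper: convert the page-index range [start, end) into an
 * inclusive byte range, mirroring the arithmetic in the quoted hunk.
 */
static struct byte_range collapse_byte_range(uint64_t start, uint64_t end)
{
	struct byte_range r;

	assert(end > start);
	/* Widen before shifting, as the (loff_t) casts do in the hunk. */
	r.first = (int64_t)start << SKETCH_PAGE_SHIFT;
	/* 'end' is exclusive, so the last byte is one before the end page starts. */
	r.last = ((int64_t)end << SKETCH_PAGE_SHIFT) - 1;
	return r;
}
```

With a 2 MiB PMD (512 base pages), start = 512 and end = 1024 give bytes
2097152 through 4194303, i.e. exactly the second 2 MiB of the file.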

>
> Thanks,
> Shivank
>
>

Cheers, Lorenzo