Message-ID: <CAMgjq7CcSZf0be+OwttyzC3ZQJWZPOtDK1AtXvaPkuVuVk-XOg@mail.gmail.com>
Date: Sun, 20 Apr 2025 00:35:26 +0800
From: Kairui Song <ryncsn@...il.com>
To: David Hildenbrand <david@...hat.com>
Cc: linux-kernel@...r.kernel.org, linux-doc@...r.kernel.org, 
	cgroups@...r.kernel.org, linux-mm@...ck.org, linux-fsdevel@...r.kernel.org, 
	linux-api@...r.kernel.org, Andrew Morton <akpm@...ux-foundation.org>, 
	"Matthew Wilcox (Oracle)" <willy@...radead.org>, Tejun Heo <tj@...nel.org>, Zefan Li <lizefan.x@...edance.com>, 
	Johannes Weiner <hannes@...xchg.org>, Michal Koutný <mkoutny@...e.com>, 
	Jonathan Corbet <corbet@....net>, Andy Lutomirski <luto@...nel.org>, Thomas Gleixner <tglx@...utronix.de>, 
	Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>, 
	Dave Hansen <dave.hansen@...ux.intel.com>, Muchun Song <muchun.song@...ux.dev>, 
	"Liam R. Howlett" <Liam.Howlett@...cle.com>, Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, 
	Vlastimil Babka <vbabka@...e.cz>, Jann Horn <jannh@...gle.com>
Subject: Re: [PATCH v3 13/20] mm: Copy-on-Write (COW) reuse support for
 PTE-mapped THP

On Sun, Apr 20, 2025 at 12:32 AM David Hildenbrand <david@...hat.com> wrote:
>
> On 19.04.25 18:25, David Hildenbrand wrote:
> > On 19.04.25 18:02, Kairui Song wrote:
> >> On Tue, Mar 4, 2025 at 12:46 AM David Hildenbrand <david@...hat.com> wrote:
> >>>
> >>> Currently, we never end up reusing PTE-mapped THPs after fork. This
> >>> wasn't really a problem with PMD-sized THPs, because they would have to
> >>> be PTE-mapped first, but it's becoming a problem with smaller THP
> >>> sizes that are effectively always PTE-mapped.
> >>>
> >>> With our new "mapped exclusively" vs "maybe mapped shared" logic for
> >>> large folios, implementing CoW reuse for PTE-mapped THPs is
> >>> straightforward: if exclusively mapped, make sure that all references
> >>> are from these (our) mappings. Add some helpful comments to explain
> >>> the details.
> >>>
> >>> CONFIG_TRANSPARENT_HUGEPAGE selects CONFIG_MM_ID. If we spot an anon
> >>> large folio without CONFIG_TRANSPARENT_HUGEPAGE in that code, something
> >>> is seriously messed up.
> >>>
> >>> There are plenty of things we can optimize in the future: For example, we
> >>> could remember that the folio is fully exclusive so we could speedup
> >>> the next fault further. Also, we could try "faulting around", turning
> >>> surrounding PTEs that map the same folio writable. But especially the
> >>> latter might increase COW latency, so it would need further
> >>> investigation.
> >>>
> >>> Signed-off-by: David Hildenbrand <david@...hat.com>
> >>> ---
> >>>    mm/memory.c | 83 +++++++++++++++++++++++++++++++++++++++++++++++------
> >>>    1 file changed, 75 insertions(+), 8 deletions(-)
> >>>
> >>> diff --git a/mm/memory.c b/mm/memory.c
> >>> index 73b783c7d7d51..bb245a8fe04bc 100644
> >>> --- a/mm/memory.c
> >>> +++ b/mm/memory.c
> >>> @@ -3729,19 +3729,86 @@ static vm_fault_t wp_page_shared(struct vm_fault *vmf, struct folio *folio)
> >>>           return ret;
> >>>    }
> >>>
> >>> -static bool wp_can_reuse_anon_folio(struct folio *folio,
> >>> -                                   struct vm_area_struct *vma)
> >>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >>> +static bool __wp_can_reuse_large_anon_folio(struct folio *folio,
> >>> +               struct vm_area_struct *vma)
> >>>    {
> >>> +       bool exclusive = false;
> >>> +
> >>> +       /* Let's just free up a large folio if only a single page is mapped. */
> >>> +       if (folio_large_mapcount(folio) <= 1)
> >>> +               return false;
> >>> +
> >>>           /*
> >>> -        * We could currently only reuse a subpage of a large folio if no
> >>> -        * other subpages of the large folios are still mapped. However,
> >>> -        * let's just consistently not reuse subpages even if we could
> >>> -        * reuse in that scenario, and give back a large folio a bit
> >>> -        * sooner.
> >>> +        * The assumption for anonymous folios is that each page can only get
> >>> +        * mapped once into each MM. The only exception are KSM folios, which
> >>> +        * are always small.
> >>> +        *
> >>> +        * Each taken mapcount must be paired with exactly one taken reference,
> >>> +        * whereby the refcount must be incremented before the mapcount when
> >>> +        * mapping a page, and the refcount must be decremented after the
> >>> +        * mapcount when unmapping a page.
> >>> +        *
> >>> +        * If all folio references are from mappings, and all mappings are in
> >>> +        * the page tables of this MM, then this folio is exclusive to this MM.
> >>>            */
> >>> -       if (folio_test_large(folio))
> >>> +       if (folio_test_large_maybe_mapped_shared(folio))
> >>> +               return false;
> >>> +
> >>> +       VM_WARN_ON_ONCE(folio_test_ksm(folio));
> >>> +       VM_WARN_ON_ONCE(folio_mapcount(folio) > folio_nr_pages(folio));
> >>> +       VM_WARN_ON_ONCE(folio_entire_mapcount(folio));
> >>> +
> >>> +       if (unlikely(folio_test_swapcache(folio))) {
> >>> +               /*
> >>> +                * Note: freeing up the swapcache will fail if some PTEs are
> >>> +                * still swap entries.
> >>> +                */
> >>> +               if (!folio_trylock(folio))
> >>> +                       return false;
> >>> +               folio_free_swap(folio);
> >>> +               folio_unlock(folio);
> >>> +       }
> >>> +
> >>> +       if (folio_large_mapcount(folio) != folio_ref_count(folio))
> >>>                   return false;
> >>>
> >>> +       /* Stabilize the mapcount vs. refcount and recheck. */
> >>> +       folio_lock_large_mapcount(folio);
> >>> +       VM_WARN_ON_ONCE(folio_large_mapcount(folio) < folio_ref_count(folio));
> >>
> >> Hi David, I'm seeing this WARN_ON being triggered on my test machine:
> >
> > Hi!
> >
> > So I assume the following will not sort out the issue for you, correct?
> >
> > https://lore.kernel.org/all/20250415095007.569836-1-david@redhat.com/T/#u
> >
> >>
> >> I'm currently working on my swap table series and testing heavily with
> >> swap-related workloads. I thought my patch might have broken the kernel,
> >> but after more investigation and reverting to current mm-unstable, it
> >> still occurs (with a much lower chance though; I think my series
> >> changed the timing, so it's more frequent in my case).
> >>
> >> The test is simple: I just enable all mTHP sizes and repeatedly build
> >> the Linux kernel in a 1G memcg using tmpfs.
> >>
> >> The WARN is reproducible with current mm-unstable
> >> (dc683247117ee018e5da6b04f1c499acdc2a1418):
> >>
> >> [ 5268.100379] ------------[ cut here ]------------
> >> [ 5268.105925] WARNING: CPU: 2 PID: 700274 at mm/memory.c:3792
> >> do_wp_page+0xfc5/0x1080
> >> [ 5268.112437] Modules linked in: zram virtiofs
> >> [ 5268.115507] CPU: 2 UID: 0 PID: 700274 Comm: cc1 Kdump: loaded Not
> >> tainted 6.15.0-rc2.ptch-gdc683247117e #1434 PREEMPT(voluntary)
> >> [ 5268.120562] Hardware name: Red Hat KVM/RHEL-AV, BIOS 0.0.0 02/06/2015
> >> [ 5268.123025] RIP: 0010:do_wp_page+0xfc5/0x1080
> >> [ 5268.124807] Code: 0d 80 77 32 02 0f 85 3e f1 ff ff 0f 1f 44 00 00
> >> e9 34 f1 ff ff 48 0f ba 75 00 1f 65 ff 0d 63 77 32 02 0f 85 21 f1 ff
> >> ff eb e1 <0f> 0b e9 10 fd ff ff 65 ff 00 f0 48 0f b
> >> a 6d 00 1f 0f 83 ec fc ff
> >> [ 5268.132034] RSP: 0000:ffffc900234efd48 EFLAGS: 00010297
> >> [ 5268.134002] RAX: 0000000000000080 RBX: 0000000000000000 RCX: 000fffffffe00000
> >> [ 5268.136609] RDX: 0000000000000081 RSI: 00007f009cbad000 RDI: ffffea0012da0000
> >> [ 5268.139371] RBP: ffffea0012da0068 R08: 80000004b682d025 R09: 00007f009c7c0000
> >> [ 5268.142183] R10: ffff88839c48b8c0 R11: 0000000000000000 R12: ffff88839c48b8c0
> >> [ 5268.144738] R13: ffffea0012da0000 R14: 00007f009cbadf10 R15: ffffc900234efdd8
> >> [ 5268.147540] FS:  00007f009d1fdac0(0000) GS:ffff88a07ae14000(0000)
> >> knlGS:0000000000000000
> >> [ 5268.150715] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> [ 5268.153270] CR2: 00007f009cbadf10 CR3: 000000016c7c0001 CR4: 0000000000770eb0
> >> [ 5268.155674] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> >> [ 5268.158100] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> >> [ 5268.160613] PKRU: 55555554
> >> [ 5268.161662] Call Trace:
> >> [ 5268.162609]  <TASK>
> >> [ 5268.163438]  ? ___pte_offset_map+0x1b/0x110
> >> [ 5268.165309]  __handle_mm_fault+0xa51/0xf00
> >> [ 5268.166848]  ? update_load_avg+0x80/0x760
> >> [ 5268.168376]  handle_mm_fault+0x13d/0x360
> >> [ 5268.169930]  do_user_addr_fault+0x2f2/0x7f0
> >> [ 5268.171630]  exc_page_fault+0x6a/0x140
> >> [ 5268.173278]  asm_exc_page_fault+0x26/0x30
> >> [ 5268.174866] RIP: 0033:0x120e8e4
> >> [ 5268.176272] Code: 84 a9 00 00 00 48 39 c3 0f 85 ae 00 00 00 48 8b
> >> 43 20 48 89 45 38 48 85 c0 0f 85 b7 00 00 00 48 8b 43 18 48 8b 15 6c
> >> 08 42 01 <0f> 11 43 10 48 89 1d 61 08 42 01 48 89 53 18 0f 11 03 0f 11
> >> 43 20
> >> [ 5268.184121] RSP: 002b:00007fff8a855160 EFLAGS: 00010246
> >> [ 5268.186343] RAX: 00007f009cbadbd0 RBX: 00007f009cbadf00 RCX: 0000000000000000
> >> [ 5268.189209] RDX: 00007f009cbba030 RSI: 00000000000006f4 RDI: 0000000000000000
> >> [ 5268.192145] RBP: 00007f009cbb6460 R08: 00007f009d10f000 R09: 000000000000016c
> >> [ 5268.194687] R10: 0000000000000000 R11: 0000000000000010 R12: 00007f009cf97660
> >> [ 5268.197172] R13: 00007f009756ede0 R14: 00007f0097582348 R15: 0000000000000002
> >> [ 5268.199419]  </TASK>
> >> [ 5268.200227] ---[ end trace 0000000000000000 ]---
> >>
> >> I also once changed the WARN_ON to WARN_ON_FOLIO and I got more info here:
> >>
> >> [ 3994.907255] page: refcount:9 mapcount:1 mapping:0000000000000000
> >> index:0x7f90b3e98 pfn:0x615028
> >> [ 3994.914449] head: order:3 mapcount:8 entire_mapcount:0
> >> nr_pages_mapped:8 pincount:0
> >> [ 3994.924534] memcg:ffff888106746000
> >> [ 3994.927868] anon flags:
> >> 0x17ffffc002084c(referenced|uptodate|owner_2|head|swapbacked|node=0|zone=2|lastcpupid=0x1fffff)
> >> [ 3994.933479] raw: 0017ffffc002084c ffff88816edd9128 ffffea000beac108
> >> ffff8882e8ba6bc9
> >> [ 3994.936251] raw: 00000007f90b3e98 0000000000000000 0000000900000000
> >> ffff888106746000
> >> [ 3994.939466] head: 0017ffffc002084c ffff88816edd9128
> >> ffffea000beac108 ffff8882e8ba6bc9
> >> [ 3994.943355] head: 00000007f90b3e98 0000000000000000
> >> 0000000900000000 ffff888106746000
> >> [ 3994.946988] head: 0017ffffc0000203 ffffea0018540a01
> >> 0000000800000007 00000000ffffffff
> >> [ 3994.950328] head: ffffffff00000007 00000000800000a3
> >> 0000000000000000 0000000000000008
> >> [ 3994.953684] page dumped because:
> >> VM_WARN_ON_FOLIO(folio_large_mapcount(folio) < folio_ref_count(folio))
> >> [ 3994.957534] ------------[ cut here ]------------
> >> [ 3994.959917] WARNING: CPU: 16 PID: 555282 at mm/memory.c:3794
> >> do_wp_page+0x10c0/0x1110
> >> [ 3994.963069] Modules linked in: zram virtiofs
> >> [ 3994.964726] CPU: 16 UID: 0 PID: 555282 Comm: sh Kdump: loaded Not
> >> tainted 6.15.0-rc1.ptch-ge39aef85f4c0-dirty #1431 PREEMPT(voluntary)
> >> [ 3994.969985] Hardware name: Red Hat KVM/RHEL-AV, BIOS 0.0.0 02/06/2015
> >> [ 3994.972905] RIP: 0010:do_wp_page+0x10c0/0x1110
> >> [ 3994.974477] Code: fe ff 0f 0b bd f5 ff ff ff e9 16 fb ff ff 41 83
> >> a9 bc 12 00 00 01 e9 2f fb ff ff 48 c7 c6 90 c2 49 82 4c 89 ef e8 40
> >> fd fe ff <0f> 0b e9 6a fc ff ff 65 ff 00 f0 48 0f b
> >> a 6d 00 1f 0f 83 46 fc ff
> >> [ 3994.981033] RSP: 0000:ffffc9002b3c7d40 EFLAGS: 00010246
> >> [ 3994.982636] RAX: 000000000000005b RBX: 0000000000000000 RCX: 0000000000000000
> >> [ 3994.984778] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff889ffea16a80
> >> [ 3994.986865] RBP: ffffea0018540a68 R08: 0000000000000000 R09: c0000000ffff7fff
> >> [ 3994.989316] R10: 0000000000000001 R11: ffffc9002b3c7b80 R12: ffff88810cfd7d40
> >> [ 3994.991654] R13: ffffea0018540a00 R14: 00007f90b3e9d620 R15: ffffc9002b3c7dd8
> >> [ 3994.994076] FS:  00007f90b3caa740(0000) GS:ffff88a07b194000(0000)
> >> knlGS:0000000000000000
> >> [ 3994.996939] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> [ 3994.998902] CR2: 00007f90b3e9d620 CR3: 0000000104088004 CR4: 0000000000770eb0
> >> [ 3995.001314] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> >> [ 3995.003746] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> >> [ 3995.006173] PKRU: 55555554
> >> [ 3995.007117] Call Trace:
> >> [ 3995.007988]  <TASK>
> >> [ 3995.008755]  ? __pfx_default_wake_function+0x10/0x10
> >> [ 3995.010490]  ? ___pte_offset_map+0x1b/0x110
> >> [ 3995.011929]  __handle_mm_fault+0xa51/0xf00
> >> [ 3995.013346]  handle_mm_fault+0x13d/0x360
> >> [ 3995.014796]  do_user_addr_fault+0x2f2/0x7f0
> >> [ 3995.016331]  ? sigprocmask+0x77/0xa0
> >> [ 3995.017656]  exc_page_fault+0x6a/0x140
> >> [ 3995.018978]  asm_exc_page_fault+0x26/0x30
> >> [ 3995.020309] RIP: 0033:0x7f90b3d881a7
> >> [ 3995.021461] Code: e8 4e b1 f8 ff 66 66 2e 0f 1f 84 00 00 00 00 00
> >> 0f 1f 00 f3 0f 1e fa 55 31 c0 ba 01 00 00 00 48 89 e5 53 48 89 fb 48
> >> 83 ec 08 <f0> 0f b1 15 71 54 11 00 0f 85 3b 01 00 0
> >> 0 48 8b 35 84 54 11 00 48
> >> [ 3995.028091] RSP: 002b:00007ffc33632c90 EFLAGS: 00010206
> >> [ 3995.029992] RAX: 0000000000000000 RBX: 0000560cfbfc0a40 RCX: 0000000000000000
> >> [ 3995.032456] RDX: 0000000000000001 RSI: 0000000000000005 RDI: 0000560cfbfc0a40
> >> [ 3995.034794] RBP: 00007ffc33632ca0 R08: 00007ffc33632d50 R09: 00007ffc33632cff
> >> [ 3995.037534] R10: 00007ffc33632c70 R11: 00007ffc33632d00 R12: 0000560cfbfc0a40
> >> [ 3995.041063] R13: 00007f90b3e97fd0 R14: 00007f90b3e97fa8 R15: 0000000000000000
> >> [ 3995.044390]  </TASK>
> >> [ 3995.045510] ---[ end trace 0000000000000000 ]---
> >>
> >> My guess is that folio_ref_count is not a reliable thing to check here;
> >> anything can increase the folio's refcount even without locking it,
> >> for example a swap cache lookup or anything iterating the LRU.
> >
> > It is reliable, we are holding the mapcount lock, so for each mapcount
> > we must have a corresponding refcount. If that is not the case, we have
> > an issue elsewhere.
> >
> > Other reference may only increase the refcount, but not violate the
> > mapcount vs. refcount condition.
> >
> > Can you reproduce also with swap disabled?
>
> Oh, re-reading the condition 3 times, I realize that the sanity check is wrong ...
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 037b6ce211f1f..a17eeef3f1f89 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3789,7 +3789,7 @@ static bool __wp_can_reuse_large_anon_folio(struct folio *folio,
>
>          /* Stabilize the mapcount vs. refcount and recheck. */
>          folio_lock_large_mapcount(folio);
> -       VM_WARN_ON_ONCE(folio_large_mapcount(folio) < folio_ref_count(folio));
> +       VM_WARN_ON_ONCE(folio_large_mapcount(folio) > folio_ref_count(folio));

Ah, it makes sense to me now :)

Thanks for the quick response.

>
>          if (folio_test_large_maybe_mapped_shared(folio))
>                  goto unlock;
>
> Our refcount must be at least the mapcount, that's what we want to assert.
>
> Can you test and send a fix patch if that makes it fly for you?

Sure, I'll keep testing. I think that will fix it; the few
WARN_ON_FOLIO reports I have all show the mapcount being smaller than
the refcount.
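
For anyone following along in the archive, below is a minimal userspace
sketch of the invariant being asserted. It is a toy model only, not
kernel code; the struct and helper names are made up for illustration.
The point is that every mapping takes a reference and other users may
take extra references, so refcount >= mapcount must always hold, which
is why the warning has to fire on mapcount > refcount rather than the
other way around:

/* toy_folio.c - illustration only, not kernel code */
#include <assert.h>
#include <stdio.h>

struct toy_folio {
	int refcount;   /* all references: mappings, swapcache, LRU walkers, ... */
	int mapcount;   /* only the references coming from page table mappings */
};

static void toy_map(struct toy_folio *f)
{
	f->refcount++;  /* reference taken before the mapcount, as in the patch */
	f->mapcount++;
}

static void toy_unmap(struct toy_folio *f)
{
	f->mapcount--;  /* mapcount dropped before the reference */
	f->refcount--;
}

int main(void)
{
	struct toy_folio f = { 0, 0 };

	toy_map(&f);
	f.refcount++;   /* e.g. a temporary/speculative extra reference */

	/* Corrected check: refcount may exceed mapcount, never the reverse. */
	assert(!(f.mapcount > f.refcount));

	/* The old check direction would have warned on this perfectly fine state. */
	printf("mapcount=%d refcount=%d, old check (mapcount < refcount) %s\n",
	       f.mapcount, f.refcount,
	       f.mapcount < f.refcount ? "would warn" : "is quiet");

	f.refcount--;
	toy_unmap(&f);
	return 0;
}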

>
> --
> Cheers,
>
> David / dhildenb
>
