Message-ID: <CAGsJ_4zNounCe-N=BpqSLsk27FOJBJ9=eRbOE8CzOKF=H7kE0Q@mail.gmail.com>
Date: Thu, 20 Feb 2025 22:31:38 +1300
From: Barry Song <21cnbao@...il.com>
To: David Hildenbrand <david@...hat.com>
Cc: Suren Baghdasaryan <surenb@...gle.com>, Lokesh Gidra <lokeshgidra@...gle.com>, linux-mm@...ck.org,
akpm@...ux-foundation.org, linux-kernel@...r.kernel.org,
zhengtangquan@...o.com, Barry Song <v-songbaohua@...o.com>,
Andrea Arcangeli <aarcange@...hat.com>, Al Viro <viro@...iv.linux.org.uk>,
Axel Rasmussen <axelrasmussen@...gle.com>, Brian Geffon <bgeffon@...gle.com>,
Christian Brauner <brauner@...nel.org>, Hugh Dickins <hughd@...gle.com>, Jann Horn <jannh@...gle.com>,
Kalesh Singh <kaleshsingh@...gle.com>, "Liam R . Howlett" <Liam.Howlett@...cle.com>,
Matthew Wilcox <willy@...radead.org>, Michal Hocko <mhocko@...e.com>, Mike Rapoport <rppt@...nel.org>,
Nicolas Geoffray <ngeoffray@...gle.com>, Peter Xu <peterx@...hat.com>,
Ryan Roberts <ryan.roberts@....com>, Shuah Khan <shuah@...nel.org>,
ZhangPeng <zhangpeng362@...wei.com>, Yu Zhao <yuzhao@...gle.com>
Subject: Re: [PATCH RFC] mm: Fix kernel BUG when userfaultfd_move encounters swapcache
On Thu, Feb 20, 2025 at 9:51 PM David Hildenbrand <david@...hat.com> wrote:
>
> On 19.02.25 21:37, Barry Song wrote:
> > On Thu, Feb 20, 2025 at 7:27 AM Suren Baghdasaryan <surenb@...gle.com> wrote:
> >>
> >> On Wed, Feb 19, 2025 at 3:25 AM Barry Song <21cnbao@...il.com> wrote:
> >>>
> >>> From: Barry Song <v-songbaohua@...o.com>
> >>>
> >>> userfaultfd_move() checks whether the PTE entry is present or a
> >>> swap entry.
> >>>
> >>> - If the PTE entry is present, move_present_pte() handles folio
> >>>   migration by setting:
> >>>
> >>>      src_folio->index = linear_page_index(dst_vma, dst_addr);
> >>>
> >>> - If the PTE entry is a swap entry, move_swap_pte() simply copies
> >>>   the PTE to the new dst_addr.
> >>>
> >>> This approach is incorrect because even if the PTE is a swap
> >>> entry, it can still reference a folio that remains in the swap
> >>> cache.
> >>>
> >>> If do_swap_page() is triggered, it may locate the folio in the
> >>> swap cache. However, during add_rmap operations, a kernel panic
> >>> can occur due to:
> >>> page_pgoff(folio, page) != linear_page_index(vma, address)
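For reference, the check that fires is the one in __page_check_anon_rmap()
in mm/rmap.c (quoted from memory, so modulo version drift):

	/*
	 * The page's offset within its anon mapping must match the
	 * faulting address.  Moving only the swap PTE leaves
	 * folio->index describing the *source* VMA, so a later fault
	 * at dst_addr trips this check.
	 */
	VM_BUG_ON_PAGE(page_pgoff(folio, page) !=
		       linear_page_index(vma, address), page);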
> >>
> >> Thanks for the report and reproducer!
> >>
> >>>
> >>> $./a.out > /dev/null
> >>> [ 13.336953] page: refcount:6 mapcount:1 mapping:00000000f43db19c index:0xffffaf150 pfn:0x4667c
> >>> [ 13.337520] head: order:2 mapcount:1 entire_mapcount:0 nr_pages_mapped:1 pincount:0
> >>> [ 13.337716] memcg:ffff00000405f000
> >>> [ 13.337849] anon flags: 0x3fffc0000020459(locked|uptodate|dirty|owner_priv_1|head|swapbacked|node=0|zone=0|lastcpupid=0xffff)
> >>> [ 13.338630] raw: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> >>> [ 13.338831] raw: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> >>> [ 13.339031] head: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> >>> [ 13.339204] head: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> >>> [ 13.339375] head: 03fffc0000000202 fffffdffc0199f01 ffffffff00000000 0000000000000001
> >>> [ 13.339546] head: 0000000000000004 0000000000000000 00000000ffffffff 0000000000000000
> >>> [ 13.339736] page dumped because: VM_BUG_ON_PAGE(page_pgoff(folio, page) != linear_page_index(vma, address))
> >>> [ 13.340190] ------------[ cut here ]------------
> >>> [ 13.340316] kernel BUG at mm/rmap.c:1380!
> >>> [ 13.340683] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
> >>> [ 13.340969] Modules linked in:
> >>> [ 13.341257] CPU: 1 UID: 0 PID: 107 Comm: a.out Not tainted 6.14.0-rc3-gcf42737e247a-dirty #299
> >>> [ 13.341470] Hardware name: linux,dummy-virt (DT)
> >>> [ 13.341671] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> >>> [ 13.341815] pc : __page_check_anon_rmap+0xa0/0xb0
> >>> [ 13.341920] lr : __page_check_anon_rmap+0xa0/0xb0
> >>> [ 13.342018] sp : ffff80008752bb20
> >>> [ 13.342093] x29: ffff80008752bb20 x28: fffffdffc0199f00 x27: 0000000000000001
> >>> [ 13.342404] x26: 0000000000000000 x25: 0000000000000001 x24: 0000000000000001
> >>> [ 13.342575] x23: 0000ffffaf0d0000 x22: 0000ffffaf0d0000 x21: fffffdffc0199f00
> >>> [ 13.342731] x20: fffffdffc0199f00 x19: ffff000006210700 x18: 00000000ffffffff
> >>> [ 13.342881] x17: 6c203d2120296567 x16: 6170202c6f696c6f x15: 662866666f67705f
> >>> [ 13.343033] x14: 6567617028454741 x13: 2929737365726464 x12: ffff800083728ab0
> >>> [ 13.343183] x11: ffff800082996bf8 x10: 0000000000000fd7 x9 : ffff80008011bc40
> >>> [ 13.343351] x8 : 0000000000017fe8 x7 : 00000000fffff000 x6 : ffff8000829eebf8
> >>> [ 13.343498] x5 : c0000000fffff000 x4 : 0000000000000000 x3 : 0000000000000000
> >>> [ 13.343645] x2 : 0000000000000000 x1 : ffff0000062db980 x0 : 000000000000005f
> >>> [ 13.343876] Call trace:
> >>> [ 13.344045] __page_check_anon_rmap+0xa0/0xb0 (P)
> >>> [ 13.344234] folio_add_anon_rmap_ptes+0x22c/0x320
> >>> [ 13.344333] do_swap_page+0x1060/0x1400
> >>> [ 13.344417] __handle_mm_fault+0x61c/0xbc8
> >>> [ 13.344504] handle_mm_fault+0xd8/0x2e8
> >>> [ 13.344586] do_page_fault+0x20c/0x770
> >>> [ 13.344673] do_translation_fault+0xb4/0xf0
> >>> [ 13.344759] do_mem_abort+0x48/0xa0
> >>> [ 13.344842] el0_da+0x58/0x130
> >>> [ 13.344914] el0t_64_sync_handler+0xc4/0x138
> >>> [ 13.345002] el0t_64_sync+0x1ac/0x1b0
> >>> [ 13.345208] Code: aa1503e0 f000f801 910f6021 97ff5779 (d4210000)
> >>> [ 13.345504] ---[ end trace 0000000000000000 ]---
> >>> [ 13.345715] note: a.out[107] exited with irqs disabled
> >>> [ 13.345954] note: a.out[107] exited with preempt_count 2
> >>>
> >>> Fully fixing it would be quite complex, requiring folio handling
> >>> similar to that in move_present_pte().
> >>
> >> How complex would that be? Is it a matter of adding
> >> folio_maybe_dma_pinned() checks, doing folio_move_anon_rmap() and
> >> folio->index = linear_page_index like in move_present_pte() or
> >> something more?
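For concreteness, here is a rough sketch of the direction Suren describes
(move_swapcache_folio() is a made-up helper name; the swapcache lookup, the
PTL and folio-lock dance, and the error paths are all elided):

	/*
	 * Sketch only: mirror the tail of move_present_pte() for a swap
	 * PTE whose folio is still in the swapcache.  Assumes src_folio
	 * was looked up from the swap entry and is already locked.
	 */
	static int move_swapcache_folio(struct vm_area_struct *dst_vma,
					unsigned long dst_addr,
					struct folio *src_folio)
	{
		if (folio_test_large(src_folio))
			return -EBUSY;	/* needs a split first, see below */

		if (folio_maybe_dma_pinned(src_folio))
			return -EBUSY;

		/* re-point the anon folio at the destination VMA/address */
		folio_move_anon_rmap(src_folio, dst_vma);
		src_folio->index = linear_page_index(dst_vma, dst_addr);
		return 0;
	}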
> >
> > My main concern is still with large folios that require a split_folio()
> > during move_pages(), as the entire folio shares the same index and
> > anon_vma. However, userfaultfd_move() moves pages individually,
> > making a split necessary.
> >
> > Meanwhile, in split_huge_page_to_list_to_order(), there is this check:
> >
> > 	if (folio_test_writeback(folio))
> > 		return -EBUSY;
> >
> > This is likely to be true for a swapcache folio, right? But note that even
> > in the present-PTE path, a failed split simply returns -EBUSY:
> >
> > move_pages_pte()
> > {
> > 	/* at this point we have src_folio locked */
> > 	if (folio_test_large(src_folio)) {
> > 		/* split_folio() can block */
> > 		pte_unmap(&orig_src_pte);
> > 		pte_unmap(&orig_dst_pte);
> > 		src_pte = dst_pte = NULL;
> > 		err = split_folio(src_folio);
> > 		if (err)
> > 			goto out;
> >
> > 		/* have to reacquire the folio after it got split */
> > 		folio_unlock(src_folio);
> > 		folio_put(src_folio);
> > 		src_folio = NULL;
> > 		goto retry;
> > 	}
> > }
> >
> > Do we need a folio_wait_writeback() before calling split_folio()?
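To make this concrete, I mean something like the following before the split
(a sketch; the folio is locked and the PTEs are already unmapped at this
point, so sleeping here should be fine):

	/*
	 * split_huge_page_to_list_to_order() bails out with -EBUSY on a
	 * folio under writeback, so drain writeback first.  We hold a
	 * reference and the folio lock, as folio_wait_writeback() expects.
	 */
	if (folio_test_writeback(src_folio))
		folio_wait_writeback(src_folio);
	err = split_folio(src_folio);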
> >
> > By the way, I have also reported that userfaultfd_move() has a fundamental
> > conflict with TAO (Cc'ed Yu Zhao), which has been part of the Android common
> > kernel. In this scenario, folios in the virtual zone won’t be split in
> > split_folio(). Instead, the large folio migrates into nr_pages small folios.
> >
> > Thus, the best-case scenario would be:
> >
> > mTHP -> migrate to small folios in split_folio() -> move small folios to
> > dst_addr
> >
> > While this works, it negates the performance benefit of
> > userfaultfd_move(): it introduces two PTE operations (the migration in
> > split_folio() and the move in userfaultfd_move() on retry), nr_pages memory
> > allocations, and still requires one memcpy(). This could end up
> > performing even worse than userfaultfd_copy(), I guess.
> >
> > The worst-case scenario would be failing to allocate small folios in
> > split_folio(), then userfaultfd_move() might return -ENOMEM?
>
> Although that's an Android problem and not an upstream problem, I'll
> note that there are other reasons why the split / move might fail, and
> user space either must retry or fallback to a COPY.
>
> Regarding mTHP, we could move the whole folio if the user space-provided
> range allows for batching over multiple PTEs (nr_ptes), they are in a
> single VMA, and folio_mapcount() == nr_ptes.
>
> There are corner cases to handle, such as moving an mTHP such that it
> suddenly crosses two page tables, I assume; these are harder to handle,
> and cannot arise at all when moving individual PTEs.
This is a useful suggestion. I've heard that Lokesh is also interested in
modifying ART to perform moves at mTHP granularity, which would require
kernel modifications as well. That is likely the direction we'll take after
fixing the current urgent bugs; the current split_folio() approach really
isn't ideal.

The corner cases you mentioned are definitely worth considering. However,
once we can perform batch UFFDIO_MOVE (roughly along the lines of the
eligibility check sketched below), I believe the conflict between
userfaultfd_move() and TAO will be resolved in most cases. For those corner
cases, ART will still need to be aware that falling back to copy, or
retrying, is necessary.
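For my own understanding, the whole-folio eligibility check would look
something like this (a sketch only; mthp_move_eligible() is a hypothetical
helper name, and alignment checks plus the page-table-crossing corner case
are ignored here):

	/*
	 * Sketch: can this mTHP be moved as a whole instead of being
	 * split?  Assumes src_addr is folio-aligned and the folio is
	 * locked.
	 */
	static bool mthp_move_eligible(struct vm_area_struct *src_vma,
				       unsigned long src_addr,
				       unsigned long len,
				       struct folio *folio)
	{
		long nr_ptes = folio_nr_pages(folio);

		/* userspace-provided range must cover the whole folio ... */
		if (len < nr_ptes * PAGE_SIZE)
			return false;
		/* ... within a single VMA ... */
		if (src_addr + nr_ptes * PAGE_SIZE > src_vma->vm_end)
			return false;
		/* ... and the folio must be mapped exactly nr_ptes times, here */
		return folio_mapcount(folio) == nr_ptes;
	}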
>
> --
> Cheers,
>
> David / dhildenb
>
Thanks
Barry