Message-ID: <aFEAPOozHsR1/PLI@ly-workstation>
Date: Tue, 17 Jun 2025 13:42:20 +0800
From: "Lai, Yi" <yi1.lai@...ux.intel.com>
To: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
	Vlastimil Babka <vbabka@...e.cz>, Jann Horn <jannh@...gle.com>,
	"Liam R . Howlett" <Liam.Howlett@...cle.com>,
	Suren Baghdasaryan <surenb@...gle.com>,
	Matthew Wilcox <willy@...radead.org>,
	David Hildenbrand <david@...hat.com>,
	Pedro Falcato <pfalcato@...e.de>, Rik van Riel <riel@...riel.com>,
	Harry Yoo <harry.yoo@...cle.com>, Zi Yan <ziy@...dia.com>,
	Baolin Wang <baolin.wang@...ux.alibaba.com>,
	Nico Pache <npache@...hat.com>, Ryan Roberts <ryan.roberts@....com>,
	Dev Jain <dev.jain@....com>, Jakub Matena <matenajakub@...il.com>,
	Wei Yang <richard.weiyang@...il.com>,
	Barry Song <baohua@...nel.org>, linux-mm@...ck.org,
	linux-kernel@...r.kernel.org, yi1.lai@...el.com
Subject: Re: [PATCH 00/11] mm/mremap: introduce more mergeable mremap via
 MREMAP_RELOCATE_ANON

On Mon, Jun 09, 2025 at 02:26:34PM +0100, Lorenzo Stoakes wrote:

Hi Lorenzo Stoakes,

Greetings!

I used Syzkaller and found that there is a "BUG: sleeping function called from invalid context" in __relocate_anon_folios in linux-next next-20250616.

After bisection, the first bad commit is:
"
aaf5c23bf6a4 mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON
"

All detailed info can be found at:
https://github.com/laifryiee/syzkaller_logs/tree/main/250617_015846___relocate_anon_folios
Syzkaller repro code:
https://github.com/laifryiee/syzkaller_logs/tree/main/250617_015846___relocate_anon_folios/repro.c
Syzkaller repro syscall steps:
https://github.com/laifryiee/syzkaller_logs/tree/main/250617_015846___relocate_anon_folios/repro.prog
Syzkaller report:
https://github.com/laifryiee/syzkaller_logs/tree/main/250617_015846___relocate_anon_folios/repro.report
Kconfig(make olddefconfig):
https://github.com/laifryiee/syzkaller_logs/tree/main/250617_015846___relocate_anon_folios/kconfig_origin
Bisect info:
https://github.com/laifryiee/syzkaller_logs/tree/main/250617_015846___relocate_anon_folios/bisect_info.log
bzImage:
https://github.com/laifryiee/syzkaller_logs/raw/refs/heads/main/250617_015846___relocate_anon_folios/bzImage_050f8ad7b58d9079455af171ac279c4b9b828c11
Issue dmesg:
https://github.com/laifryiee/syzkaller_logs/blob/main/250617_015846___relocate_anon_folios/050f8ad7b58d9079455af171ac279c4b9b828c11_dmesg.log

"
[   51.309319] BUG: sleeping function called from invalid context at ./include/linux/pagemap.h:1112
[   51.309788] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 670, name: repro
[   51.310130] preempt_count: 1, expected: 0
[   51.310316] RCU nest depth: 1, expected: 0
[   51.310502] 4 locks held by repro/670:
[   51.310675]  #0: ffff88801360abe0 (&mm->mmap_lock){++++}-{4:4}, at: __do_sys_mremap+0x42e/0x1620
[   51.311098]  #1: ffff888011a2f078 (&anon_vma->rwsem/1){+.+.}-{4:4}, at: copy_vma_and_data+0x541/0x1790
[   51.311526]  #2: ffffffff8725c7c0 (rcu_read_lock){....}-{1:3}, at: ___pte_offset_map+0x3f/0x6c0
[   51.311929]  #3: ffff888013e8adf8 (ptlock_ptr(ptdesc)#2){+.+.}-{3:3}, at: __pte_offset_map_lock+0x1a2/0x3c0
[   51.312375] Preemption disabled at:
[   51.312377] [<ffffffff81e14222>] __pte_offset_map_lock+0x1a2/0x3c0
[   51.312828] CPU: 0 UID: 0 PID: 670 Comm: repro Not tainted 6.16.0-rc2-next-20250616-050f8ad7b58d #1 PREEMPT(voluntary)
[   51.312837] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 044
[   51.312846] Call Trace:
[   51.312850]  <TASK>
[   51.312853]  dump_stack_lvl+0x121/0x150
[   51.312878]  dump_stack+0x19/0x20
[   51.312884]  __might_resched+0x37b/0x5a0
[   51.312900]  ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30
[   51.312917]  __might_sleep+0xa3/0x170
[   51.312926]  ? vm_normal_folio+0x8c/0x170
[   51.312938]  __relocate_anon_folios+0xf97/0x2960
[   51.312953]  ? reacquire_held_locks+0xdd/0x1f0
[   51.312970]  ? __pfx___relocate_anon_folios+0x10/0x10
[   51.312982]  ? lock_set_class+0x17a/0x260
[   51.312994]  copy_vma_and_data+0x606/0x1790
[   51.313006]  ? percpu_counter_add_batch+0xd9/0x210
[   51.313028]  ? __pfx_copy_vma_and_data+0x10/0x10
[   51.313035]  ? vms_complete_munmap_vmas+0x525/0x810
[   51.313051]  ? __pfx_do_vmi_align_munmap+0x10/0x10
[   51.313064]  ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
[   51.313074]  ? mtree_range_walk+0x728/0xb70
[   51.313089]  ? __lock_acquire+0x412/0x22a0
[   51.313098]  ? __sanitizer_cov_trace_const_cmp1+0x1e/0x30
[   51.313107]  ? percpu_counter_add_batch+0xd9/0x210
[   51.313114]  ? debug_smp_processor_id+0x20/0x30
[   51.313131]  ? __this_cpu_preempt_check+0x21/0x30
[   51.313139]  ? lock_is_held_type+0xef/0x150
[   51.313149]  move_vma+0x689/0x1a60
[   51.313161]  ? __pfx_move_vma+0x10/0x10
[   51.313169]  ? cap_mmap_addr+0x58/0x140
[   51.313182]  ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
[   51.313191]  ? security_mmap_addr+0x63/0x1b0
[   51.313203]  ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30
[   51.313212]  ? __get_unmapped_area+0x1a4/0x440
[   51.313223]  ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30
[   51.313232]  ? vrm_set_new_addr+0x21d/0x2b0
[   51.313241]  __do_sys_mremap+0xeb4/0x1620
[   51.313251]  ? __pfx___do_sys_mremap+0x10/0x10
[   51.313261]  ? __this_cpu_preempt_check+0x21/0x30
[   51.313276]  ? __this_cpu_preempt_check+0x21/0x30
[   51.313300]  __x64_sys_mremap+0xc7/0x150
[   51.313307]  ? syscall_trace_enter+0x14d/0x280
[   51.313320]  x64_sys_call+0x1933/0x2150
[   51.313332]  do_syscall_64+0x6d/0x2e0
[   51.313342]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   51.313349] RIP: 0033:0x7ff58583ee5d
[   51.313361] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 898
[   51.313367] RSP: 002b:00007ffd1f3a23e8 EFLAGS: 00000217 ORIG_RAX: 0000000000000019
[   51.313376] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007ff58583ee5d
[   51.313380] RDX: 0000000000002000 RSI: 0000000000002000 RDI: 000000002022e000
[   51.313384] RBP: 00007ffd1f3a23f0 R08: 000000002038d000 R09: 0000000000000001
[   51.313387] R10: 000000000000000f R11: 0000000000000217 R12: 00007ffd1f3a2508
[   51.313391] R13: 0000000000401126 R14: 0000000000403e08 R15: 00007ff585bab000
[   51.313403]  </TASK>
"

Hope this could be insightful to you.

Regards,
Yi Lai

---

If you do not need the following environment to reproduce the problem, or you
already have a reproduction environment, please ignore the information below.

How to reproduce:
git clone https://gitlab.com/xupengfe/repro_vm_env.git
cd repro_vm_env
tar -xvf repro_vm_env.tar.gz
cd repro_vm_env; ./start3.sh  // it needs qemu-system-x86_64; I used v7.1.0
  // start3.sh will load the v6.2-rc5 kernel bzImage_2241ab53cbb5cdb08a6b2d4688feb13971058f65
  // You can change bzImage_xxx as you want
  // You may need to remove the line "-drive if=pflash,format=raw,readonly=on,file=./OVMF_CODE.fd \" for a different qemu version
You can use the command below to log in; there is no password for root.
ssh -p 10023 root@...alhost

After logging in to the VM (virtual machine) successfully, you can transfer the
reproducer binary to the VM as shown below and reproduce the problem in the VM:
gcc -pthread -o repro repro.c
scp -P 10023 repro root@...alhost:/root/

Get the bzImage for the target kernel:
Please use the target kconfig and copy it to kernel_src/.config
make olddefconfig
make -jx bzImage           // x should be equal to or less than the number of CPUs your machine has

Point start3.sh above at the resulting bzImage file to load the target kernel in the VM.


Tips:
If you already have qemu-system-x86_64, please ignore the info below.
If you want to install the qemu v7.1.0 version:
git clone https://github.com/qemu/qemu.git
cd qemu
git checkout -f v7.1.0
mkdir build
cd build
yum install -y ninja-build.x86_64
yum -y install libslirp-devel.x86_64
../configure --target-list=x86_64-softmmu --enable-kvm --enable-vnc --enable-gtk --enable-sdl --enable-usb-redir --enable-slirp
make
make install 

> A long-standing issue with VMA merging of anonymous VMAs is the requirement
> to maintain both vma->vm_pgoff and anon_vma compatibility between merge
> candidates.
> 
> For anonymous mappings, vma->vm_pgoff (and consequently, folio->index)
> refers to virtual page offsets, that is, va >> PAGE_SHIFT.
> 
> However, upon mremap() of an anonymous mapping that has been faulted (that
> is, where vma->anon_vma != NULL), we would then need to walk page tables to
> be able to access, let alone manipulate, the folio->index and folio->mapping
> fields to permit an update of this virtual page offset.
> 
> Therefore in these instances, we do not do so, instead retaining the
> virtual page offset the VMA was first faulted in at as its vma->vm_pgoff
> field, and of course consequently folio->index.
> 
> On each occasion we use linear_page_index() to determine the appropriate
> offset, which cleverly offsets the vma->vm_pgoff field by the difference
> between the virtual address and the actual VMA start.
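
As an aside, the following is a minimal userspace sketch of the
linear_page_index() arithmetic described above. The struct is a simplified
stand-in for vm_area_struct rather than the kernel definition, and the
addresses are arbitrary; it only illustrates how a moved-but-unadjusted
vma->vm_pgoff keeps folio->index anchored to the original virtual address.

#include <stdio.h>

#define PAGE_SHIFT 12

struct vma_stub {			/* simplified stand-in for vm_area_struct */
	unsigned long vm_start;		/* current start address of the mapping */
	unsigned long vm_pgoff;		/* virtual page offset set at first fault */
};

/* Mirrors the arithmetic of linear_page_index(): vm_pgoff plus the page
 * offset of the address within the VMA. */
static unsigned long stub_linear_page_index(const struct vma_stub *vma,
					    unsigned long address)
{
	return vma->vm_pgoff + ((address - vma->vm_start) >> PAGE_SHIFT);
}

int main(void)
{
	/* Faulted in at 0x7f0000000000, later mremap()'d to 0x7e0000000000
	 * without vm_pgoff being updated. */
	struct vma_stub moved = {
		.vm_start = 0x7e0000000000UL,
		.vm_pgoff = 0x7f0000000000UL >> PAGE_SHIFT,
	};

	/* Reports an index derived from the old address, which is why the
	 * VMA cannot merge with a neighbour whose vm_pgoff reflects its own
	 * (new) address. */
	printf("index of first page: %#lx\n",
	       stub_linear_page_index(&moved, moved.vm_start));
	return 0;
}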
> 
> Doing so in effect fragments the virtual address space, meaning that we are
> no longer able to merge these VMAs with adjacent ones that could, at least
> theoretically, be merged.
> 
> This also creates a difference in behaviour, often surprising to users,
> between mappings which are faulted and those which are not - as for the
> latter we adjust vma->vm_pgoff upon mremap() to aid mergeability.
> 
> This is problematic firstly because this proliferates kernel allocations
> that are pure memory pressure - unreclaimable and unmovable -
> i.e. vm_area_struct, anon_vma, anon_vma_chain objects that need not exist.
> 
> Secondly, mremap() exhibits an implicit uAPI in that it does not permit
> remaps which span multiple VMAs (though it does permit remaps that
> constitute a part of a single VMA).
> 
> This means that a user must concern themselves with whether merges succeed
> or not should they wish to use mremap() in such a way which causes multiple
> mremap() calls to be performed upon mappings.
> 
> This series provides users with an option to accept the overhead of
> actually updating the VMA and underlying folios via the
> MREMAP_RELOCATE_ANON flag.
> 
> If MREMAP_RELOCATE_ANON is specified, but an ordinary merge would result in
> the mremap() succeeding, then no attempt is made at relocation of folios as
> this is not required.
> 
> Even if no merge is possible upon moving of the region, vma->vm_pgoff and
> folio->index fields are appropriately updated in order that subsequent
> mremap() or mprotect() calls will succeed in merging.
> 
> This flag falls back to the ordinary means of mremap() should the operation
> not be feasible. It also transparently undoes the operation, carefully
> holding rmap locks such that no racing rmap operation encounters incorrect
> or missing VMAs.
> 
> In addition, the MREMAP_MUST_RELOCATE_ANON flag is supplied in case the
> user needs to know whether or not the operation succeeded - this flag is
> identical to MREMAP_RELOCATE_ANON, except that if the operation cannot
> succeed, the mremap() fails with -EFAULT.
> 
> Note that no-op mremap() operations (such as an unpopulated range, or a
> merge that would trivially succeed already) will succeed under
> MREMAP_MUST_RELOCATE_ANON.
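
To make the new flags concrete, here is a hypothetical usage sketch (not
taken from this series; the mapping size and addresses are illustrative
only). It assumes the patched uapi headers are installed so that
MREMAP_RELOCATE_ANON is defined, and it issues a raw syscall since, as the
selftest notes further down mention, glibc's wrapper will otherwise filter
the flags.

#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdio.h>
#include <string.h>
#include <linux/mman.h>		/* patched header defining MREMAP_RELOCATE_ANON */

#ifndef MREMAP_RELOCATE_ANON
#error "build against uapi headers from this series"
#endif

int main(void)
{
	size_t len = 16 * 4096;
	char *src, *dst, *ret;

	src = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (src == MAP_FAILED) {
		perror("mmap src");
		return 1;
	}
	memset(src, 'x', len);	/* fault the range so vma->anon_vma is populated */

	/* Pick a destination for MREMAP_FIXED; as with MAP_FIXED, whatever is
	 * mapped at the destination is replaced. */
	dst = mmap(NULL, len, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (dst == MAP_FAILED) {
		perror("mmap dst");
		return 1;
	}

	/* Raw syscall so the new flag is not filtered by the libc wrapper. */
	ret = (char *)syscall(SYS_mremap, src, len, len,
			      MREMAP_MAYMOVE | MREMAP_FIXED | MREMAP_RELOCATE_ANON,
			      dst);
	if (ret == (char *)-1) {
		perror("mremap(MREMAP_RELOCATE_ANON)");
		return 1;
	}

	/* With MREMAP_MUST_RELOCATE_ANON instead, the call would fail with
	 * -EFAULT rather than fall back if relocation cannot be performed. */
	printf("moved %p -> %p, first byte: %c\n", (void *)src, (void *)ret, *ret);
	return 0;
}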
> 
> mremap() already walks page tables, so it isn't an order of magnitude
> increase in workload, but it constitutes the need to walk to page table
> leaf level and manipulate folios.
> 
> The operations all succeed under THP and in general are compatible with
> underlying large folios of any size. In fact, the larger the folio, the
> more efficient the operation is.
> 
> Performance testing indicates that the time taken using MREMAP_RELOCATE_ANON
> is of the same order of magnitude as ordinary mremap() operations, with both
> exhibiting a run time proportional to the fraction of the mapping which is
> populated.
> 
> Of course, mremap() operations that are entirely aligned are significantly
> faster as they need only move a VMA and a smaller number of higher order
> page tables, but this is unavoidable.
> 
> Previous efforts in this area
> =============================
> 
> An approach addressing this issue was previously suggested by Jakub Matena
> in a series posted a few years ago in [0] (and discussed in a master's
> thesis).
> 
> However this was a more general effort which attempted to always make
> anonymous mappings more mergeable, and therefore was not quite ready for
> the upstream limelight. In addition, large folio work which has occurred
> since requires us to carefully consider and account for this.
> 
> This series is more conservative and targeted (one must specify a flag to
> get this behaviour) and additionally goes to great efforts to handle large
> folios and account for all of the nitty-gritty locking concerns that might
> arise in current kernel code.
> 
> Thanks go out to Jakub for his efforts, however, and hopefully this effort
> to take a slightly different approach to the same problem is pleasing to
> him regardless :)
> 
> [0]:https://lore.kernel.org/all/20220311174602.288010-1-matenajakub@gmail.com/
> 
> Use-cases
> =========
> 
> * ZGC is a concurrent GC shipped with OpenJDK. A prototype is being worked
>   upon which makes use of extensive mremap() operations to perform
>   defragmentation of objects, taking advantage of the plentiful available
>   virtual address space in a 64-bit system.
> 
>   In instances where one VMA is faulted in and another not, merging is not
>   possible, which leads to significant, unreclaimable, kernel metadata
>   overhead and contention on the vm.max_map_count limit.
> 
>   This series eliminates the issue entirely.
> * It was indicated that Android similarly moves memory around and
>   encounters the very same issues as ZGC.
> * SUSE indicates it has encountered similar issues pertaining to an
>   internal client.
> 
> Past approaches
> ===============
> 
> In discussions at LSF/MM/BPF it was suggested that we could make this an
> madvise() operation; however, at that point it would be too late to
> correctly perform the merge, requiring an unmap/remap which would be
> egregious.
> 
> It was further suggested that we simply defer the operation to the point at
> which an mremap() is attempted on multiple immediately adjacent VMAs (that
> is - to allow VMA fragmentation up until the point where it might cause
> perceptible issues with uAPI).
> 
> This is problematic in that, in the first instance, you accrue
> fragmentation, and only if you were to try to move the fragmented objects
> again would you resolve it.
> 
> Additionally you would not be able to handle the mprotect() case, and you'd
> have the same issue as the madvise() approach in that you'd need to
> essentially re-map each VMA.
> 
> Additionally it would become non-trivial to correctly merge the VMAs - if
> there were more than 3, we would need to invent a new merging mechanism
> specifically for this, hold locks carefully over each to avoid them
> disappearing from beneath us and introduce a great deal of non-optional
> complexity.
> 
> While imperfect, the mremap flag approach seems the least invasive, most
> workable solution (until further rework of the anon_vma mechanism can be
> achieved!)
> 
> Testing
> =======
> 
> * Significantly expanded self-tests, all of which are passing.
> * Explicit testing of forked cases including anon_vma reuse, all passing
>   correctly.
> * Ran all self tests with MREMAP_RELOCATE_ANON forced on for all anonymous
>   mremap() calls.
> * Ran heavy workloads with MREMAP_RELOCATE_ANON forced on, on real hardware
>   (kernel compilation, etc.)
> * Ran stress-ng --mremap 32 for an hour with MREMAP_RELOCATE_ANON forced on,
>   on real hardware.
> 
> Series History
> ==============
> 
> Non-RFC:
> * Rebased on mm-new and fixed merge conflicts, re-confirmed building and
>   all tests passing.
> * Seems to have settled down with all feedback previously raised addressed,
>   so un-RFC'd to propose the series for mainline, timed for the start of
>   the 6.16 rc cycle (thus targeting 6.17).
> 
> RFC v3:
> * Rebased on and fixed conflicts against mm-new.
> * Removed invalid use of folio_test_large_maybe_mapped_shared() in
>   __relocate_large_folio() - this has since been removed and inlined (see
>   [0]) anyway but we should be using folio_maybe_mapped_shared() here at
>   any rate.
> * Moved unnecessary folio large, ksm checks in __relocate_large_folio() to
>   relocate_large_folio() - we already check this in relocate_anon_pte() so
>   this is duplicated in that case.
> * Added new tests explicitly checking that MREMAP_MUST_RELOCATE_ANON fails
>   for forked processes, both forked children with parents as indicated by
>   avc, and forked parents with children.
> * Added anon_vma_assert_locked() helper.
> * Removed vma_had_uncowed_children() as it was incorrectly implemented (it
>   didn't account for grandchildren and descendants not being
>   self-parented), and replaced it with a general
>   vma_maybe_has_shared_anon_folios() function which checks both parent and
>   child VMAs. Wei raised a concern in this area; this helps clarify and
>   correct it.
> * Converted the anon_vma vs. mmap lock check in
>   vma_maybe_has_shared_anon_folios() to be more sensible and to assume the
>   caller holds sufficient locks (checked with an assert).
> * Added additional recipients based on recent MAINTAINERS changes.
> * Added a missing reference to Jakub's efforts in this area a few years ago
>   to the cover letter. Thanks Jakub!
> https://lore.kernel.org/all/cover.1746305604.git.lorenzo.stoakes@oracle.com/
> 
> RFC v2:
> * Added a folio_mapcount() check on relocate anon to assert the folio is
>   exclusively mapped, as per Jann.
> * Added check for anon_vma->num_children > nr_pages in
>   should_relocate_anon() as per Jann.
> * Separated out vma_had_uncowed_parents() into shared helper function and
>   added vma_had_uncowed_children() to implement the above.
> * Added a comment clarifying why we do not require an rmap lock on the old
>   VMA due to fork requiring an mmap write lock which we hold.
> * Corrected error path on __anon_vma_prepare() in copy_vma() as per Jann.
> * Checked for folio pinning and aborted if a pin is in place. We do so
>   because this implies the folio is being used by the kernel for a time
>   longer than the time over which an mmap lock is held (which will not be
>   held at the time of us manipulating the folio, as we hold the mmap write
>   lock). We are manipulating the mapping and index fields and being
>   conservative (additionally mirroring what UFFDIO_MOVE does), as we cannot
>   assume that whoever holds the pin isn't somehow relying on these not
>   being manipulated. As per David.
> * Propagated mapcount, maybe DMA pinned checks to large folio logic.
> * Added folio splitting - on second thought, it would be a bit silly to
>   simply disallow the request because of large folio misalignment, so we
>   work around this by splitting the folio in this instance.
> * Added very careful handling around rmap lock, making use of
>   folio_anon_vma(), to ensure we do not deadlock on anon_vma.
> * Prefer vm_normal_folio() to vm_normal_page() & page_folio().
> * Introduced has_shared_anon_vma() to de-duplicate shared anon_vma check.
> * Provided sys_mremap() helper in vm_util.[ch] to be shared among test
>   callers and de-duplicate. This must be a raw system call, as glibc will
>   otherwise filter the flags.
> * Expanded the mm CoW self-tests to explicitly test with
>   MREMAP_RELOCATE_ANON for partial THP pages. This is useful as it
>   exercises split_folio() code paths explicitly. Additionally some cases
>   cannot succeed, so we also exercise undo paths.
> * Added explicit lockdep handling to teach it that we are handling two
>   distinct anon_vma locks so it doesn't spuriously report a deadlock.
> * Updated anon_vma deadlock checks to check anon_vma->root. This shouldn't
>   strictly be necessary as we explicitly limit ourselves to unforked
>   anon_vmas, but it is more correct to do so, as this is where the lock is
>   located.
> * Expanded the split_huge_page_test.c test to also test using the
>   MREMAP_RELOCATE_ANON flag, this is useful as it exercises the undo path.
> https://lore.kernel.org/all/cover.1745307301.git.lorenzo.stoakes@oracle.com/
> 
> RFC v1:
> https://lore.kernel.org/all/cover.1742478846.git.lorenzo.stoakes@oracle.com/
> 
> Lorenzo Stoakes (11):
>   mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON
>   mm/mremap: add MREMAP_MUST_RELOCATE_ANON
>   mm/mremap: add MREMAP[_MUST]_RELOCATE_ANON support for large folios
>   tools UAPI: Update copy of linux/mman.h from the kernel sources
>   tools/testing/selftests: add sys_mremap() helper to vm_util.h
>   tools/testing/selftests: add mremap() cases that merge normally
>   tools/testing/selftests: add MREMAP_RELOCATE_ANON merge test cases
>   tools/testing/selftests: expand mremap() tests for
>     MREMAP_RELOCATE_ANON
>   tools/testing/selftests: have CoW self test use MREMAP_RELOCATE_ANON
>   tools/testing/selftests: test relocate anon in split huge page test
>   tools/testing/selftests: add MREMAP_RELOCATE_ANON fork tests
> 
>  include/linux/rmap.h                          |    4 +
>  include/uapi/linux/mman.h                     |    8 +-
>  mm/internal.h                                 |    1 +
>  mm/mremap.c                                   |  719 ++++++-
>  mm/vma.c                                      |   77 +-
>  mm/vma.h                                      |   36 +-
>  tools/include/uapi/linux/mman.h               |    8 +-
>  tools/testing/selftests/mm/cow.c              |   23 +-
>  tools/testing/selftests/mm/merge.c            | 1690 ++++++++++++++++-
>  tools/testing/selftests/mm/mremap_test.c      |  262 ++-
>  .../selftests/mm/split_huge_page_test.c       |   25 +-
>  tools/testing/selftests/mm/vm_util.c          |    8 +
>  tools/testing/selftests/mm/vm_util.h          |    3 +
>  tools/testing/vma/vma.c                       |    5 +-
>  tools/testing/vma/vma_internal.h              |   38 +
>  15 files changed, 2732 insertions(+), 175 deletions(-)
> 
> --
> 2.49.0
