[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <Z9OfqukbHB0lz72y@ly-workstation>
Date: Fri, 14 Mar 2025 11:16:58 +0800
From: "Lai, Yi" <yi1.lai@...ux.intel.com>
To: Yang Shi <yang@...amperecomputing.com>
Cc: Liam.Howlett@...cle.com, lorenzo.stoakes@...cle.com, vbabka@...e.cz,
jannh@...gle.com, oliver.sang@...el.com, akpm@...ux-foundation.org,
linux-mm@...ck.org, linux-kernel@...r.kernel.org, yi1.lai@...el.com,
syzkaller-bugs@...glegroups.com
Subject: Re: [v2 PATCH] mm: vma: skip anonymous vma when inserting vma to
file rmap tree
On Wed, Mar 12, 2025 at 03:15:21PM -0700, Yang Shi wrote:
> LKP reported 800% performance improvement for small-allocs benchmark
> from vm-scalability [1] with patch ("/dev/zero: make private mapping
> full anonymous mapping") [2], but the patch was nack'ed since it changes
> the output of smaps somewhat.
>
> The profiling shows one of the major sources of the performance
> improvement is the less contention to i_mmap_rwsem.
>
> The small-allocs benchmark creates a lot of 40K size memory maps by
> mmap'ing private /dev/zero then triggers page fault on the mappings.
> When creating private mapping for /dev/zero, the anonymous VMA is
> created, but it has valid vm_file. Kernel basically assumes anonymous
> VMAs should have NULL vm_file, for example, mmap inserts VMA to the file
> rmap tree if vm_file is not NULL. So the private /dev/zero mapping
> will be inserted to the file rmap tree, this resulted in the contention
> to i_mmap_rwsem. But it is actually anonymous VMA, so it is pointless
> to insert it to file rmap tree.
>
> Skip anonymous VMA for this case. Over 400% performance improvement was
> reported [3].
>
> It is not on par with the 800% improvement from the original patch. It is
> because page fault handler needs to access some members of struct file
> if vm_file is not NULL, for example, f_mode and f_mapping. They are in
> the same cacheline with file refcount. When mmap'ing a file the file
> refcount is inc'ed and dec'ed, this caused bad cache false sharing
> problem. The further debug showed checking whether the VMA is anonymous
> or not can alleviate the problem. But I'm not sure whether it is the
> best way to handle it, maybe we should consider shuffle the layout of
> struct file.
>
> However it sounds rare that real life applications would create that
> many maps with mmap'ing private /dev/zero and share the same struct
> file, so the cache false sharing problem may be not that bad. But
> i_mmap_rwsem contention problem seems more real since all /dev/zero
> private mappings even from different applications share the same struct
> address_space so the same i_mmap_rwsem. Inserting anonymous VMA into
> file rmap tree is also a broken behavior. It is worth fixing from this
> perspective too.
>
> [1] https://lore.kernel.org/linux-mm/202501281038.617c6b60-lkp@intel.com/
> [2] https://lore.kernel.org/linux-mm/20250113223033.4054534-1-yang@os.amperecomputing.com/
> [3] https://lore.kernel.org/linux-mm/Z6RshwXCWhAGoMOK@xsang-OptiPlex-9020/#t
>
> Reported-by: kernel test robot <oliver.sang@...el.com>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
> Signed-off-by: Yang Shi <yang@...amperecomputing.com>
> ---
> v2:
> * Added the comments in code suggested by Lorenzo
> * Collected R-b from Lorenze
>
> mm/vma.c | 18 ++++++++++++++++--
> 1 file changed, 16 insertions(+), 2 deletions(-)
>
> diff --git a/mm/vma.c b/mm/vma.c
> index c7abef5177cc..2fe99d181cfd 100644
> --- a/mm/vma.c
> +++ b/mm/vma.c
> @@ -1648,6 +1648,10 @@ static void unlink_file_vma_batch_process(struct unlink_vma_file_batch *vb)
> void unlink_file_vma_batch_add(struct unlink_vma_file_batch *vb,
> struct vm_area_struct *vma)
> {
> + /* Rare, but e.g. /dev/zero sets vma->vm_file on an anon VMA */
> + if (vma_is_anonymous(vma))
> + return;
> +
> if (vma->vm_file == NULL)
> return;
>
> @@ -1671,8 +1675,13 @@ void unlink_file_vma_batch_final(struct unlink_vma_file_batch *vb)
> */
> void unlink_file_vma(struct vm_area_struct *vma)
> {
> - struct file *file = vma->vm_file;
> + struct file *file;
> +
> + /* Rare, but e.g. /dev/zero sets vma->vm_file on an anon VMA */
> + if (vma_is_anonymous(vma))
> + return;
>
> + file = vma->vm_file;
> if (file) {
> struct address_space *mapping = file->f_mapping;
>
> @@ -1684,9 +1693,14 @@ void unlink_file_vma(struct vm_area_struct *vma)
>
> void vma_link_file(struct vm_area_struct *vma)
> {
> - struct file *file = vma->vm_file;
> + struct file *file;
> struct address_space *mapping;
>
> + /* Rare, but e.g. /dev/zero sets vma->vm_file on an anon VMA */
> + if (vma_is_anonymous(vma))
> + return;
> +
> + file = vma->vm_file;
> if (file) {
> mapping = file->f_mapping;
> i_mmap_lock_write(mapping);
> --
> 2.48.1
>
Hi Yang Shi,
Greetings!
I used Syzkaller and found that there are two issues in v6.14-rc6 and were bisected to your patch as the first bad commit:
general protection fault in vma_interval_tree_insert_after
KASAN: slab-use-after-free Read in vma_interval_tree_insert
I see that you asked the patch to be dropped in maintainer's tree. I hope the issue dmesg can be insightful to you and the reproduction binary can be served to test your new design.
Issue one - general protection fault in vma_interval_tree_insert_after:
"
[ 26.488762] ? __rb_insert_augmented+0x7a/0x9d0
[ 26.489380] ? down_write+0x155/0x210
[ 26.489879] ? __pfx_down_write+0x10/0x10
[ 26.490444] vma_interval_tree_insert_after+0x2a2/0x370
[ 26.491190] copy_mm+0x11f6/0x2740
[ 26.491702] ? __pfx_copy_mm+0x10/0x10
[ 26.492242] ? _raw_spin_unlock_irqrestore+0x35/0x70
[ 26.492934] ? lockdep_hardirqs_on+0x89/0x110
[ 26.493559] ? __raw_spin_lock_init+0x44/0x120
[ 26.494201] copy_process+0x29d8/0x69c0
[ 26.494752] ? __pfx_copy_process+0x10/0x10
[ 26.495352] ? lock_is_held_type+0xef/0x150
[ 26.495947] ? __kasan_check_read+0x15/0x20
[ 26.496548] ? __lock_acquire+0x1bad/0x5d60
[ 26.497202] kernel_clone+0xfc/0x8c0
[ 26.497574] ? __pfx_kernel_clone+0x10/0x10
[ 26.497979] ? __pfx___lock_acquire+0x10/0x10
[ 26.498370] ? __pfx_do_mmap+0x10/0x10
[ 26.498722] __do_sys_clone+0xf5/0x140
[ 26.499074] ? __pfx___do_sys_clone+0x10/0x10
[ 26.499478] ? seqcount_lockdep_reader_access.constprop.0+0xc0/0xd0
[ 26.500051] ? __sanitizer_cov_trace_cmp4+0x1a/0x20
[ 26.500485] ? ktime_get_coarse_real_ts64+0xb6/0x100
[ 26.500961] __x64_sys_clone+0xc7/0x150
[ 26.501445] ? syscall_trace_enter+0x14d/0x280
[ 26.501985] x64_sys_call+0x1acf/0x2150
[ 26.502475] do_syscall_64+0x6d/0x140
[ 26.502956] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 26.503410] RIP: 0033:0x7f286523ee5d
[ 26.503739] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 93 af 1b 00 f7 d8 64 89 01 48
[ 26.505269] RSP: 002b:00007ffff77719f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000038
[ 26.505908] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f286523ee5d
[ 26.506509] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 26.507116] RBP: 00007ffff7771a40 R08: 0000000000000000 R09: 0000000000000000
[ 26.507717] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffff7771bb8
[ 26.508309] R13: 000000000040183a R14: 0000000000403e08 R15: 00007f28655e3000
[ 26.508911] </TASK>
[ 26.509112] Modules linked in:
[ 26.509761] ---[ end trace 0000000000000000 ]---
[ 26.510167] RIP: 0010:__rb_insert_augmented+0x7a/0x9d0
[ 26.510615] Code: 89 e2 48 c1 ea 03 42 80 3c 32 00 0f 85 9c 05 00 00 4d 8b 2c 24 41 f6 c5 01 0f 85 88 01 00 00 4d 8d 45 08 4c 89 c2 48 c1 ea 03 <42> 80 3c 32 00 0f 85 95 05 00 00 4d 8b 7d 08 4d 39 e7 0f 84 78 01
[ 26.512126] RSP: 0018:ffff88801d53f8d0 EFLAGS: 00010202
[ 26.512569] RAX: ffffffff81d744d0 RBX: ffff888013c26970 RCX: ffff88800ed4ea80
[ 26.513192] RDX: 0000000000000001 RSI: 1ffff11002784d2e RDI: ffff888013c26970
[ 26.513787] RBP: ffff88801d53f918 R08: 0000000000000008 R09: ffffed1001da9d62
[ 26.514376] R10: 0000000000000000 R11: 0000000000000001 R12: ffff888021604830
[ 26.514974] R13: 0000000000000000 R14: dffffc0000000000 R15: ffff888021604838
[ 26.515568] FS: 00007f2865596740(0000) GS:ffff8880e368d000(0000) knlGS:0000000000000000
[ 26.516235] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 26.516742] CR2: 0000000020000000 CR3: 000000001f1f4003 CR4: 0000000000770ef0
[ 26.517340] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 26.517930] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
[ 26.518522] PKRU: 55555554
"
All detailed into can be found at:
https://github.com/laifryiee/syzkaller_logs/tree/main/250313_011927_vma_interval_tree_insert_after
Syzkaller repro code:
https://github.com/laifryiee/syzkaller_logs/tree/main/250313_011927_vma_interval_tree_insert_after/repro.c
Syzkaller repro syscall steps:
https://github.com/laifryiee/syzkaller_logs/tree/main/250313_011927_vma_interval_tree_insert_after/repro.prog
Syzkaller report:
https://github.com/laifryiee/syzkaller_logs/tree/main/250313_011927_vma_interval_tree_insert_after/repro.report
Kconfig(make olddefconfig):
https://github.com/laifryiee/syzkaller_logs/tree/main/250313_011927_vma_interval_tree_insert_after/kconfig_origin
Bisect info:
https://github.com/laifryiee/syzkaller_logs/tree/main/250313_011927_vma_interval_tree_insert_after/bisect_info.log
bzImage:
https://github.com/laifryiee/syzkaller_logs/raw/refs/heads/main/250313_011927_vma_interval_tree_insert_after/bzImage_eea255893718268e1ab852fb52f70c613d109b99
Issue dmesg:
https://github.com/laifryiee/syzkaller_logs/blob/main/250313_011927_vma_interval_tree_insert_after/eea255893718268e1ab852fb52f70c613d109b99_dmesg.log
Issue two - KASAN: slab-use-after-free Read in vma_interval_tree_insert
"
[ 18.362663] ==================================================================
[ 18.363058] BUG: KASAN: slab-use-after-free in vma_interval_tree_insert+0x3ac/0x460
[ 18.363448] Read of size 8 at addr ffff8880178025c8 by task repro/731
[ 18.363756]
[ 18.363850] CPU: 1 UID: 0 PID: 731 Comm: repro Not tainted 6.14.0-rc6-next-20250311-eea255893718 #1
[ 18.363858] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[ 18.363865] Call Trace:
[ 18.363872] <TASK>
[ 18.363877] dump_stack_lvl+0xea/0x150
[ 18.363905] print_report+0xce/0x660
[ 18.363918] ? vma_interval_tree_insert+0x3ac/0x460
[ 18.363926] ? kasan_complete_mode_report_info+0x80/0x200
[ 18.363934] ? vma_interval_tree_insert+0x3ac/0x460
[ 18.363940] kasan_report+0xd6/0x110
[ 18.363946] ? vma_interval_tree_insert+0x3ac/0x460
[ 18.363955] __asan_report_load8_noabort+0x18/0x20
[ 18.363961] vma_interval_tree_insert+0x3ac/0x460
[ 18.363969] vma_prepare+0x23f/0x6b0
[ 18.363981] __split_vma+0x8df/0xe70
[ 18.363988] ? __pfx___split_vma+0x10/0x10
[ 18.363995] ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
[ 18.364007] ? mas_walk+0x6a7/0x8b0
[ 18.364016] vms_gather_munmap_vmas+0x17b/0xd40
[ 18.364024] __mmap_region+0x312/0x23e0
[ 18.364032] ? __pfx___mmap_region+0x10/0x10
[ 18.364039] ? __kasan_check_read+0x15/0x20
[ 18.364049] ? mark_lock.part.0+0xf2/0x17a0
[ 18.364063] ? __pfx_mark_lock.part.0+0x10/0x10
[ 18.364069] ? stack_trace_save+0x96/0xd0
[ 18.364094] ? __this_cpu_preempt_check+0x21/0x30
[ 18.364107] ? lock_is_held_type+0xef/0x150
[ 18.364114] mmap_region+0x1c0/0x3e0
[ 18.364120] ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30
[ 18.364128] do_mmap+0xe0c/0x1270
[ 18.364137] ? __pfx_do_mmap+0x10/0x10
[ 18.364144] ? down_write_killable+0x163/0x250
[ 18.364152] ? __pfx_down_write_killable+0x10/0x10
[ 18.364157] ? __this_cpu_preempt_check+0x21/0x30
[ 18.364166] vm_mmap_pgoff+0x233/0x3d0
[ 18.364176] ? __pfx_vm_mmap_pgoff+0x10/0x10
[ 18.364182] ? __fget_files+0x204/0x3b0
[ 18.364196] ksys_mmap_pgoff+0x3dc/0x520
[ 18.364206] __x64_sys_mmap+0x139/0x1d0
[ 18.364218] x64_sys_call+0x200d/0x2150
[ 18.364226] do_syscall_64+0x6d/0x140
[ 18.364235] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 18.364241] RIP: 0033:0x7ff9b4c3ee5d
[ 18.364249] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 93 af 1b 00 f7 d8 64 89 01 48
[ 18.364255] RSP: 002b:00007ffdbb3b6718 EFLAGS: 00000216 ORIG_RAX: 0000000000000009
[ 18.364269] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007ff9b4c3ee5d
[ 18.364272] RDX: 0000000000000000 RSI: 0000000000001000 RDI: 0000000020ffc000
[ 18.364276] RBP: 00007ffdbb3b6740 R08: 0000000000000005 R09: 0000000000000000
[ 18.364279] R10: 0000000000000012 R11: 0000000000000216 R12: 00007ffdbb3b6898
[ 18.364283] R13: 000000000040181e R14: 0000000000403e08 R15: 00007ff9b4f19000
[ 18.364290] </TASK>
[ 18.364292]
[ 18.376442] Allocated by task 730:
[ 18.376616] kasan_save_stack+0x2c/0x60
[ 18.376812] kasan_save_track+0x18/0x40
[ 18.377004] kasan_save_alloc_info+0x3c/0x50
[ 18.377220] __kasan_slab_alloc+0x62/0x80
[ 18.377417] kmem_cache_alloc_noprof+0x13d/0x440
[ 18.377649] vm_area_alloc+0x29/0x180
[ 18.377839] __mmap_region+0xced/0x23e0
[ 18.378033] mmap_region+0x1c0/0x3e0
[ 18.378464] do_mmap+0xe0c/0x1270
[ 18.378807] vm_mmap_pgoff+0x233/0x3d0
[ 18.379188] ksys_mmap_pgoff+0x3dc/0x520
[ 18.379584] __x64_sys_mmap+0x139/0x1d0
[ 18.379968] x64_sys_call+0x200d/0x2150
[ 18.380353] do_syscall_64+0x6d/0x140
[ 18.380724] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 18.381219]
[ 18.381389] Freed by task 24:
[ 18.381694] kasan_save_stack+0x2c/0x60
[ 18.382191] kasan_save_track+0x18/0x40
[ 18.382387] kasan_save_free_info+0x3f/0x60
[ 18.382591] __kasan_slab_free+0x3d/0x60
[ 18.382784] slab_free_after_rcu_debug+0xdb/0x2b0
[ 18.383016] rcu_core+0x86b/0x1920
[ 18.383198] rcu_core_si+0x12/0x20
[ 18.383368] handle_softirqs+0x1c5/0x860
[ 18.383563] run_ksoftirqd+0x46/0x70
[ 18.383739] smpboot_thread_fn+0x666/0xa20
[ 18.383942] kthread+0x444/0x980
[ 18.384114] ret_from_fork+0x56/0x90
[ 18.384296] ret_from_fork_asm+0x1a/0x30
[ 18.384487]
[ 18.384572] Last potentially related work creation:
[ 18.384804] kasan_save_stack+0x2c/0x60
[ 18.384993] kasan_record_aux_stack+0x93/0xa0
[ 18.385211] kmem_cache_free+0x1b8/0x540
[ 18.385402] vm_area_free+0xa5/0xd0
[ 18.385584] remove_vma+0x135/0x180
[ 18.385763] vms_complete_munmap_vmas+0x432/0x810
[ 18.386000] __mmap_region+0x70c/0x23e0
[ 18.386193] mmap_region+0x1c0/0x3e0
[ 18.386377] do_mmap+0xe0c/0x1270
[ 18.386545] vm_mmap_pgoff+0x233/0x3d0
[ 18.386737] ksys_mmap_pgoff+0x3dc/0x520
[ 18.386937] __x64_sys_mmap+0x139/0x1d0
"
All detailed into can be found at:
https://github.com/laifryiee/syzkaller_logs/tree/main/250313_133334_vma_interval_tree_insert
Syzkaller repro code:
https://github.com/laifryiee/syzkaller_logs/tree/main/250313_133334_vma_interval_tree_insert/repro.c
Syzkaller repro syscall steps:
https://github.com/laifryiee/syzkaller_logs/tree/main/250313_133334_vma_interval_tree_insert/repro.prog
Syzkaller report:
https://github.com/laifryiee/syzkaller_logs/tree/main/250313_133334_vma_interval_tree_insert/repro.report
Kconfig(make olddefconfig):
https://github.com/laifryiee/syzkaller_logs/tree/main/250313_133334_vma_interval_tree_insert/kconfig_origin
Bisect info:
https://github.com/laifryiee/syzkaller_logs/tree/main/250313_133334_vma_interval_tree_insert/bisect_info.log
Issue dmesg:
https://github.com/laifryiee/syzkaller_logs/blob/main/250313_133334_vma_interval_tree_insert/eea255893718268e1ab852fb52f70c613d109b99_dmesg.log
Regards,
Yi Lai
---
If you don't need the following environment to reproduce the problem or if you
already have one reproduced environment, please ignore the following information.
How to reproduce:
git clone https://gitlab.com/xupengfe/repro_vm_env.git
cd repro_vm_env
tar -xvf repro_vm_env.tar.gz
cd repro_vm_env; ./start3.sh // it needs qemu-system-x86_64 and I used v7.1.0
// start3.sh will load bzImage_2241ab53cbb5cdb08a6b2d4688feb13971058f65 v6.2-rc5 kernel
// You could change the bzImage_xxx as you want
// Maybe you need to remove line "-drive if=pflash,format=raw,readonly=on,file=./OVMF_CODE.fd \" for different qemu version
You could use below command to log in, there is no password for root.
ssh -p 10023 root@...alhost
After login vm(virtual machine) successfully, you could transfer reproduced
binary to the vm by below way, and reproduce the problem in vm:
gcc -pthread -o repro repro.c
scp -P 10023 repro root@...alhost:/root/
Get the bzImage for target kernel:
Please use target kconfig and copy it to kernel_src/.config
make olddefconfig
make -jx bzImage //x should equal or less than cpu num your pc has
Fill the bzImage file into above start3.sh to load the target kernel in vm.
Tips:
If you already have qemu-system-x86_64, please ignore below info.
If you want to install qemu v7.1.0 version:
git clone https://github.com/qemu/qemu.git
cd qemu
git checkout -f v7.1.0
mkdir build
cd build
yum install -y ninja-build.x86_64
yum -y install libslirp-devel.x86_64
../configure --target-list=x86_64-softmmu --enable-kvm --enable-vnc --enable-gtk --enable-sdl --enable-usb-redir --enable-slirp
make
make install
Powered by blists - more mailing lists