Message-ID: <07987fc3-5c47-4e77-956c-dae4bdf4bc2b@rbox.co>
Date: Sun, 4 Aug 2024 23:05:26 +0200
From: Michal Luczaj <mhal@...x.co>
To: Will Deacon <will@...nel.org>, Sean Christopherson <seanjc@...gle.com>
Cc: kvm@...r.kernel.org, linux-kernel@...r.kernel.org,
Paolo Bonzini <pbonzini@...hat.com>, Alexander Potapenko
<glider@...gle.com>, Marc Zyngier <maz@...nel.org>
Subject: Re: [PATCH] KVM: Fix error path in kvm_vm_ioctl_create_vcpu() on
xa_store() failure
On 8/1/24 14:41, Will Deacon wrote:
> On Wed, Jul 31, 2024 at 09:18:56AM -0700, Sean Christopherson wrote:
>> [...]
>> Ya, the basic problem is that we have two ways of publishing the vCPU, fd and
>> vcpu_array, with no way of setting both atomically. Given that xa_store() should
>> never fail, I vote we do the simple thing and deliberately leak the memory.
>
> I'm inclined to agree. This conversation did momentarily get me worried
> about the window between the successful create_vcpu_fd() and the
> xa_store(), but it looks like 'kvm->online_vcpus' protects that.
>
> I'll spin a v2 leaking the vCPU, then.
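
(For reference, the publication order in kvm_vm_ioctl_create_vcpu(), as I
read the current code; heavily simplified, with error handling, locking and
arch hooks elided, so take the details with a grain of salt:)

	vcpu->vcpu_idx = atomic_read(&kvm->online_vcpus);
	r = xa_reserve(&kvm->vcpu_array, vcpu->vcpu_idx, GFP_KERNEL_ACCOUNT);
	...
	/* Userspace can reach the vCPU through its fd from here on. */
	r = create_vcpu_fd(vcpu);
	...
	/* <-- the window: vcpu_array[vcpu_idx] still reads back as NULL */
	if (KVM_BUG_ON(xa_store(&kvm->vcpu_array, vcpu->vcpu_idx, vcpu, 0), kvm))
		...
	/* Pairs with smp_rmb() in kvm_get_vcpu(). */
	smp_wmb();
	atomic_inc(&kvm->online_vcpus);
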
But perhaps your momentary worry was justified after all; the window you've
described may be an issue. For example:
static u64 get_time_ref_counter(struct kvm *kvm)
{
	...
	vcpu = kvm_get_vcpu(kvm, 0); // may still be NULL
	tsc = kvm_read_l1_tsc(vcpu, rdtsc());
	return mul_u64_u64_shr(tsc, hv->tsc_ref.tsc_scale, 64)
		+ hv->tsc_ref.tsc_offset;
}

u64 kvm_read_l1_tsc(struct kvm_vcpu *vcpu, u64 host_tsc)
{
	return vcpu->arch.l1_tsc_offset +
		kvm_scale_tsc(host_tsc, vcpu->arch.l1_tsc_scaling_ratio);
}
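
And kvm_get_vcpu() itself doesn't help for index 0: with online_vcpus still
at 0 the bound check clamps the index back to 0, and xa_load() returns the
reserved but not yet stored slot as NULL. Roughly, from my reading of
kvm_host.h (simplified):

static inline struct kvm_vcpu *kvm_get_vcpu(struct kvm *kvm, int i)
{
	int num_vcpus = atomic_read(&kvm->online_vcpus);

	/* With online_vcpus == 0, an index of 0 stays 0 ... */
	i = array_index_nospec(i, num_vcpus);

	/* Pairs with smp_wmb() in kvm_vm_ioctl_create_vcpu(). */
	smp_rmb();

	/* ... and the reserved, not-yet-stored entry reads back as NULL. */
	return xa_load(&kvm->vcpu_array, i);
}
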
After stuffing an msleep() between the fd install and the vcpu_array store:
[ 125.296110] BUG: kernel NULL pointer dereference, address: 0000000000000b38
[ 125.296203] #PF: supervisor read access in kernel mode
[ 125.296266] #PF: error_code(0x0000) - not-present page
[ 125.296327] PGD 12539e067 P4D 12539e067 PUD 12539d067 PMD 0
[ 125.296392] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 125.296454] CPU: 12 UID: 1000 PID: 1179 Comm: a.out Not tainted 6.11.0-rc1nokasan+ #19
[ 125.296521] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[ 125.296585] RIP: 0010:kvm_read_l1_tsc+0x6/0x50 [kvm]
[ 125.297376] Call Trace:
[ 125.297430] <TASK>
[ 125.297919] get_time_ref_counter+0x70/0x90 [kvm]
[ 125.298039] kvm_hv_get_msr_common+0xc1/0x7d0 [kvm]
[ 125.298150] __kvm_get_msr+0x72/0xf0 [kvm]
[ 125.298421] do_get_msr+0x16/0x50 [kvm]
[ 125.298531] msr_io+0x9d/0x110 [kvm]
[ 125.298626] kvm_arch_vcpu_ioctl+0xdc5/0x19c0 [kvm]
[ 125.299345] kvm_vcpu_ioctl+0x6cc/0x920 [kvm]
[ 125.299540] __x64_sys_ioctl+0x90/0xd0
[ 125.299582] do_syscall_64+0x93/0x180
[ 125.300206] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 125.300243] RIP: 0033:0x7f2d64aded2d
So, is get_time_ref_counter() broken (with a trivial fix), or should this be
considered a regression introduced by commit afb2acb2e3a3
("KVM: Fix vcpu_array[0] races")?
Note that a KASAN build, after the NULL pointer oops, reports:
[ 3528.449742] BUG: KASAN: slab-use-after-free in mutex_can_spin_on_owner+0x18b/0x1b0
[ 3528.449884] Read of size 4 at addr ffff88814a040034 by task a.out/1240
[ 3528.450135] CPU: 6 UID: 1000 PID: 1240 Comm: a.out Tainted: G D 6.11.0-rc1+ #20
[ 3528.450289] Tainted: [D]=DIE
[ 3528.450412] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[ 3528.450551] Call Trace:
[ 3528.450677] <TASK>
[ 3528.450802] dump_stack_lvl+0x68/0x90
[ 3528.450940] print_report+0x174/0x4f6
[ 3528.451074] ? __virt_addr_valid+0x208/0x410
[ 3528.451205] ? mutex_can_spin_on_owner+0x18b/0x1b0
[ 3528.451337] kasan_report+0xb9/0x190
[ 3528.451469] ? mutex_can_spin_on_owner+0x18b/0x1b0
[ 3528.451606] mutex_can_spin_on_owner+0x18b/0x1b0
[ 3528.451737] __mutex_lock+0x1e3/0x1010
[ 3528.451871] ? kvm_arch_vcpu_postcreate+0x3c/0x150 [kvm]
[ 3528.452321] ? __pfx___mutex_lock+0x10/0x10
[ 3528.452456] ? __pfx_lock_release+0x10/0x10
[ 3528.452642] ? __pfx___mutex_unlock_slowpath+0x10/0x10
[ 3528.452794] ? __pfx_do_raw_spin_lock+0x10/0x10
[ 3528.452928] ? kvm_arch_vcpu_postcreate+0x3c/0x150 [kvm]
[ 3528.453303] kvm_arch_vcpu_postcreate+0x3c/0x150 [kvm]
[ 3528.453663] kvm_vm_ioctl+0x1b73/0x21b0 [kvm]
[ 3528.454025] ? mark_lock+0xe2/0x1530
[ 3528.454160] ? __pfx_kvm_vm_ioctl+0x10/0x10 [kvm]
[ 3528.454543] ? __pfx_mark_lock+0x10/0x10
[ 3528.454686] ? __pfx_lock_release+0x10/0x10
[ 3528.454826] ? schedule+0xe8/0x3b0
[ 3528.454970] ? __lock_acquire+0xd68/0x5e20
[ 3528.455114] ? futex_wait_setup+0xb2/0x190
[ 3528.455252] ? __entry_text_end+0x1543/0x10260d
[ 3528.455385] ? __pfx___lock_acquire+0x10/0x10
[ 3528.455542] ? __pfx_futex_wake_mark+0x10/0x10
[ 3528.455676] ? __pfx_do_vfs_ioctl+0x10/0x10
[ 3528.455826] ? find_held_lock+0x2d/0x110
[ 3528.455959] ? lock_release+0x44b/0x770
[ 3528.456090] ? __pfx_futex_wait+0x10/0x10
[ 3528.456222] ? __pfx_ioctl_has_perm.constprop.0.isra.0+0x10/0x10
[ 3528.456368] ? __fget_files+0x1d6/0x340
[ 3528.456508] __x64_sys_ioctl+0x130/0x1a0
[ 3528.456641] do_syscall_64+0x93/0x180
[ 3528.456772] ? lockdep_hardirqs_on_prepare+0x16d/0x400
[ 3528.456907] ? do_syscall_64+0x9f/0x180
[ 3528.457034] ? lockdep_hardirqs_on+0x78/0x100
[ 3528.457164] ? do_syscall_64+0x9f/0x180
[ 3528.457292] ? lock_release+0x44b/0x770
[ 3528.457425] ? __pfx_lock_release+0x10/0x10
[ 3528.457560] ? lockdep_hardirqs_on_prepare+0x16d/0x400
[ 3528.457694] ? do_syscall_64+0x9f/0x180
[ 3528.457821] ? lockdep_hardirqs_on+0x78/0x100
[ 3528.457950] ? do_syscall_64+0x9f/0x180
[ 3528.458081] ? clear_bhb_loop+0x45/0xa0
[ 3528.458210] ? clear_bhb_loop+0x45/0xa0
[ 3528.458341] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 3528.458475] RIP: 0033:0x7f2457d4ed2d
[ 3528.458609] Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00
[ 3528.458783] RSP: 002b:00007ffd616de5d0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 3528.458927] RAX: ffffffffffffffda RBX: 00007ffd616de778 RCX: 00007f2457d4ed2d
[ 3528.459062] RDX: 0000000000000000 RSI: 000000000000ae41 RDI: 0000000000000004
[ 3528.459195] RBP: 00007ffd616de620 R08: 00000000004040c0 R09: 0000000000000001
[ 3528.459327] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
[ 3528.459459] R13: 0000000000000000 R14: 00007f2457e77000 R15: 0000000000403e00
[ 3528.459604] </TASK>
[ 3528.459843] Allocated by task 1240:
[ 3528.459968] kasan_save_stack+0x1e/0x40
[ 3528.460100] kasan_save_track+0x10/0x30
[ 3528.460230] __kasan_slab_alloc+0x85/0x90
[ 3528.460357] kmem_cache_alloc_node_noprof+0x12c/0x360
[ 3528.460488] copy_process+0x372/0x8470
[ 3528.460617] kernel_clone+0xa6/0x620
[ 3528.460744] __do_sys_clone3+0x109/0x140
[ 3528.460873] do_syscall_64+0x93/0x180
[ 3528.460998] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 3528.461246] Freed by task 0:
[ 3528.461368] kasan_save_stack+0x1e/0x40
[ 3528.461499] kasan_save_track+0x10/0x30
[ 3528.461628] kasan_save_free_info+0x37/0x70
[ 3528.461757] poison_slab_object+0x109/0x180
[ 3528.461875] __kasan_slab_free+0x2e/0x50
[ 3528.461988] kmem_cache_free+0x17d/0x450
[ 3528.462108] delayed_put_task_struct+0x16a/0x1f0
[ 3528.462226] rcu_do_batch+0x368/0xd50
[ 3528.462342] rcu_core+0x6d5/0xb60
[ 3528.462458] handle_softirqs+0x1b4/0x770
[ 3528.462572] __irq_exit_rcu+0xbb/0x1c0
[ 3528.462683] irq_exit_rcu+0xa/0x30
[ 3528.462786] sysvec_apic_timer_interrupt+0x9d/0xc0
[ 3528.462891] asm_sysvec_apic_timer_interrupt+0x16/0x20
[ 3528.463095] Last potentially related work creation:
[ 3528.463198] kasan_save_stack+0x1e/0x40
[ 3528.463305] __kasan_record_aux_stack+0xad/0xc0
[ 3528.463411] __call_rcu_common.constprop.0+0xae/0xe80
[ 3528.463519] __schedule+0xfd8/0x5ee0
[ 3528.463622] schedule_idle+0x52/0x80
[ 3528.463725] do_idle+0x25e/0x3d0
[ 3528.463828] cpu_startup_entry+0x50/0x60
[ 3528.463925] start_secondary+0x201/0x280
[ 3528.464023] common_startup_64+0x13e/0x141
[ 3528.464209] Second to last potentially related work creation:
[ 3528.464306] kasan_save_stack+0x1e/0x40
[ 3528.464396] __kasan_record_aux_stack+0xad/0xc0
[ 3528.464485] task_work_add+0x1bd/0x270
[ 3528.464574] sched_tick+0x2c0/0x9d0
[ 3528.464662] update_process_times+0xd5/0x130
[ 3528.464753] tick_nohz_handler+0x1ae/0x4a0
[ 3528.464839] __hrtimer_run_queues+0x164/0x880
[ 3528.464923] hrtimer_interrupt+0x2f1/0x7f0
[ 3528.465007] __sysvec_apic_timer_interrupt+0xfd/0x3d0
[ 3528.465091] sysvec_apic_timer_interrupt+0x98/0xc0
[ 3528.465173] asm_sysvec_apic_timer_interrupt+0x16/0x20
[ 3528.465352] The buggy address belongs to the object at ffff88814a040000
which belongs to the cache task_struct of size 13128
[ 3528.465457] The buggy address is located 52 bytes inside of
freed 13128-byte region [ffff88814a040000, ffff88814a043348)
[ 3528.465629] The buggy address belongs to the physical page:
[ 3528.465712] page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x14a040
[ 3528.465802] head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
[ 3528.465888] memcg:ffff88812c785601
[ 3528.465963] flags: 0x17ffffc0000040(head|node=0|zone=2|lastcpupid=0x1fffff)
[ 3528.466045] page_type: 0xfdffffff(slab)
[ 3528.466120] raw: 0017ffffc0000040 ffff8881002cea00 ffffea00041e4000 dead000000000004
[ 3528.466198] raw: 0000000000000000 0000000080020002 00000001fdffffff ffff88812c785601
[ 3528.466275] head: 0017ffffc0000040 ffff8881002cea00 ffffea00041e4000 dead000000000004
[ 3528.466351] head: 0000000000000000 0000000080020002 00000001fdffffff ffff88812c785601
[ 3528.466428] head: 0017ffffc0000003 ffffea0005281001 ffffffffffffffff 0000000000000000
[ 3528.466504] head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000
[ 3528.466579] page dumped because: kasan: bad access detected
[ 3528.466717] Memory state around the buggy address:
[ 3528.466789] ffff88814a03ff00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 3528.466860] ffff88814a03ff80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 3528.466931] >ffff88814a040000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 3528.467001] ^
[ 3528.467069] ffff88814a040080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 3528.467135] ffff88814a040100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 3528.467201] ==================================================================