Message-ID: <c2979c40-0cf9-4238-9fb5-5cef6dd9f411@bytedance.com>
Date: Mon, 8 Sep 2025 22:55:07 +0800
From: Fei Li <lifei.shirley@...edance.com>
To: Paolo Bonzini <pbonzini@...hat.com>,
 Sean Christopherson <seanjc@...gle.com>
Cc: tglx@...utronix.de, mingo@...hat.com, bp@...en8.de,
 dave.hansen@...ux.intel.com, liran.alon@...cle.com, hpa@...or.com,
 wanpeng.li@...mail.com, kvm@...r.kernel.org, x86@...nel.org,
 linux-kernel@...r.kernel.org, stable@...r.kernel.org
Subject: Re: [External] Re: [PATCH] KVM: x86: Latch INITs only in specific CPU
 states in KVM_SET_VCPU_EVENTS


On 9/5/25 10:59 PM, Fei Li wrote:
>
> On 8/29/25 12:44 AM, Paolo Bonzini wrote:
>> On Thu, Aug 28, 2025 at 5:13 PM Fei Li <lifei.shirley@...edance.com> wrote:
>>> Actually this is a bug triggered by one monitor tool in our production
>>> environment. This monitor executes 'info registers -a' hmp at a fixed
>>> frequency, even during VM startup process, which makes some AP stay in
>>> KVM_MP_STATE_UNINITIALIZED forever. But this race only occurs with
>>> extremely low probability, about 1~2 VM hangs per week.
>>>
>>> Considering that other emulators, like cloud-hypervisor and firecracker,
>>> may also have similar potential race issues, I think KVM had better do
>>> some handling. But anyway, I will check the Qemu code to avoid such a
>>> race. Thanks for both of your comments. 🙂
>> If you can check whether other emulators invoke KVM_SET_VCPU_EVENTS in
>> similar cases, that of course would help understanding the situation
>> better.
>>
>> In QEMU, it is possible to delay KVM_GET_VCPU_EVENTS until after all
>> vCPUs have halted.
>>
>> Paolo
>>
> Hi Paolo and Sean,
>
>
> Sorry for the late response; I have been a little busy with other
> things recently. The complete call sequence for the bad case is as
> follows:
>
> `info registers -a` hmp per 2ms [1]   AP(vcpu1) thread [2]   BSP(vcpu0) sends INIT/SIPI [3]
>
>                                  [2]
>                                  KVM: KVM_RUN and then schedule() in the
>                                       kvm_vcpu_block() loop
>
> [1]
> for each cpu: cpu_synchronize_state
> if !qemu_thread_is_self()
> 1. insert to cpu->work_list, and handle asynchronously
> 2. then kick the AP(vcpu1) by sending SIG_IPI/SIGUSR1 signal
>
>                       [2]
>                       KVM: checks signal_pending, breaks loop and returns -EINTR
> Qemu: breaks the kvm_cpu_exec loop, then runs
>   1. qemu_wait_io_event()
>   => process_queued_cpu_work => cpu->work_list.func()
>        i.e. the do_kvm_cpu_synchronize_state() callback
>        => kvm_arch_get_registers
>             => kvm_get_mp_state /* KVM: get_mpstate also calls
>                kvm_apic_accept_events() to handle INIT and SIPI */
>        => cpu->vcpu_dirty = true;
>   // end of qemu_wait_io_event
>
>                                   [3]
>                                   SeaBIOS: BSP enters non-root mode and runs reset_vector() in SeaBIOS,
>                                            then sends INIT and SIPI by writing APIC_ICR during smp_scan
>                                   KVM: BSP(vcpu0) exits, then => handle_apic_write
>                                        => kvm_lapic_reg_write => kvm_apic_send_ipi to all APs
>                                        => for each AP: __apic_accept_irq, e.g. for AP(vcpu1)
>                                             => case APIC_DM_INIT: apic->pending_events = (1UL << KVM_APIC_INIT)
>                                                  (the AP is not kicked yet)
>                                             => case APIC_DM_STARTUP: set_bit(KVM_APIC_SIPI, &apic->pending_events)
>                                                  (the AP is not kicked yet)
>
>   [2]
>   2. kvm_cpu_exec()
>   => if (cpu->vcpu_dirty):
>      => kvm_arch_put_registers
>         => kvm_put_vcpu_events
>                       KVM: kvm_vcpu_ioctl_x86_set_vcpu_events
>                        => clear_bit(KVM_APIC_INIT, &vcpu->arch.apic->pending_events);
>                           i.e. pending_events changes from 11b to 10b
>                        // end of kvm_vcpu_ioctl_x86_set_vcpu_events
> Qemu: => after put_registers, cpu->vcpu_dirty = false;
>         => kvm_vcpu_ioctl(cpu, KVM_RUN, 0)
>                       KVM: KVM_RUN
>                        => schedule() in kvm_vcpu_block() until Qemu's next SIG_IPI/SIGUSR1 signal
>                        /* But AP(vcpu1)'s mp_state will never change from KVM_MP_STATE_UNINITIALIZED
>                           to KVM_MP_STATE_INIT_RECEIVED, and from there to KVM_MP_STATE_RUNNABLE,
>                           because INIT is never handled inside kvm_apic_accept_events() and the BSP
>                           will never send INIT/SIPI again during smp_scan. So AP(vcpu1) will never
>                           enter non-root mode. */
>
>                                   [3]
>                                   SeaBIOS: waits for CountCPUs == expected_cpus_count and loops forever,
>                                   i.e. the AP(vcpu1) stays at EIP=0000fff0 && CS =f000 ffff0000
>                                        and BSP(vcpu0) appears 100% utilized as it is in a while loop.
>
> As for other emulators (like cloud-hypervisor and firecracker), they
> have no interactive command like 'info registers -a'. Sorry again, but
> I haven't had time to check their code to confirm whether they invoke
> KVM_SET_VCPU_EVENTS in similar cases; maybe later. :)
>
>
> Have a nice day, thanks
> Fei
>
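
The interleaving quoted above can be condensed into a small user-space
model. This is only a sketch, not KVM code: the model_* names, the
struct, and the simplified accept logic are assumptions made for
illustration, and only the bit and mp_state names are borrowed from the
trace. It does not model whether a real kernel keeps or drops the
orphaned SIPI; it only shows pending_events going from 11b to 10b and
the AP staying uninitialized afterwards.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Model constants: INIT = bit 0, SIPI = bit 1, matching "11b -> 10b". */
enum { MODEL_APIC_INIT = 0, MODEL_APIC_SIPI = 1 };

enum model_mp_state {
    MODEL_MP_STATE_RUNNABLE,
    MODEL_MP_STATE_UNINITIALIZED,
    MODEL_MP_STATE_INIT_RECEIVED,
};

struct model_vcpu {
    unsigned long pending_events;
    enum model_mp_state mp_state;
};

/* BSP side: what __apic_accept_irq() does for APIC_DM_INIT in the trace. */
static void model_send_init(struct model_vcpu *v)
{
    v->pending_events = 1UL << MODEL_APIC_INIT;
}

/* BSP side: what __apic_accept_irq() does for APIC_DM_STARTUP. */
static void model_send_sipi(struct model_vcpu *v)
{
    v->pending_events |= 1UL << MODEL_APIC_SIPI;
}

/* Userspace write-back: latched_init == false clears a pending INIT. */
static void model_set_vcpu_events(struct model_vcpu *v, bool latched_init)
{
    if (latched_init)
        v->pending_events |= 1UL << MODEL_APIC_INIT;
    else
        v->pending_events &= ~(1UL << MODEL_APIC_INIT);
}

/* Simplified stand-in for kvm_apic_accept_events(): INIT moves the AP to
 * INIT_RECEIVED, and SIPI is only consumed from INIT_RECEIVED. */
static void model_accept_events(struct model_vcpu *v)
{
    if (v->pending_events & (1UL << MODEL_APIC_INIT)) {
        v->pending_events &= ~(1UL << MODEL_APIC_INIT);
        v->mp_state = MODEL_MP_STATE_INIT_RECEIVED;
    }
    if ((v->pending_events & (1UL << MODEL_APIC_SIPI)) &&
        v->mp_state == MODEL_MP_STATE_INIT_RECEIVED) {
        v->pending_events &= ~(1UL << MODEL_APIC_SIPI);
        v->mp_state = MODEL_MP_STATE_RUNNABLE;
    }
}

/* Replays the trace; clobber selects whether the stale write-back happens. */
enum model_mp_state model_replay(bool clobber, unsigned long *pending_out)
{
    struct model_vcpu ap = { 0, MODEL_MP_STATE_UNINITIALIZED };

    model_accept_events(&ap);              /* [2] get_mpstate: nothing pending */
    model_send_init(&ap);                  /* [3] BSP sends INIT */
    model_send_sipi(&ap);                  /* [3] BSP sends SIPI */
    assert(ap.pending_events == 0x3UL);    /* 11b */
    if (clobber)
        model_set_vcpu_events(&ap, false); /* [2] put_registers: 11b -> 10b */
    model_accept_events(&ap);              /* [2] next KVM_RUN */
    if (pending_out != NULL)
        *pending_out = ap.pending_events;
    return ap.mp_state;
}
```

In this model, model_replay(true, ...) leaves the AP in
MODEL_MP_STATE_UNINITIALIZED with only the SIPI bit set, while
model_replay(false, ...) reaches MODEL_MP_STATE_RUNNABLE.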

By the way, this doesn't seem to be a Qemu bug, since calling "info
registers -a" is allowed regardless of the vCPU state (including when
the VM is in the bootloader). Thus the INIT should not be latched in
this case. To fix this, I think we need to add the
kvm_apic_init_sipi_allowed() condition: only latch INITs in specific CPU
states. Alternatively, change mp_state to KVM_MP_STATE_INIT_RECEIVED and
then clear INIT here. Should I send a v2 patch with a clearer commit message?
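
To make the first option concrete, here is a rough user-space sketch of
one possible reading of it. It is not the patch itself: the sketch_*
names are hypothetical, and the init_sipi_allowed flag merely stands in
for whatever kvm_apic_init_sipi_allowed() would return. The assumption
is that userspace's latched_init value is applied only while INIT is
blocked, so an INIT that is simply pending on a vCPU able to accept it
is left alone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { SKETCH_APIC_INIT = 0 };

struct sketch_vcpu {
    unsigned long pending_events;
    bool init_sipi_allowed; /* stand-in for kvm_apic_init_sipi_allowed() */
};

/* Apply latched_init only when INIT is currently blocked. When INIT can be
 * accepted right away, a pending INIT is not a latched one, so it must
 * survive the write-back from userspace. */
void sketch_set_latched_init(struct sketch_vcpu *v, bool latched_init)
{
    assert(v != NULL);

    if (v->init_sipi_allowed)
        return; /* leave an in-flight INIT untouched */

    if (latched_init)
        v->pending_events |= 1UL << SKETCH_APIC_INIT;
    else
        v->pending_events &= ~(1UL << SKETCH_APIC_INIT);
}
```

With this gate, a write-back with latched_init == 0 no longer turns 11b
into 10b while the AP is waiting for INIT.
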
Have a nice day, thanks
Fei
