Message-ID: <c2979c40-0cf9-4238-9fb5-5cef6dd9f411@bytedance.com>
Date: Mon, 8 Sep 2025 22:55:07 +0800
From: Fei Li <lifei.shirley@...edance.com>
To: Paolo Bonzini <pbonzini@...hat.com>,
Sean Christopherson <seanjc@...gle.com>
Cc: tglx@...utronix.de, mingo@...hat.com, bp@...en8.de,
dave.hansen@...ux.intel.com, liran.alon@...cle.com, hpa@...or.com,
wanpeng.li@...mail.com, kvm@...r.kernel.org, x86@...nel.org,
linux-kernel@...r.kernel.org, stable@...r.kernel.org
Subject: Re: [External] Re: [PATCH] KVM: x86: Latch INITs only in specific CPU
states in KVM_SET_VCPU_EVENTS
On 9/5/25 10:59 PM, Fei Li wrote:
>
> On 8/29/25 12:44 AM, Paolo Bonzini wrote:
>> On Thu, Aug 28, 2025 at 5:13 PM Fei Li <lifei.shirley@...edance.com>
>> wrote:
>>> Actually, this bug was triggered by a monitoring tool in our production
>>> environment. The monitor issues the 'info registers -a' HMP command at a
>>> fixed frequency, even during VM startup, which can leave an AP stuck in
>>> KVM_MP_STATE_UNINITIALIZED forever. The race is very rare, though: we
>>> see about 1~2 VM hangs per week.
>>>
>>> Considering that other emulators, like cloud-hypervisor and firecracker,
>>> may have similar potential race issues, I think KVM had better do some
>>> handling. But anyway, I will check the Qemu code to avoid such a race.
>>> Thanks for both of your comments. 🙂
>> If you can check whether other emulators invoke KVM_SET_VCPU_EVENTS in
>> similar cases, that of course would help understanding the situation
>> better.
>>
>> In QEMU, it is possible to delay KVM_GET_VCPU_EVENTS until after all
>> vCPUs have halted.
>>
>> Paolo
>>
> Hi Paolo and Sean,
>
>
> Sorry for the late response, I have been a little busy with other
> things recently. The complete call sequence for the bad case is
> as follows:
>
> [1] `info registers -a` HMP command, issued every 2ms
> [2] AP(vcpu1) thread
> [3] BSP(vcpu0), which sends INIT/SIPI
>
> [2]
> KVM: KVM_RUN and then schedule() in kvm_vcpu_block() loop
>
> [1]
> for each cpu: cpu_synchronize_state
> if !qemu_thread_is_self()
> 1. insert the work item into cpu->work_list, to be handled asynchronously
> 2. then kick the AP(vcpu1) by sending the SIG_IPI/SIGUSR1 signal
>
> [2]
> KVM: checks signal_pending, breaks the loop and returns -EINTR
> Qemu: breaks the kvm_cpu_exec loop, then runs
> 1. qemu_wait_io_event()
> => process_queued_cpu_work => cpu->work_list.func()
> i.e. the do_kvm_cpu_synchronize_state() callback
> => kvm_arch_get_registers
> => kvm_get_mp_state /* KVM: get_mpstate also calls kvm_apic_accept_events() to handle INIT and SIPI */
> => cpu->vcpu_dirty = true;
> // end of qemu_wait_io_event
>
> [3]
> SeaBIOS: BSP enters non-root mode and runs reset_vector() in SeaBIOS,
> then sends INIT and then SIPI by writing APIC_ICR during smp_scan
> KVM: BSP(vcpu0) exits, then => handle_apic_write
> => kvm_lapic_reg_write => kvm_apic_send_ipi to all APs
> => for each AP: __apic_accept_irq, e.g. for AP(vcpu1)
> => case APIC_DM_INIT: apic->pending_events = (1UL << KVM_APIC_INIT)
> (the AP is not kicked yet)
> => case APIC_DM_STARTUP: set_bit(KVM_APIC_SIPI, &apic->pending_events)
> (the AP is not kicked yet)
>
> [2]
> 2. kvm_cpu_exec()
> => if (cpu->vcpu_dirty):
> => kvm_arch_put_registers
> => kvm_put_vcpu_events
> KVM: kvm_vcpu_ioctl_x86_set_vcpu_events
> => clear_bit(KVM_APIC_INIT, &vcpu->arch.apic->pending_events);
> i.e. pending_events changes from 11b to 10b (the INIT bit is cleared, only SIPI remains)
> // end of kvm_vcpu_ioctl_x86_set_vcpu_events
> Qemu: => after put_registers, cpu->vcpu_dirty = false;
> => kvm_vcpu_ioctl(cpu, KVM_RUN, 0)
> KVM: KVM_RUN
> => schedule() in kvm_vcpu_block() until Qemu's next SIG_IPI/SIGUSR1 signal
> /* But AP(vcpu1)'s mp_state will never change from KVM_MP_STATE_UNINITIALIZED
> to KVM_MP_STATE_INIT_RECEIVED, let alone to KVM_MP_STATE_RUNNABLE, because
> the INIT is never handled inside kvm_apic_accept_events() and the BSP never
> sends INIT/SIPI again during smp_scan. So AP(vcpu1) never enters
> non-root mode. */
>
> [3]
> SeaBIOS: waits for CountCPUs == expected_cpus_count and loops forever,
> i.e. the AP(vcpu1) stays at:
> EIP=0000fff0 && CS =f000 ffff0000
> and BSP(vcpu0) appears 100% utilized, as it is spinning in a while loop.
>
> As for other emulators (like cloud-hypervisor and firecracker), there
> is no interactive command like 'info registers -a'.
> But sorry again that I haven't had time to check code to confirm
> whether they invoke KVM_SET_VCPU_EVENTS in similar cases, maybe later. :)
>
>
> Have a nice day, thanks
> Fei
>
By the way, this doesn't seem to be a Qemu bug, since calling "info
registers -a" is allowed regardless of the vcpu state (including when
the VM is in the bootloader). Thus the INIT should not be latched in
this case. To fix this, I think we need to add the
kvm_apic_init_sipi_allowed() condition, so that INITs are only latched
in specific CPU states. Alternatively, we could change mp_state to
KVM_MP_STATE_INIT_RECEIVED and then clear INIT here. Should I send a v2
patch with a clearer commit message?
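
As a rough illustration of the first option, here is a minimal user-space
C model of the race. It is not kernel code and not the actual patch: the
struct, the enum values and all helper names are made up for this sketch.
Only the bit order (INIT = bit 0, SIPI = bit 1) follows the 11b -> 10b
transition in the trace above. The init_blocked field is a hypothetical
stand-in for !kvm_apic_init_sipi_allowed(), and the guard assumes that
"only latch INITs in specific CPU states" means leaving the pending INIT
bit alone whenever INIT can be delivered right away.

```c
#include <assert.h>
#include <stdbool.h>

/* Bit order taken from the "11b -> 10b" transition: INIT = bit 0, SIPI = bit 1. */
enum { APIC_INIT_BIT = 0, APIC_SIPI_BIT = 1 };

/* Model-only states; the numeric values are NOT the kernel's. */
enum mp_state { MP_UNINITIALIZED, MP_INIT_RECEIVED, MP_RUNNABLE };

struct vcpu_model {
    unsigned long pending_events;
    enum mp_state mp_state;
    bool init_blocked; /* hypothetical: true when INIT/SIPI cannot be delivered now */
};

/* Simplified stand-in for kvm_apic_accept_events(). */
static void accept_events(struct vcpu_model *v)
{
    if (v->pending_events & (1UL << APIC_INIT_BIT)) {
        v->pending_events &= ~(1UL << APIC_INIT_BIT);
        v->mp_state = MP_INIT_RECEIVED;
    }
    if (v->pending_events & (1UL << APIC_SIPI_BIT)) {
        v->pending_events &= ~(1UL << APIC_SIPI_BIT);
        /* A SIPI only starts a vCPU that has already received INIT. */
        if (v->mp_state == MP_INIT_RECEIVED)
            v->mp_state = MP_RUNNABLE;
    }
}

/* Simplified stand-in for the latched_init handling in KVM_SET_VCPU_EVENTS. */
static void set_vcpu_events(struct vcpu_model *v, bool latched_init, bool guard)
{
    /* Assumed guard: do not touch the INIT bit while INIT is deliverable. */
    if (guard && !v->init_blocked)
        return;
    if (latched_init)
        v->pending_events |= 1UL << APIC_INIT_BIT;
    else
        v->pending_events &= ~(1UL << APIC_INIT_BIT);
}

/* Replays the interleaving from the trace and returns the AP's final state. */
enum mp_state simulate_race(bool guard)
{
    struct vcpu_model ap = { 0, MP_UNINITIALIZED, false };

    /* [1]/[2] userspace snapshots the events before any INIT is pending. */
    bool stale_latched_init = (ap.pending_events >> APIC_INIT_BIT) & 1;

    /* [3] BSP sends INIT and then SIPI: pending_events becomes 11b. */
    ap.pending_events = 1UL << APIC_INIT_BIT;
    ap.pending_events |= 1UL << APIC_SIPI_BIT;

    /* [2] userspace writes the stale snapshot back, then re-enters KVM_RUN. */
    set_vcpu_events(&ap, stale_latched_init, guard);
    accept_events(&ap);

    return ap.mp_state;
}
```

Under this model, simulate_race(false) leaves the AP in MP_UNINITIALIZED
(the SIPI is consumed without a preceding INIT), while simulate_race(true)
lets it reach MP_RUNNABLE.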
Have a nice day, thanks
Fei