linux-kernel - Re: [External] Re: [PATCH] KVM: x86: Latch INITs only in specific CPU states in KVM_SET_VCPU

Message-ID: <d686f056-180c-4a22-a359-81eadb062629@bytedance.com>
Date: Fri, 5 Sep 2025 22:59:30 +0800
From: Fei Li <lifei.shirley@...edance.com>
To: Paolo Bonzini <pbonzini@...hat.com>,
 Sean Christopherson <seanjc@...gle.com>
Cc: tglx@...utronix.de, mingo@...hat.com, bp@...en8.de,
 dave.hansen@...ux.intel.com, liran.alon@...cle.com, hpa@...or.com,
 wanpeng.li@...mail.com, kvm@...r.kernel.org, x86@...nel.org,
 linux-kernel@...r.kernel.org, stable@...r.kernel.org
Subject: Re: [External] Re: [PATCH] KVM: x86: Latch INITs only in specific CPU
 states in KVM_SET_VCPU_EVENTS


On 8/29/25 12:44 AM, Paolo Bonzini wrote:
> On Thu, Aug 28, 2025 at 5:13 PM Fei Li <lifei.shirley@...edance.com> wrote:
>> Actually this is a bug triggered by one monitor tool in our production
>> environment. This monitor executes 'info registers -a' hmp at a fixed
>> frequency, even during VM startup process, which makes some AP stay in
>> KVM_MP_STATE_UNINITIALIZED forever. But this race only occurs with
>> extremely low probability, about 1~2 VM hangs per week.
>>
>> Considering other emulators, like cloud-hypervisor and firecracker maybe
>> also have similar potential race issues, I think KVM had better do some
>> handling. But anyway, I will check Qemu code to avoid such race. Thanks
>> for both of your comments. 🙂
> If you can check whether other emulators invoke KVM_SET_VCPU_EVENTS in
> similar cases, that of course would help understanding the situation
> better.
>
> In QEMU, it is possible to delay KVM_GET_VCPU_EVENTS until after all
> vCPUs have halted.
>
> Paolo
>
Hi Paolo and Sean,


Sorry for the late response; I have been busy with other things
recently. The complete call sequence for the bad case is as follows:

`info registers -a` hmp per 2ms[1]      AP(vcpu1) thread[2]                   BSP(vcpu0) send INIT/SIPI[3]

                                  [2]
                                  KVM: KVM_RUN and then
                           schedule() in kvm_vcpu_block() loop

[1]
for each cpu: cpu_synchronize_state
if !qemu_thread_is_self()
1. insert to cpu->work_list, and handle asynchronously
2. then kick the AP(vcpu1) by sending SIG_IPI/SIGUSR1 signal

                       [2]
                       KVM: checks signal_pending, breaks loop and returns -EINTR
Qemu: break kvm_cpu_exec loop, run
   1. qemu_wait_io_event()
   => process_queued_cpu_work => cpu->work_list.func()
        i.e. do_kvm_cpu_synchronize_state() callback
        => kvm_arch_get_registers
             => kvm_get_mp_state /* KVM: get_mpstate also calls
            kvm_apic_accept_events() to handle INIT and SIPI */
        => cpu->vcpu_dirty = true;
   // end of qemu_wait_io_event

                                   [3]
                                   SeaBIOS: BSP enters non-root mode and runs reset_vector(),
                                            then sends INIT and SIPI by writing APIC_ICR during smp_scan
                                   KVM: BSP(vcpu0) exits, then => handle_apic_write
                                        => kvm_lapic_reg_write => kvm_apic_send_ipi to all APs
                                        => for each AP: __apic_accept_irq, e.g. for AP(vcpu1)
                                             => case APIC_DM_INIT: apic->pending_events = (1UL << KVM_APIC_INIT)
                                                  (AP not kicked yet)
                                             => case APIC_DM_STARTUP: set_bit(KVM_APIC_SIPI, &apic->pending_events)
                                                  (AP not kicked yet)

   [2]
   2. kvm_cpu_exec()
   => if (cpu->vcpu_dirty):
      => kvm_arch_put_registers
         => kvm_put_vcpu_events
                       KVM: kvm_vcpu_ioctl_x86_set_vcpu_events
  => clear_bit(KVM_APIC_INIT, &vcpu->arch.apic->pending_events);
       i.e. pending_events changes from 11b to 10b
  // end of kvm_vcpu_ioctl_x86_set_vcpu_events
Qemu: => after put_registers, cpu->vcpu_dirty = false;
         => kvm_vcpu_ioctl(cpu, KVM_RUN, 0)
                       KVM: KVM_RUN
  => schedule() in kvm_vcpu_block() until Qemu's next SIG_IPI/SIGUSR1 signal
  /* But AP(vcpu1)'s mp_state will never change from KVM_MP_STATE_UNINITIALIZED
     to KVM_MP_STATE_INIT_RECEIVED, and then to KVM_MP_STATE_RUNNABLE, because
     INIT is never handled inside kvm_apic_accept_events() and the BSP never
     sends INIT/SIPI again during smp_scan. So AP(vcpu1) never enters
     non-root mode. */

                                   [3]
                                   SeaBIOS: waits for CountCPUs == expected_cpus_count and loops forever,
                                   i.e. the AP(vcpu1) stays at EIP=0000fff0 && CS=f000 ffff0000
                                        and BSP(vcpu0) appears 100% utilized as it is in a while loop.

As for other emulators (like cloud-hypervisor and firecracker), they have
no interactive command like 'info registers -a'.
Sorry again, I haven't had time yet to check their code and confirm whether
they invoke KVM_SET_VCPU_EVENTS in similar cases; I will do that later. :)


Have a nice day, thanks
Fei