Message-ID: <Z+ymyiNlzJtM50gF@yzhao56-desk.sh.intel.com>
Date: Wed, 2 Apr 2025 10:54:02 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: Sean Christopherson <seanjc@...gle.com>
CC: Paolo Bonzini <pbonzini@...hat.com>, <kvm@...r.kernel.org>,
<linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 2/2] KVM: VMX: Use separate subclasses for PI wakeup lock
to squash false positive
On Tue, Apr 01, 2025 at 08:47:27AM -0700, Sean Christopherson wrote:
> From: Yan Zhao <yan.y.zhao@...el.com>
>
> Use a separate subclass when acquiring KVM's per-CPU posted interrupts
> wakeup lock in the scheduled out path, i.e. when adding a vCPU to the list
> of vCPUs to wake, to work around a false positive deadlock.
>
> Chain exists of:
> &p->pi_lock --> &rq->__lock --> &per_cpu(wakeup_vcpus_on_cpu_lock, cpu)
>
> Possible unsafe locking scenario:
>
>        CPU0                    CPU1
>        ----                    ----
>   lock(&per_cpu(wakeup_vcpus_on_cpu_lock, cpu));
>                                lock(&rq->__lock);
>                                lock(&per_cpu(wakeup_vcpus_on_cpu_lock, cpu));
>   lock(&p->pi_lock);
>
> *** DEADLOCK ***
>
> In the wakeup handler, the callchain is *always*:
>
> sysvec_kvm_posted_intr_wakeup_ipi()
> |
> --> pi_wakeup_handler()
>     |
>     --> kvm_vcpu_wake_up()
>         |
>         --> try_to_wake_up(),
>
> and the lock order is:
>
> &per_cpu(wakeup_vcpus_on_cpu_lock, cpu) --> &p->pi_lock.
>
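FWIW, for anyone following along, that ordering falls directly out of the
handler itself; roughly (a simplified sketch of the existing
pi_wakeup_handler(), not the verbatim code):

    void pi_wakeup_handler(void)
    {
            int cpu = smp_processor_id();
            struct list_head *wakeup_list = &per_cpu(wakeup_vcpus_on_cpu, cpu);
            raw_spinlock_t *spinlock = &per_cpu(wakeup_vcpus_on_cpu_lock, cpu);
            struct vcpu_vmx *vmx;

            /* Default acquisition, i.e. subclass 0, in IRQ context. */
            raw_spin_lock(spinlock);
            list_for_each_entry(vmx, wakeup_list, pi_wakeup_list) {
                    /*
                     * kvm_vcpu_wake_up() bottoms out in try_to_wake_up(),
                     * which takes p->pi_lock, i.e. wakeup lock --> pi_lock.
                     */
                    if (pi_test_on(&vmx->pi_desc))
                            kvm_vcpu_wake_up(&vmx->vcpu);
            }
            raw_spin_unlock(spinlock);
    }
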
> For the schedule out path, the callchain is always (for all intents and
> purposes; if the kernel is preemptible, kvm_sched_out() can be called from
> something other than schedule(), but the beginning of the callchain will
> be the same point in vcpu_block()):
>
> vcpu_block()
> |
> --> schedule()
>     |
>     --> kvm_sched_out()
>         |
>         --> vmx_vcpu_put()
>             |
>             --> vmx_vcpu_pi_put()
>                 |
>                 --> pi_enable_wakeup_handler()
>
> and the lock order is:
>
> &rq->__lock --> &per_cpu(wakeup_vcpus_on_cpu_lock, cpu)
>
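And the other half of the inversion is equally compact; a pre-patch sketch of
pi_enable_wakeup_handler(), assembled from the hunk below (the pi_desc update
is elided):

    static void pi_enable_wakeup_handler(struct kvm_vcpu *vcpu)
    {
            struct vcpu_vmx *vmx = to_vmx(vcpu);

            /* Reached from schedule() with rq->__lock already held. */
            lockdep_assert_irqs_disabled();

            /* rq->__lock --> wakeup lock, completing the reported chain. */
            raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
            list_add_tail(&vmx->pi_wakeup_list,
                          &per_cpu(wakeup_vcpus_on_cpu, vcpu->cpu));
            raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));

            ...
    }
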
> I.e. lockdep sees AB+BC ordering for schedule out, and CA ordering for
> wakeup, and complains about the A=>C versus C=>A inversion. In practice,
> deadlock can't occur between schedule out and the wakeup handler as they
> are mutually exclusive. The entirety of the schedule out code that runs
> with the problematic scheduler locks held does so with IRQs disabled,
> i.e. can't run concurrently with the wakeup handler.
>
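As a mental model of that mutual exclusion (conceptual only, not verbatim
kernel code; __schedule() disables IRQs before taking rq->__lock):

    /* sched_out path, conceptually: */
    local_irq_disable();            /* in __schedule() */
    raw_spin_lock(&rq->__lock);
    ...
    pi_enable_wakeup_handler();     /* pi_wakeup_handler() can't run on this
                                     * CPU anywhere in here */
    ...
    raw_spin_unlock(&rq->__lock);
    local_irq_enable();             /* wakeup IPI delivered only after this */
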
> Use a subclass instead of disabling lockdep entirely, and tell lockdep that
Paolo initially recommended using the subclass approach.
Do you think it's worth adding his Suggested-by tag?
BTW: is it necessary to state the subclass assignments explicitly in the
patch msg? e.g. (see the sketch below),
wakeup handler: subclass 0
sched_out: subclass 1
sched_in: subclasses 0 and 1
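If so, something like the below could make the mapping explicit; this just
restates the acquisitions from the hunks in this patch:

    /* pi_wakeup_handler(), IRQ context: subclass 0 (the default). */
    raw_spin_lock(spinlock);

    /* pi_enable_wakeup_handler(), sched_out with IRQs off: subclass 1. */
    raw_spin_lock_nested(spinlock, PI_LOCK_SCHED_OUT);

    /* vmx_vcpu_pi_load(), task context: subclasses 0 and 1. */
    raw_spin_lock(spinlock);
    spin_acquire(&spinlock->dep_map, PI_LOCK_SCHED_OUT, 0, _RET_IP_);
    ...
    spin_release(&spinlock->dep_map, _RET_IP_);
    raw_spin_unlock(spinlock);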
Aside from the minor nits, LGTM!
Thanks for polishing the patch and helping with the msg/comments :)
> both subclasses are being acquired when loading a vCPU, as the sched_out
> and sched_in paths are NOT mutually exclusive, e.g.
>
>  CPU 0            CPU 1
>  ---------------  ---------------
>  vCPU0 sched_out
>  vCPU1 sched_in
>  vCPU1 sched_out  vCPU0 sched_in
>
> where vCPU0's sched_in may race with vCPU1's sched_out on CPU 0's wakeup
> list+lock.
>
> Signed-off-by: Yan Zhao <yan.y.zhao@...el.com>
> Signed-off-by: Sean Christopherson <seanjc@...gle.com>
> ---
> arch/x86/kvm/vmx/posted_intr.c | 32 +++++++++++++++++++++++++++++---
> 1 file changed, 29 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
> index 840d435229a8..51116fe69a50 100644
> --- a/arch/x86/kvm/vmx/posted_intr.c
> +++ b/arch/x86/kvm/vmx/posted_intr.c
> @@ -31,6 +31,8 @@ static DEFINE_PER_CPU(struct list_head, wakeup_vcpus_on_cpu);
> */
> static DEFINE_PER_CPU(raw_spinlock_t, wakeup_vcpus_on_cpu_lock);
>
> +#define PI_LOCK_SCHED_OUT SINGLE_DEPTH_NESTING
> +
> static inline struct pi_desc *vcpu_to_pi_desc(struct kvm_vcpu *vcpu)
> {
> return &(to_vmx(vcpu)->pi_desc);
> @@ -89,9 +91,20 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
> * current pCPU if the task was migrated.
> */
> if (pi_desc->nv == POSTED_INTR_WAKEUP_VECTOR) {
> - raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
> + raw_spinlock_t *spinlock = &per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu);
> +
> + /*
> + * In addition to taking the wakeup lock for the regular/IRQ
> + * context, tell lockdep it is being taken for the "sched out"
> + * context as well. vCPU loads happen in task context, and
> + * this is taking the lock of the *previous* CPU, i.e. can race
> + * with both the scheduler and the wakeup handler.
> + */
> + raw_spin_lock(spinlock);
> + spin_acquire(&spinlock->dep_map, PI_LOCK_SCHED_OUT, 0, _RET_IP_);
> list_del(&vmx->pi_wakeup_list);
> - raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
> + spin_release(&spinlock->dep_map, _RET_IP_);
> + raw_spin_unlock(spinlock);
> }
>
> dest = cpu_physical_id(cpu);
> @@ -151,7 +164,20 @@ static void pi_enable_wakeup_handler(struct kvm_vcpu *vcpu)
>
> lockdep_assert_irqs_disabled();
>
> - raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
> + /*
> + * Acquire the wakeup lock using the "sched out" context to workaround
> + * a lockdep false positive. When this is called, schedule() holds
> + * various per-CPU scheduler locks. When the wakeup handler runs, it
> + * holds this CPU's wakeup lock while calling try_to_wake_up(), which
> + * can eventually take the aforementioned scheduler locks, which causes
> + * lockdep to assume there is a deadlock.
> + *
> + * Deadlock can't actually occur because IRQs are disabled for the
> + * entirety of the sched_out critical section, i.e. the wakeup handler
> + * can't run while the scheduler locks are held.
> + */
> + raw_spin_lock_nested(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu),
> + PI_LOCK_SCHED_OUT);
> list_add_tail(&vmx->pi_wakeup_list,
> &per_cpu(wakeup_vcpus_on_cpu, vcpu->cpu));
> raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
> --
> 2.49.0.472.ge94155a9ec-goog
>