Message-ID: <Z+ymyiNlzJtM50gF@yzhao56-desk.sh.intel.com>
Date: Wed, 2 Apr 2025 10:54:02 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: Sean Christopherson <seanjc@...gle.com>
CC: Paolo Bonzini <pbonzini@...hat.com>, <kvm@...r.kernel.org>,
	<linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 2/2] KVM: VMX: Use separate subclasses for PI wakeup lock
 to squash false positive

On Tue, Apr 01, 2025 at 08:47:27AM -0700, Sean Christopherson wrote:
> From: Yan Zhao <yan.y.zhao@...el.com>
> 
> Use a separate subclass when acquiring KVM's per-CPU posted interrupts
> wakeup lock in the scheduled out path, i.e. when adding a vCPU on the list
> of vCPUs to wake, to work around a false positive deadlock.
> 
>   Chain exists of:
>    &p->pi_lock --> &rq->__lock --> &per_cpu(wakeup_vcpus_on_cpu_lock, cpu)
> 
>   Possible unsafe locking scenario:
> 
>         CPU0                CPU1
>         ----                ----
>    lock(&per_cpu(wakeup_vcpus_on_cpu_lock, cpu));
>                             lock(&rq->__lock);
>                             lock(&per_cpu(wakeup_vcpus_on_cpu_lock, cpu));
>    lock(&p->pi_lock);
> 
>   *** DEADLOCK ***
> 
> In the wakeup handler, the callchain is *always*:
> 
>   sysvec_kvm_posted_intr_wakeup_ipi()
>   |
>   --> pi_wakeup_handler()
>       |
>       --> kvm_vcpu_wake_up()
>           |
>           --> try_to_wake_up(),
> 
> and the lock order is:
> 
>   &per_cpu(wakeup_vcpus_on_cpu_lock, cpu) --> &p->pi_lock.
> 
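For reference, the handler side looks roughly like this (condensed from
pi_wakeup_handler() in posted_intr.c, details omitted):

  void pi_wakeup_handler(void)
  {
          int cpu = smp_processor_id();
          struct list_head *wakeup_list = &per_cpu(wakeup_vcpus_on_cpu, cpu);
          raw_spinlock_t *spinlock = &per_cpu(wakeup_vcpus_on_cpu_lock, cpu);
          struct vcpu_vmx *vmx;

          raw_spin_lock(spinlock);        /* wakeup lock, plain subclass 0 */
          list_for_each_entry(vmx, wakeup_list, pi_wakeup_list) {
                  if (pi_test_on(&vmx->pi_desc))
                          kvm_vcpu_wake_up(&vmx->vcpu);  /* -> try_to_wake_up() */
          }
          raw_spin_unlock(spinlock);
  }

i.e. the wakeup lock is held across kvm_vcpu_wake_up(), which is what
produces the wakeup lock --> p->pi_lock ordering.
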
> For the schedule out path, the callchain is always (for all intents and
> purposes; if the kernel is preemptible, kvm_sched_out() can be called from
> something other than schedule(), but the beginning of the callchain will
> be the same point in vcpu_block()):
> 
>   vcpu_block()
>   |
>   --> schedule()
>       |
>       --> kvm_sched_out()
>           |
>           --> vmx_vcpu_put()
>               |
>               --> vmx_vcpu_pi_put()
>                   |
>                   --> pi_enable_wakeup_handler()
> 
> and the lock order is:
> 
>   &rq->__lock --> &per_cpu(wakeup_vcpus_on_cpu_lock, cpu)
> 
> I.e. lockdep sees AB+BC ordering for schedule out, and CA ordering for
> wakeup, and complains about the A=>C versus C=>A inversion.  In practice,
> deadlock can't occur between schedule out and the wakeup handler as they
> are mutually exclusive.  The entirety of the schedule out code that runs
> with the problematic scheduler locks held does so with IRQs disabled,
> i.e. can't run concurrently with the wakeup handler.
> 
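The p->pi_lock --> rq->__lock edge in the chain, for what it's worth, comes
from the scheduler core itself; try_to_wake_up() is roughly (condensed from
kernel/sched/core.c):

  int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
  {
          ...
          raw_spin_lock_irqsave(&p->pi_lock, flags);  /* A: p->pi_lock */
          ...
          ttwu_queue(p, cpu, wake_flags);     /* eventually takes rq->__lock (B) */
          ...
  }

so lockdep already knows the A --> B ordering before KVM takes any lock at all.
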
> Use a subclass instead of disabling lockdep entirely, and tell lockdep that
Paolo initially suggested using a subclass.  Do you think it's worth
adding his Suggested-by tag?

BTW: is it necessary to state the subclass assignment explicitly in the
patch msg? e.g.,

wakeup handler: subclass 0
sched_out: subclass 1
sched_in: subclasses 0 and 1
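
i.e., in code terms (just restating the hunks below, with the patch's names):

  /* wakeup handler, IRQ context: plain acquire, subclass 0 */
  raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, cpu));

  /* pi_enable_wakeup_handler(), sched_out: subclass 1 */
  raw_spin_lock_nested(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu),
                       PI_LOCK_SCHED_OUT);

  /* vmx_vcpu_pi_load(), sched_in: actually takes subclass 0, and
   * additionally tells lockdep that subclass 1 is held, since it can
   * race with both of the above.
   */
  raw_spin_lock(spinlock);
  spin_acquire(&spinlock->dep_map, PI_LOCK_SCHED_OUT, 0, _RET_IP_);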

Aside from the minor nits, LGTM!
Thanks for polishing the patch and helping with the msg/comments :)

> both subclasses are being acquired when loading a vCPU, as the sched_out
> and sched_in paths are NOT mutually exclusive, e.g.
> 
>       CPU 0                 CPU 1
>   ---------------     ---------------
>   vCPU0 sched_out
>   vCPU1 sched_in
>   vCPU1 sched_out      vCPU0 sched_in
> 
> where vCPU0's sched_in may race with vCPU1's sched_out on CPU 0's wakeup
> list+lock.
>
> Signed-off-by: Yan Zhao <yan.y.zhao@...el.com>
> Signed-off-by: Sean Christopherson <seanjc@...gle.com>
> ---
>  arch/x86/kvm/vmx/posted_intr.c | 32 +++++++++++++++++++++++++++++---
>  1 file changed, 29 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
> index 840d435229a8..51116fe69a50 100644
> --- a/arch/x86/kvm/vmx/posted_intr.c
> +++ b/arch/x86/kvm/vmx/posted_intr.c
> @@ -31,6 +31,8 @@ static DEFINE_PER_CPU(struct list_head, wakeup_vcpus_on_cpu);
>   */
>  static DEFINE_PER_CPU(raw_spinlock_t, wakeup_vcpus_on_cpu_lock);
>  
> +#define PI_LOCK_SCHED_OUT SINGLE_DEPTH_NESTING
> +
>  static inline struct pi_desc *vcpu_to_pi_desc(struct kvm_vcpu *vcpu)
>  {
>  	return &(to_vmx(vcpu)->pi_desc);
> @@ -89,9 +91,20 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
>  	 * current pCPU if the task was migrated.
>  	 */
>  	if (pi_desc->nv == POSTED_INTR_WAKEUP_VECTOR) {
> -		raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
> +		raw_spinlock_t *spinlock = &per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu);
> +
> +		/*
> +		 * In addition to taking the wakeup lock for the regular/IRQ
> +		 * context, tell lockdep it is being taken for the "sched out"
> +		 * context as well.  vCPU loads happen in task context, and
> +		 * this is taking the lock of the *previous* CPU, i.e. can race
> +		 * with both the scheduler and the wakeup handler.
> +		 */
> +		raw_spin_lock(spinlock);
> +		spin_acquire(&spinlock->dep_map, PI_LOCK_SCHED_OUT, 0, _RET_IP_);
>  		list_del(&vmx->pi_wakeup_list);
> -		raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
> +		spin_release(&spinlock->dep_map, _RET_IP_);
> +		raw_spin_unlock(spinlock);
>  	}
>  
>  	dest = cpu_physical_id(cpu);
> @@ -151,7 +164,20 @@ static void pi_enable_wakeup_handler(struct kvm_vcpu *vcpu)
>  
>  	lockdep_assert_irqs_disabled();
>  
> -	raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
> +	/*
> +	 * Acquire the wakeup lock using the "sched out" context to work around
> +	 * a lockdep false positive.  When this is called, schedule() holds
> +	 * various per-CPU scheduler locks.  When the wakeup handler runs, it
> +	 * holds this CPU's wakeup lock while calling try_to_wake_up(), which
> +	 * can eventually take the aforementioned scheduler locks, which
> +	 * causes lockdep to assume a deadlock is possible.
> +	 *
> +	 * Deadlock can't actually occur because IRQs are disabled for the
> +	 * entirety of the sched_out critical section, i.e. the wakeup handler
> +	 * can't run while the scheduler locks are held.
> +	 */
> +	raw_spin_lock_nested(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu),
> +			     PI_LOCK_SCHED_OUT);
>  	list_add_tail(&vmx->pi_wakeup_list,
>  		      &per_cpu(wakeup_vcpus_on_cpu, vcpu->cpu));
>  	raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
> -- 
> 2.49.0.472.ge94155a9ec-goog
> 
